Features and Feature Sets

Tiberias analyzes your text classes by comparing and measuring linguistic data in the texts that you select.

Every individual linguistic datum that Tiberias measures and weighs is called a feature.

These linguistic data, or features, fall into five categories of linguistic markers, or feature sets.

For example, תורה is a feature in the feature set Words. Adjective is a feature in the feature set Morphology.

Feature Set Description
Words

The unpointed, unvocalized forms of the words (that is, the spelling of each word with letters alone), much as they would appear in a Torah scroll.

When this feature set is selected, דוד and דויד, for example, appear as two distinct features.

Disambiguated Words

In biblical Hebrew, words that are identical in meaning can be spelled differently when fully vocalized. Examples include plene (ketiv maleh) vs. deficient (ketiv chaser) spellings of words (e.g. אנכי and אנוכי), and words such as the name דוד, which sometimes begins with a dalet forte (Heb. dagesh) and sometimes does not.

This feature set combines such variant forms into a single feature when displaying words.

When Tiberias displays a disambiguated word, it displays all the forms of the word that have been combined. Generally, Tiberias relies upon the linguistic tagging employed by the ETCBC IV database. The tagging used to create the Disambiguated Words feature set, however, is an innovation of Tiberias’ staff of developers.

Lexemes

This option combines as a single feature all conjugations of a given verb (e.g. לקח, ויקחו) and all forms of a given noun (e.g. איש, אנשים, לאנשים).

Syntax

There are several dozen types of syntax clauses. Every verse in the ETCBC IV database is tagged according to its constituent syntax clauses. These are displayed in abbreviated form.

To see examples of a syntax clause that is displayed as a feature, click the syntax clause. All examples of that syntax clause in the text classes you have defined are displayed.

Morphology

Every morphological feature of a verse is tagged by the ETCBC IV database. These are displayed in abbreviated form.

To see examples of a morphological feature, click the morphological feature. All examples of that morphological feature in the text classes you have defined are displayed.
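The difference between the three lexical feature sets can be illustrated with a small sketch. This is not Tiberias's implementation: the lookup tables `DISAMBIGUATED` and `LEXEME` and the function `count_features` are hypothetical stand-ins for the linguistic tagging described above, filled in by hand with the examples from this page.

```python
from collections import Counter

# Hypothetical lookup tables. In Tiberias these groupings come from
# linguistic tagging, not from hand-written dictionaries.
DISAMBIGUATED = {"אנכי": "אנכי", "אנוכי": "אנכי"}
LEXEME = {
    "איש": "איש", "אנשים": "איש", "לאנשים": "איש",
    "לקח": "לקח", "ויקחו": "לקח",
}

def count_features(tokens, mapping=None):
    """Count features, merging tokens that the mapping treats as one feature."""
    mapping = mapping or {}
    return Counter(mapping.get(token, token) for token in tokens)

tokens = ["אנכי", "אנוכי", "איש", "אנשים", "לאנשים", "לקח", "ויקחו"]

words = count_features(tokens)                         # every spelling is its own feature
disambiguated = count_features(tokens, DISAMBIGUATED)  # spelling variants are merged
lexemes = count_features(tokens, LEXEME)               # inflected forms are merged
```

With these seven tokens, the Words view keeps seven distinct features, the disambiguated view merges the two spellings of אנכי into one, and the lexeme view merges the forms of איש and of לקח.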

How to determine which feature sets to select

Selecting and combining feature sets lets you tailor the design of your experiment finely. Depending on the needs of your experiment, you will want to employ the feature sets in different combinations.

Your aim is to select feature sets that will yield as high an expected accuracy percentage as possible in the analysis of the classes. This will enable Tiberias to classify your test text with the greatest accuracy and the highest confidence.

Sometimes a single feature set yields a very high expected accuracy. Sometimes the best result comes from combining two or more feature sets.

Three of the feature sets concern lexical features: Words, Disambiguated Words, and Lexemes. As a rule, you will never want to select all three, or even any two of them, because they contain much overlapping data and many overlapping words. Selecting them together would skew the results of your experiment. Instead, select one of them as your base. Disambiguated Words is the default lexical feature set established by the program.
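To see why overlapping lexical feature sets skew results, consider the following sketch. It is an illustration only and assumes a simple count-based model; the table `LEXEME_OF` and the function `extract` are hypothetical and do not reflect how Tiberias extracts features internally.

```python
from collections import Counter

# Hypothetical mapping from word forms to their lexeme.
LEXEME_OF = {"אמר": "אמר", "ויאמר": "אמר", "לאמר": "אמר"}

def extract(tokens, feature_sets):
    """Count one feature per token for each selected lexical feature set."""
    counts = Counter()
    for token in tokens:
        if "Words" in feature_sets:
            counts[("Words", token)] += 1
        if "Lexemes" in feature_sets:
            counts[("Lexemes", LEXEME_OF.get(token, token))] += 1
    return counts

tokens = ["ויאמר", "לאמר", "ויאמר"]

one_set = extract(tokens, {"Lexemes"})
two_sets = extract(tokens, {"Words", "Lexemes"})

total_one = sum(one_set.values())   # each token contributes once
total_two = sum(two_sets.values())  # each token contributes twice
```

With two lexical feature sets selected, every occurrence is counted once as a word and once as a lexeme, so the same evidence is weighed twice.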

Every change in the constellation of feature sets and features you select affects the expected accuracy of the algorithm in its analysis of the test text. Often, even removing a single feature changes the expected accuracy. This is why every change to the selected feature sets or features ends with recalculating results: after modifying your selections, click Update on the View/Edit Features page to save your changes and recalculate the expected accuracy.

Example: How to combine lexical feature sets

Let’s say that you elect to stay with the default setting, Disambiguated Words, but your experiment requires combining all conjugations of the verb אמר. To do this, use the Lexemes feature set for that verb only:

  1. On the Experiment page > Analysis of Classes tab, click View/Edit Features.
  2. Under Active Categories, clear all category options and select only Lexemes.
  3. Clear the Feature box. This unselects all of the displayed lexemes.
  4. In the Search box, type אמר to access the lexeme אמר.
  5. Select the lexeme אמר.
  6. From the Active Categories bar, select Disambiguated Words. Because you have the term אמר in the search field, the display now shows all forms of the verb אמר from the disambiguated words as well as the lexeme אמר.
  7. Clear all of the forms of the verb אמר that are marked as disambiguated words, leaving only the lexeme אמר selected.
  8. To save your selections and recalculate results, click Update.
  9. On the Analysis of Classes page, click View/Edit Features again. You now see that all of the disambiguated words and all of the lexemes are displayed, but that all of the lexemes are unselected, except for the lexeme אמר. All of the disambiguated words appear as selected, with the exception of the features that are forms of the verb אמר.
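The selection that results from the steps above can be expressed as two sets. The sketch below is only a model of that end state, not of the Tiberias interface; the function `combine_selection` and the sample data are hypothetical.

```python
def combine_selection(disambiguated_words, lexemes, lexeme_forms, target):
    """Model the end state: the target lexeme replaces its own word forms.

    disambiguated_words: all features in the Disambiguated Words set
    lexemes: all features in the Lexemes set
    lexeme_forms: maps a lexeme to the word forms it covers (hypothetical data)
    target: the lexeme to use in place of its individual word forms
    """
    forms = lexeme_forms[target]
    # Every disambiguated word stays selected unless it is a form of the target.
    selected_words = {w for w in disambiguated_words if w not in forms}
    # Every lexeme stays unselected except the target.
    selected_lexemes = {target} if target in lexemes else set()
    return selected_words, selected_lexemes

disambiguated_words = {"אמר", "ויאמר", "לאמר", "איש", "תורה"}
lexemes = {"אמר", "איש", "תורה"}
lexeme_forms = {"אמר": {"אמר", "ויאמר", "לאמר"}}

selected_words, selected_lexemes = combine_selection(
    disambiguated_words, lexemes, lexeme_forms, "אמר"
)
```

Because the forms of אמר are removed from the word selection before the lexeme is added, no occurrence of the verb is counted under both feature sets.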

 

To view and edit the features and feature sets in your experiment, see View and Edit Features.