DECIPHERMENT ALGORITHMS

by B.V. Sukhotin

Translated and adapted by Jacques Guy

"Our knowledge of language X" is twofold. It is the knowledge we use to translate the signs of X into their respective referents in the physical and ideal worlds. It is also the knowledge we use to identify some properties of the signs of X without resorting to their referents. It is for instance the knowledge of English by which, presented with the sentence "I was charded by its unflemming thurn", we identify "charded" as the past participle of "to chard", "unflemming" as the opposite of "flemming", itself derived from "to flem", and "thurn" as a substantive. It is also the knowledge by which we recognize three syllables and ten letters in "unflemming". In general terms, it is the knowledge by which the phenomena of X are identified.

Our knowledge of language X in either acceptation, however, is useless for dealing with language Y, unless X and Y have at least some features in common. Thus a knowledge of Dutch is useful for understanding Hawaiian only insofar as they have some, if few, features in common: both use the Roman alphabet, and common letters are pronounced rather alike. On the other hand, a knowledge of English is utterly useless for Chinese: armed with it we can tell what "goes on" in an English sentence, but nothing in a Chinese sentence -- not even where words end and start (there are no spaces to indicate breaks between words), not even whether it is language at all.

An algorithm which identifies a linguistic phenomenon provides a convenient definition of that phenomenon, in the general form: "Phenomenon P is the one identified by algorithm A when applied to any text in any language". We shall call such an algorithm a "decipherment algorithm".
Definitions of linguistic phenomena by their decipherment algorithms have very desirable properties:

1) the definitions, valid for all languages, are UNIVERSAL,
2) being algorithms, they are PRECISE,
3) finally, they are EFFICACIOUS, for they allow those phenomena to be recognized without fail if they occur in any particular language.

The importance of decipherment algorithms for linguistic theory, then, can hardly be overemphasized -- if they exist. If decipherment algorithms do exist, some of their properties can be inferred. If an algorithm B needs information produced by another algorithm A, it is dependent upon A, and the phenomenon P(B) which it defines cannot be identified until the phenomenon P(A) defined by A is identified. For a set of decipherment algorithms to be efficacious, its elements must be efficacious. If an algorithm A depends for all its information upon another algorithm B, which in turn needs all its information from A, then the phenomena which they define cannot be identified. Therefore A and B are not efficacious and must be rejected. For a set of decipherment algorithms to be efficacious, it must be at least effective, and therefore finite. Therefore there must exist at least one algorithm in the set which needs no input from any other.

One can imagine the following hierarchy of decipherment algorithms in an efficacious set applying to the written word:

1) an algorithm to identify the individual symbols on the printed page,
2) an algorithm to classify those symbols, for instance grouping all occurrences of 'a' in the same category, regardless of their position in the text and of non-distinctive variations (e.g. the fonts used),
3) an algorithm to identify the boundaries of the words,
4) an algorithm to classify those words into grammatical categories,
5) an algorithm to parse the text on the basis of the grammatical categories to which its words belong,

and so on.
Although the feasibility of algorithm (1) cannot be doubted, none has been proposed to date. The feasibility of algorithms (3) to (5), on the other hand, can be doubted, and this is why some algorithms will be elaborated and discussed here, which should go some way towards dispelling doubts about the feasibility of high-order decipherment algorithms.

DISTINCTIVE FEATURES

It has been seen that the definition of a linguistic phenomenon and the decipherment algorithm by which it is identified are one and the same. The list of the distinctive features of a linguistic phenomenon (that is, the properties of this phenomenon which enable us to identify its occurrences in a text) is also a definition of this phenomenon. But whereas the decipherment algorithm that identifies a given phenomenon is not necessarily unique, the list of its distinctive features is. In this respect, a definition of a linguistic phenomenon consisting of the list of its distinctive features is more general than one consisting of its decipherment algorithm (or one of its decipherment algorithms). Lists of distinctive features, however, generally do not have the desirable properties of decipherment algorithms, least of all efficacy. There are two kinds of distinctive features: text-dependent and text-independent.

THE SET OF ACCEPTABLE SOLUTIONS

A list of text-independent features specifies a set of possible interpretations of a text, ANY text. Those possible interpretations constitute the set of ACCEPTABLE SOLUTIONS to the problem of understanding the text. Text-independent features are therefore definable without computation on any text.

THE OBJECTIVE FUNCTION

How well each of those acceptable solutions fits any particular text is determined by properties of this text, that is, by text-dependent features. "How well" is then an objective function on the set of acceptable solutions.
The set of acceptable solutions and the objective function must be defined so that the phenomenon which they identify corresponds to an element of the set of acceptable solutions for which the objective function is minimum or maximum. The objective function is computed from the text analyzed. Definitions of linguistic phenomena by distinctive features therefore, at this stage, comprise two parts:

1) a set of acceptable solutions,
2) an objective function.

The two parts must provide enough information for devising a procedure which identifies the acceptable solution(s) for which the objective function reaches an extreme. This identification procedure is the decipherment algorithm, which constitutes the third part of the definition of the phenomenon.

EQUIVALENCE AND CORRECTNESS: MATHEMATICAL VS LINGUISTIC

Given a set of acceptable solutions and an objective function, a large number of algorithms may be devised which find an acceptable solution for which the objective function reaches an extreme. Such algorithms are MATHEMATICALLY EQUIVALENT. Two objective functions that reach the same extreme for the same element of a given set of acceptable solutions are LINGUISTICALLY EQUIVALENT. A definition of a linguistic phenomenon is MATHEMATICALLY CORRECT if its decipherment algorithm yields the absolute extreme of its objective function calculated over its set of acceptable solutions. A definition of a linguistic phenomenon is LINGUISTICALLY CORRECT if its set of acceptable solutions and its objective function are such that the (intuitively) right phenomenon is effectively identified. Those distinctions provide a basis for collaboration between mathematicians and linguists. It must be emphasized that almost all the algorithms to which we have had access are mathematically incorrect: they cannot be guaranteed to find the extremes of their objective functions.
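The three-part scheme -- a set of acceptable solutions, an objective function over them, and a search procedure -- can be sketched abstractly. The following is only an illustrative skeleton with a toy objective of my own invention, not any algorithm from the text:

```python
# A hypothetical skeleton of a three-part definition: the decipherment
# algorithm searches the acceptable solutions for an extreme of the
# objective function computed on the text.

def decipher(text, acceptable_solutions, objective):
    """Return the acceptable solution(s) minimizing the objective."""
    best = min(objective(s, text) for s in acceptable_solutions)
    return [s for s in acceptable_solutions if objective(s, text) == best]

# Toy illustration: the "solutions" are candidate segment lengths, and
# the objective measures how badly each length divides a 12-letter text.
text = "abcdefghijkl"
solutions = [2, 3, 4, 5]
objective = lambda n, t: len(t) % n   # letters left over
print(decipher(text, solutions, objective))  # -> [2, 3, 4]
```

An exhaustive search like this is mathematically correct in the article's sense; the procedures discussed later trade that guarantee away for tractability.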
The results obtained are not linguistically perfect, which should be expected given the mathematical imperfection of the algorithms, even though some are of surprisingly high quality.

IDENTIFICATION OF VOWELS AND CONSONANTS

We assume that the alphabet of a given language is known, that is, that we are able to identify all occurrences of a given letter and to establish their inventory. We also assume that the text is written in a phonemic, not syllabic or hieroglyphic, system. It remains to be found which letters correspond to vowels, and which to consonants. The algorithm works with sounds as well as with letters, so that the letter/sound dichotomy is not relevant here. Therefore, the definition of vowels and consonants will be valid for both the spoken and the written language. Let us first describe the set of acceptable solutions.

SET OF ACCEPTABLE SOLUTIONS

We consider vowels and consonants to be obtained by a partition of the alphabet into two subsets: V, constituted of vowels, and C, constituted of consonants. The intersection of V and C is empty, and their union is the alphabet A. In some languages, however, the intersection of V and C would not be empty, since a letter can be a vowel as well as a consonant. Thus in Latin i and u are sometimes vowels, sometimes consonants, e.g.:

   iam que opus exegi

In such cases the model proposed seems to contradict intuition; but constructing a decipherment algorithm for the more general model is far more difficult. The information according to which vowels and consonants constitute a partition of an alphabet is of course not enough. If an alphabet consists of n letters, there are 2^n possible partitions of this alphabet -- in the case of the 24 letters of the Latin alphabet, some 17 million solutions. Even so, such a definition does restrict to a certain extent the notion of vowel and consonant.
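Under this definition, the set of acceptable solutions can be enumerated directly for a small alphabet. A minimal sketch (function name hypothetical), confirming the 2^n count:

```python
# Every subset V of the alphabet, with C as its complement, is one
# acceptable solution, so an alphabet of n letters yields 2**n
# candidate partitions.

def partitions(alphabet):
    n = len(alphabet)
    for mask in range(2 ** n):
        vowels = {alphabet[i] for i in range(n) if mask >> i & 1}
        consonants = set(alphabet) - vowels
        yield vowels, consonants

print(len(list(partitions("abcd"))))  # -> 16, i.e. 2**4
```

For 24 letters the same enumeration yields 2^24 = 16,777,216 partitions, the "some 17 million solutions" of the text.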
OBJECTIVE FUNCTION

It is obvious that, in any text, it is less probable that a vowel will appear after another vowel than after a consonant, just as it is less probable that a consonant will appear after another consonant than after a vowel. On the contrary, it is quite frequent that a vowel and a consonant should appear next to each other. If we consider a random partition of an alphabet, it is unlikely that it should have such a property, and an arbitrarily chosen partition loses this property the further it diverges from the correct partition.

We shall describe the combinatorial properties of letters using a square matrix the rows and columns of which correspond to the letters of the alphabet. In the cell at the intersection of row x and column y we shall record the number of times a pair formed of letters x and y appears (without taking into account the order of occurrence of these letters) in a given text. Consider a given partition of an alphabet into two subsets. Move all the rows and columns corresponding to vowels into the "northwest corner" of the matrix and draw the limits separating the vowels from the consonants. We have a matrix of the type:

                  vowels    consonants
                .-----------------------.
    vowels      |     1     |     2     |
                |-----------+-----------|
    consonants  |     4     |     3     |
                '-----------------------'

Cell #1 will contain the number of combinations of vowels with vowels, cell #3 the same information for consonants, and the number of combinations between consonants and vowels will be found in cells #2 and #4. If the partition is close to the correct partition, then, according to our hypothesis, the numbers in 1 and 3 will be small, and those in 2 and 4 will be large. The worth of a partition can then be estimated, for instance, from the sum of the numbers in 1 and 3 (the sum of the whole matrix being constant and equal to twice the number of the letters of the whole text).
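The counting scheme just described can be sketched as follows (the sample string and the function names are mine, for illustration). For "banana" the correct partition, with {a} as the vowels, leaves the homogeneous cells 1 and 3 empty:

```python
from collections import Counter

def pair_counts(text):
    """Count unordered pairs of adjacent letters: the matrix [f(x,y)]."""
    pairs = Counter()
    for x, y in zip(text, text[1:]):
        pairs[frozenset((x, y))] += 1  # order of x and y is ignored
    return pairs

def homogeneous_sum(pairs, vowels):
    """Sum of cells 1 (vowel-vowel) and 3 (consonant-consonant)."""
    total = 0
    for pair, count in pairs.items():
        letters = set(pair)
        all_vowels = letters <= set(vowels)
        all_consonants = not (letters & set(vowels))
        if all_vowels or all_consonants:
            total += count
    return total

pairs = pair_counts("banana")
print(homogeneous_sum(pairs, vowels="a"))  # -> 0 (correct partition)
print(homogeneous_sum(pairs, vowels="b"))  # -> 4 ("an" pairs now homogeneous)
```

Because only unordered adjacent pairs are counted once here, the objective differs from the symmetric-matrix sum by a constant factor, which does not affect where the minimum falls.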
The smaller the sum of 1 and 3, the better the partition; the minimum of the sum corresponds to the best partition. The process of defining vowels and consonants is now complete: we can say that vowels and consonants form the subsets of a partition of the alphabet into two classes such that the number of homogeneous pairs in a given text is minimum. Which subset corresponds to vowels and which to consonants remains to be determined. Now, it is extremely probable that the most frequent letter is a vowel; we can therefore assume that the subset which contains this letter corresponds to the vowels.

THE ALGORITHM

The simplest procedure using the distinctive features mentioned is trivial. It is sufficient to calculate the matrix of absolute frequencies of digraphs, to examine each possible solution (that is, each partition into two subsets), and to calculate for each the value of the objective function. The partition which corresponds to the minimum of the function is then chosen as the best, and the procedure terminates (should more than one partition correspond to this criterion, all are retained). Unfortunately, such an algorithm requires considerable calculations which are far beyond the capacity of current computers. We propose therefore a simpler procedure which is practically equivalent to the previous one. It should be borne in mind that this procedure cannot guarantee that the minimum will be found in every case, but the probability of error seems fairly low. We shall represent this procedure in the form of a list of instructions:

(1) Build for a given text the matrix [f(x,y)] where f(x,y) represents the absolute frequency of the pairs consisting of x and y, whatever the order of occurrence of x and y.

(2) Erase the numbers of the form f(x,x). (This operation is justified by the fact that the sum of these numbers is constant whatever the partition, and is therefore not relevant to the calculation of the minimum of the objective function.
These numbers would even have a disturbing effect owing to certain properties of the procedure.)

(3) For each letter a, calculate the sum f(a,a[1]) + f(a,a[2]) + ... + f(a,a[n]) over all the letters a[1], ..., a[n] of the alphabet.

IDENTIFICATION OF MORPHEMES

(* The algorithm effectively partitions a text into its morph components, not morphemes, but it will turn out to be far more powerful than that: it can be expected to do a morphological analysis of an unknown text. "Morph" should be read for "morpheme" whenever the latter occurs in the text *)

We will assume that the text to be partitioned has no spaces to separate words, and we also assume that the spelling used is phonemic. (* The first assumption makes the problem more general and somewhat more difficult. There is no need for the other: the algorithm will apply to strings of ideograms as well as to strings of phonemes equally successfully *) As in the case of the consonant/vowel algorithm, we start by defining the set of acceptable solutions.

SET OF ACCEPTABLE SOLUTIONS

A partition of a text into disjoint strings is an acceptable solution. A few definitions need to be introduced before going further. A text is a mapping of an alphabet A (a set of letters) onto a set of locations L. In other words, a text is a set of 2-tuples of the type (a,p) where a is a letter and p its position in the text. A string Str(s,t) (where s and t are integer numbers and s >= 0) is a set of 2-tuples of the type {(a[j],p[j])} in which s < p[j] <= t. To each string Str[i](s,t) one can associate a string Str[i+1](t,v). Str(s,t) and Str(v,w) have a non-empty intersection if s
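The definitions above can be sketched concretely (the representation and the function names are my own choices, for illustration): a text as a set of (letter, position) pairs, and Str(s,t) as the pairs whose position p satisfies s < p <= t, so that consecutive strings Str(s,t) and Str(t,v) share only the boundary index and are disjoint as sets:

```python
# A text is a set of 2-tuples (a, p): letter a at position p,
# positions counted from 1.

def as_text(s):
    return {(letter, pos) for pos, letter in enumerate(s, start=1)}

def Str(text, s, t):
    """The string Str(s,t): all pairs whose position p satisfies s < p <= t."""
    return {(a, p) for (a, p) in text if s < p <= t}

text = as_text("undoing")
print(sorted(Str(text, 0, 2)))  # -> [('n', 2), ('u', 1)], the segment "un"
# Consecutive strings share the boundary t = 2 and do not overlap:
assert Str(text, 0, 2) & Str(text, 2, 6) == set()
```

Representing strings as sets of (letter, position) pairs makes "disjoint" and "non-empty intersection" in the text literal set operations.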