Informatics and Russian Transliteration

David J. Birnbaum, University of Pittsburgh

Professional Russianists traditionally employ either the United States Library of Congress system (e.g., "Dostoevskii") or the "scholarly" or "linguistic" system (e.g., "Dostoevskij") for transliterating from Cyrillic to Latin (romanizing) in scholarly publications, and may also make use of a popular transliteration system (e.g., "Dostoevsky"), at least in undergraduate teaching. Those of us who work with transliteration have all encountered ambiguities and insufficiencies in these and other systems on an ad hoc basis. Nevertheless, neither Slavists nor information scientists have subjected the Library of Congress, scholarly, and other English-oriented romanization systems (e.g., US Board on Geographic Names, British Museum, Jakobson 1965) to scrutiny from an information science perspective. The best-known study of Russian romanization, Shaw 1967, makes important contributions but pursues different goals from the present study, concentrating more on providing a practical introduction to established systems than on identifying criteria for their evaluation.

The proposed paper addresses the need for such a study of Russian romanization in the following ways:

1) It examines the goals of transliteration with attention to different audiences (e.g., Slavic scholars, undergraduate students, popular audiences) and functions (e.g., citation, library catalogs, sorting multilingual text).

2) It identifies at least two types of requirements and desiderata for transliteration systems: a) those that depend on audience and function (such as the greater importance of signaling pronunciation to nonspecialists than to specialists) and b) those that are relevant to any system (such as uniqueness and reversibility of representations).

3) It criticizes several transliteration standards in wide use, as well as innovative proposals not in wide use, from the perspective of the audiences, functions, needs, and desiderata identified previously.
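
The uniqueness requirement named in item 2 can be checked mechanically against a word list. The following is a minimal sketch, not the paper's method: the table `LC_ASCII` is a simplified, diacritic-free approximation of the Library of Congress values, and the function names and the sample word list are assumptions made for illustration.

```python
from collections import defaultdict

# Simplified, diacritic-free approximation of Library of Congress romanization.
# Assumption for illustration: the full system uses diacritics and ligatures.
LC_ASCII = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": '"', "ы": "y", "ь": "'", "э": "e", "ю": "iu", "я": "ia",
}


def romanize(word, table=LC_ASCII):
    """Romanize one lowercase-folded word letter by letter."""
    return "".join(table.get(ch, ch) for ch in word.lower())


def find_collisions(words, table=LC_ASCII):
    """Group words by romanized form and keep only forms shared by 2+ words."""
    by_roman = defaultdict(set)
    for word in words:
        by_roman[romanize(word, table)].add(word)
    return {roman: srcs for roman, srcs in by_roman.items() if len(srcs) > 1}


# Hypothetical three-word sample; two of the words collide.
collisions = find_collisions(["герой", "герои", "весна"])
for roman, sources in collisions.items():
    print(roman, sorted(sources))  # prints: geroi ['герои', 'герой']
```

A romanization is unique over a given word list exactly when `find_collisions` returns an empty dictionary; any non-empty result also shows that the mapping cannot be reversed without outside information.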

The study distinguishes at least three types of problem cases:

1) those that involve real and common impediments to the representation of information (e.g., Library of Congress "geroi" may be nominative singular 'hero' or nominative plural 'heroes'),

2) those that can be disambiguated through outside knowledge (e.g., "i" in the Library of Congress system is not ambiguous between consonant letters, since "i kratkoe" can never occur interconsonantally), and

3) those that pose real but nonetheless very uncommon problems (e.g., Library of Congress "shch" overwhelmingly represents a single Cyrillic "shch" letter, rather than a sequence of "sh" plus "ch", but not in "vesnushchatyi" 'freckled').
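
The three problem types can be illustrated by enumerating every Cyrillic spelling that could lie behind a romanized string. This is a hedged sketch under stated assumptions: the `REVERSE` table covers only the handful of diacritic-free Library of Congress values needed here, the names `parses` and `plausible` are hypothetical, and "mir" is an example chosen for illustration rather than one taken from the study.

```python
# Each Latin token maps to the Cyrillic letters it may stand for.
# Assumption: diacritic-free values, restricted to the letters used below.
REVERSE = {
    "shch": ["щ"], "sh": ["ш"], "ch": ["ч"],
    "g": ["г"], "e": ["е"], "r": ["р"], "o": ["о"],
    "m": ["м"], "i": ["и", "й"],
}
CONSONANTS = set("гмрчшщ")  # consonant letters present in REVERSE


def parses(roman):
    """Return every Cyrillic string that this table romanizes to `roman`."""
    if not roman:
        return [""]
    results = []
    for token, letters in REVERSE.items():
        if roman.startswith(token):
            for tail in parses(roman[len(token):]):
                results.extend(letter + tail for letter in letters)
    return results


def plausible(word):
    """Outside knowledge: reject й standing between two consonant letters."""
    for i in range(1, len(word) - 1):
        if (word[i] == "й" and word[i - 1] in CONSONANTS
                and word[i + 1] in CONSONANTS):
            return False
    return True


# Type 1: a real ambiguity that outside knowledge does not remove.
print(sorted(parses("geroi")))                    # ['герои', 'герой']
# Type 2: two raw parses, but only one survives the positional rule.
print([w for w in parses("mir") if plausible(w)])  # ['мир']
# Type 3: "shch" is one letter or a two-letter sequence.
print(parses("shch"))                              # ['щ', 'шч']
```

Frequency, which separates the second and third types from the first, is not modeled here; it would require counting parses over a corpus rather than over single strings.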

The significance of the proposed paper is twofold. First, it applies basic informatic principles to the specific problems of Cyrillic romanization, creating for the first time a formal framework for the evaluation of transliteration systems. Second, it provides a systematic critique of existing systems according to these principles, highlighting and contrasting their strengths and weaknesses with a clarity and specificity that have not previously been brought to bear on these issues.