opendatakosovo/cyrillic-transliteration

What is CyrTranslit?

A Python package for bi-directional transliteration of Cyrillic script to Latin script and vice versa.

By default, CyrTranslit transliterates Serbian. A language flag can be set to transliterate to and from Belarusian, Bulgarian, Greek, Montenegrin, Macedonian, Mongolian, Russian, Serbian, Tajik, and Ukrainian.

Note: Greek uses its own alphabet rather than Cyrillic. It is included due to user demand and shared transliteration needs.

What is transliteration?

Transliteration is the conversion of a text from one script to another. For instance, a Latin alphabet transliteration of the Serbian phrase "Мој ховеркрафт је пун јегуља" is "Moj hoverkraft je pun jegulja".
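
As a rough illustration of the idea, the sketch below transliterates the phrase above with a hand-written lookup table. The names `SR_CYR_TO_LAT` and `to_latin_sketch` are hypothetical, the table covers only the letters that occur in the phrase, and this is not CyrTranslit's actual implementation.

```python
# Partial Serbian Cyrillic-to-Latin table (assumption: only the letters
# needed for the example phrase are included).
SR_CYR_TO_LAT = {
    "М": "M", "о": "o", "ј": "j", "х": "h", "в": "v", "е": "e", "р": "r",
    "к": "k", "а": "a", "ф": "f", "т": "t", "п": "p", "у": "u", "н": "n",
    "г": "g",
    "љ": "lj",  # one Cyrillic letter can map to two Latin letters
}


def to_latin_sketch(text: str) -> str:
    """Replace each mapped character and pass everything else through unchanged."""
    return "".join(SR_CYR_TO_LAT.get(ch, ch) for ch in text)


print(to_latin_sketch("Мој ховеркрафт је пун јегуља"))  # Moj hoverkraft je pun jegulja
```

Going from Cyrillic to Latin is a simple per-character lookup. The reverse direction is harder, because a Latin pair such as "lj" has to be recognized as a single unit.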

Citation

A citation would be much appreciated if you use CyrTranslit in a research publication:

Georges Labrèche. (2025). CyrTranslit (1.2.0). Zenodo. https://doi.org/10.5281/zenodo.17663256

BibTeX entry:

@software{georges_labreche_nov2025,
  author       = {Georges Labrèche},
  title        = {CyrTranslit},
  month        = nov,
  year         = 2025,
  note         = {{A Python package for bi-directional 
                   transliteration of Cyrillic script to Latin script
                   and vice versa. Supports transliteration for Belarusian, 
                   Bulgarian, Greek, Montenegrin, Macedonian, Mongolian,
                   Russian, Serbian, Tajik, and Ukrainian.}},
  publisher    = {Zenodo},
  version      = {1.2.0},
  doi          = {10.5281/zenodo.17663256},
  url          = {https://doi.org/10.5281/zenodo.17663256}
}

Advancing research

CyrTranslit is actively used as a reliable tool to advance research! Publications from research projects that have relied on CyrTranslit span the following areas:

  • Text Normalization, Unicode Perturbations & Robustness
  • Low-Resource NLP & Machine Translation
  • Serbian Language NLP (Topic Modeling, Sentiment, Lexicons, QA, Abuse Detection)
  • NLP Applications for Society, Government, and Political Analysis
  • Engineering, Software Systems, and Backend Development
  • Proceedings, Collections, and Meta-Documents
  • Addresses, Geocoding, and NLP

How do I install this?

CyrTranslit is hosted on the Python Package Index (PyPI), so it can be installed using pip:

python3 -m pip install cyrtranslit         # latest version
python3 -m pip install cyrtranslit==1.2.0  # specific version
python3 -m pip install "cyrtranslit>=1.2.0"  # minimum version (quoted so the shell does not treat > as a redirect)

What languages are supported?

CyrTranslit currently supports bi-directional transliteration of Belarusian, Bulgarian, Greek, Montenegrin, Macedonian, Mongolian, Russian, Serbian, Tajik, and Ukrainian.

Language codes are a mix of ISO 639-1 language codes (bg, el, mk, mn, ru, sr) and ISO 3166-1 country codes (by, me, rs, tj, ua). For Serbian, both sr (ISO 639-1 language code) and rs (ISO 3166-1 country code) are accepted:

>>> import cyrtranslit
>>> cyrtranslit.supported()
['bg', 'by', 'el', 'me', 'mk', 'mn', 'rs', 'ru', 'sr', 'tj', 'ua']
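
Several of the codes listed above are country codes rather than ISO 639-1 language codes (for example `ua`, where ISO 639-1 uses `uk`). If your application receives ISO 639-1 codes, a small lookup can translate them before calling the library. The sketch below is a hypothetical helper and not part of CyrTranslit: `SUPPORTED` is copied from the output above, and `ISO_639_1_ALIASES` and `resolve_code` are names invented for this example.

```python
# Codes copied from the supported() output shown above. In real use this
# list could come from cyrtranslit.supported() instead.
SUPPORTED = ["bg", "by", "el", "me", "mk", "mn", "rs", "ru", "sr", "tj", "ua"]

# ISO 639-1 codes that differ from the codes CyrTranslit lists
# (Belarusian, Ukrainian, Tajik).
ISO_639_1_ALIASES = {"be": "by", "uk": "ua", "tg": "tj"}


def resolve_code(code: str) -> str:
    """Return a code from SUPPORTED for a user-supplied code, or raise ValueError."""
    normalized = code.strip().lower()
    normalized = ISO_639_1_ALIASES.get(normalized, normalized)
    if normalized not in SUPPORTED:
        raise ValueError(f"unsupported language code: {code!r}")
    return normalized


print(resolve_code("uk"))  # ua
print(resolve_code("SR"))  # sr
```

Failing early with a `ValueError` keeps the unsupported-code case explicit in the calling code.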

How do I use this?

CyrTranslit can be used both programmatically and via the command line interface.

Programmatically

Belarusian

>>> import cyrtranslit
>>> cyrtranslit.to_latin("Прывітанне, свет!", "by")
"Pryvitanne, svet!"
>>> cyrtranslit.to_cyrillic("Pryvitanne, svet!", "by")
"Прывітанне, свет!"

Bulgarian

>>> import cyrtranslit
>>> cyrtranslit.to_latin("Съединението прави силата!", "bg")
"Săedinenieto pravi silata!"
>>> cyrtranslit.to_cyrillic("Săedinenieto pravi silata!", "bg")
"Съединението прави силата!"

Greek

>>> import cyrtranslit
>>> cyrtranslit.to_latin("Το χόβερκραφτ μου είναι γεμάτο χέλια", "el")
"To choverkraft moy einai gemato chelia"
>>> cyrtranslit.to_cyrillic("To choverkraft moy einai gemato chelia", "el")
"Το χόβερκραφτ μου είναι γεμάτο χέλια"

Montenegrin

>>> import cyrtranslit
>>> cyrtranslit.to_latin("Република", "me")
"Republika"
>>> cyrtranslit.to_cyrillic("Republika", "me")
"Република"

Macedonian

>>> import cyrtranslit
>>> cyrtranslit.to_latin("Моето летачко возило е полно со јагули", "mk")
"Moeto letačko vozilo e polno so jaguli"
>>> cyrtranslit.to_cyrillic("Moeto letačko vozilo e polno so jaguli", "mk")
"Моето летачко возило е полно со јагули"

Mongolian

>>> import cyrtranslit
>>> cyrtranslit.to_latin("Амрагаа Сүнжидмаагаа гэсээр ирлээ дээ хө-хө-хө", "mn")
"Amragaa Sünjidmaagaa geseer irlee dee khö-khö-khö"
>>> cyrtranslit.to_cyrillic("Amragaa Sünjidmaagaa geseer irlee dee khö-khö-khö", "mn")
"Амрагаа Сүнжидмаагаа гэсээр ирлээ дээ хө-хө-хө"

Russian

>>> import cyrtranslit
>>> cyrtranslit.to_latin("Моё судно на воздушной подушке полно угрей", "ru")
"Moyo sudno na vozdushnoj podushke polno ugrej"
>>> cyrtranslit.to_cyrillic("Moyo sudno na vozdushnoj podushke polno ugrej", "ru")
"Моё судно на воздушной подушке полно угрей"

Serbian

>>> import cyrtranslit
>>> cyrtranslit.to_latin("Мој ховеркрафт је пун јегуља")
"Moj hoverkraft je pun jegulja"
>>> cyrtranslit.to_cyrillic("Moj hoverkraft je pun jegulja")
"Мој ховеркрафт је пун јегуља"

Tajik

>>> import cyrtranslit
>>> cyrtranslit.to_latin("Ман мактуб навишта истодам", "tj")
"Man maktub navišta istodam"
>>> cyrtranslit.to_cyrillic("Man maktub navišta istodam", "tj")
"Ман мактуб навишта истодам"

Ukrainian

>>> import cyrtranslit
>>> cyrtranslit.to_latin("Під лежачий камінь вода не тече", "ua")
"Pid ležačyj kamin' voda ne teče"
>>> cyrtranslit.to_cyrillic("Pid ležačyj kamin' voda ne teče", "ua")
"Під лежачий камінь вода не тече"

Accented Characters (Macedonian & Bulgarian)

CyrTranslit supports Cyrillic characters with grave accents used in Macedonian and Bulgarian for homograph disambiguation and stress marking. By default, accents are stripped during transliteration for cleaner output. Use the preserve_accents parameter to preserve them.

Supported Accented Characters

Macedonian:

  • Ѐ/ѐ (U+0400/U+0450) - Cyrillic IE with grave

    • Purpose: Distinguishes homographs (e.g., нѐ "us" vs не "no", сѐ "everything" vs се "reflexive pronoun")
    • Standard: ISO 9:1968/1995, adopted by Macedonian Academy of Arts and Sciences (1970)
  • Ѝ/ѝ (U+040D/U+045D) - Cyrillic I with grave

    • Purpose: Distinguishes homographs (e.g., ѝ "her" vs и "and")
    • Standard: ISO 9:1968/1995

Bulgarian:

  • Ѝ/ѝ (U+040D/U+045D) - Cyrillic I with grave
    • Purpose: Stress marking and homograph disambiguation (e.g., ѝ "her" vs и "and")
    • Standard: ISO 9:1995

Usage Examples

Default behavior (accents stripped):

>>> import cyrtranslit
>>> cyrtranslit.to_latin("ѝ је", "mk")
"i je"
>>> cyrtranslit.to_latin("нѐ сме", "mk")
"ne sme"
>>> cyrtranslit.to_cyrillic("i je", "mk")
"и је"

With accents preserved:

>>> import cyrtranslit
>>> cyrtranslit.to_latin("ѝ је", "mk", preserve_accents=True)
"ì je"
>>> cyrtranslit.to_latin("нѐ сме", "mk", preserve_accents=True)
"nè sme"
>>> cyrtranslit.to_cyrillic("ì je", "mk", preserve_accents=True)
"ѝ је"
>>> cyrtranslit.to_cyrillic("nè sme", "mk", preserve_accents=True)
"нѐ сме"

Command-line usage:

# Default (accents stripped)
$ echo "ѝ је" | cyrtranslit -l mk
i je

# Preserve accents
$ echo "ѝ је" | cyrtranslit -l mk --preserve-accents
ì je
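
To make "stripping" concrete, the sketch below removes grave accents with Unicode normalization from the standard library. This is an alternative technique shown only for illustration: according to the contribution notes in this README, CyrTranslit itself relies on dedicated accented mapping dictionaries. The function name `strip_grave` is hypothetical.

```python
import unicodedata

GRAVE = "\u0300"  # COMBINING GRAVE ACCENT


def strip_grave(text: str) -> str:
    """Remove grave accents while leaving other diacritics (e.g. the breve in й) intact."""
    decomposed = unicodedata.normalize("NFD", text)    # split base letters from accents
    without_grave = decomposed.replace(GRAVE, "")      # drop only the grave accent
    return unicodedata.normalize("NFC", without_grave) # recompose what remains


print(strip_grave("ѝ је"))    # и је
print(strip_grave("нѐ сме"))  # не сме
print(strip_grave("nè sme"))  # ne sme
```

Because only U+0300 is removed, the same function works on both the Cyrillic input and the Latin output.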

Command Line Interface

Sample command line call to transliterate a Russian text file:

$ cyrtranslit -l RU -i tests/ru.txt -o tests/output.txt

Use the -c argument to do the reverse, that is, to read Latin characters and output Cyrillic.

Use the -h argument for help.

You can also omit the input and output files and use standard input/output:

$ echo 'Мој ховеркрафт је пун јегуља' | cyrtranslit -l sr
Moj hoverkraft je pun jegulja
$ echo 'Moj hoverkraft je pun jegulja' | cyrtranslit -l sr -c
Мој ховеркрафт је пун јегуља

File Encodings

By default, input files are expected to be UTF-8. For files with different encodings, use the -e/--encoding parameter:

$ cyrtranslit -l BG -i file.txt -e windows-1251

If no encoding is specified and decoding fails with the default UTF-8, CyrTranslit automatically tries the following common Cyrillic encodings: windows-1251, iso-8859-5, koi8-r, and cp866.
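
The fallback described above can be sketched as follows. This is a minimal illustration under stated assumptions, not CyrTranslit's code: the name `decode_with_fallback` is hypothetical, and the order of the candidates is assumed to follow the order in which they are listed.

```python
from typing import Optional, Tuple

# Fallback order assumed to match the list in the paragraph above.
FALLBACK_ENCODINGS = ("windows-1251", "iso-8859-5", "koi8-r", "cp866")


def decode_with_fallback(data: bytes, encoding: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding_used), trying UTF-8 first when no encoding is given."""
    if encoding:
        # An explicit encoding is used as-is and errors are not masked.
        return data.decode(encoding), encoding
    for candidate in ("utf-8",) + FALLBACK_ENCODINGS:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode input with any known encoding")


raw = "Съединението прави силата!".encode("windows-1251")
text, used = decode_with_fallback(raw)
print(used)  # windows-1251
```

Single-byte encodings accept almost any byte sequence, so in this sketch the first fallback that decodes without error is not guaranteed to be the right one. Passing the encoding explicitly is the safer choice when it is known.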

You can also invoke the command line entry point from an interactive Python session, e.g.:

>>> import sys
>>> import cyrtranslit.cyrtranslit
>>> sys.argv.extend(['-l', 'UA'])
>>> sys.argv.extend(['-i', 'tests/ua.txt'])
>>> sys.argv.extend(['-o', 'tests/output.txt'])
>>> cyrtranslit.cyrtranslit.main()
>>> exit()

How can I contribute?

You can contribute by adding support for other Cyrillic script alphabets. To do so, follow these steps:

  1. Create a new transliteration mapping file in the mapping/ directory (using the language code as the filename, e.g., xx.py) and reference it in the TRANSLIT_DICT dictionary in mapping/__init__.py. If the language uses accented characters (like Macedonian and Bulgarian), create separate accented dictionaries (e.g., XX_CYR_TO_LAT_ACCENTED_DICT) following the pattern in mk.py or bg.py.
  2. Watch out for cases where two consecutive Latin alphabet letters are meant to transliterate into a single Cyrillic script letter. These cases need to be explicitly checked for inside the to_cyrillic() function in __init__.py.
  3. Add test cases inside of tests.py.
  4. Add test CLI input files in the tests directory.
  5. Update the documentation in the README.md.
  6. List yourself as one of the contributors.
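
The concern in step 2 can be illustrated with a small sketch that checks for two-letter matches before falling back to single letters. The tables and the function `to_cyrillic_sketch` are hypothetical and deliberately partial (lowercase Serbian only); the real checks live in the library's to_cyrillic() function and may be structured differently.

```python
# Two Latin letters that stand for a single Cyrillic letter.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}

# Partial single-letter table (assumption: only what the example needs).
SINGLE = {"a": "а", "e": "е", "g": "г", "j": "ј", "l": "л", "n": "н", "u": "у"}


def to_cyrillic_sketch(text: str) -> str:
    """Transliterate Latin to Cyrillic, preferring two-letter matches."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            # Consume both Latin letters and emit one Cyrillic letter.
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            ch = text[i]
            out.append(SINGLE.get(ch, ch))  # unmapped characters pass through
            i += 1
    return "".join(out)


print(to_cyrillic_sketch("jegulja"))  # јегуља
```

Without the two-letter check, "lj" would come out as two separate Cyrillic letters. Greedy matching has a limit of its own: it cannot tell a digraph from two independent letters that happen to be adjacent, so such words need special handling and test cases.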

Before tagging a release version and deploying to PyPI:

  1. Update the version and download_url properties in setup.py.
  2. Reserve a Zenodo DOI for the release and update this readme's Zenodo badge and citation instructions.

A big thank you to everyone who contributed!