Comparison of language identification models
Detecting the language of a text (often called language identification) is a common task when building machine learning systems. It takes text as input and predicts the language that text is written in.
It’s a very useful capability to have (e.g. for storing texts in a per-language Elasticsearch index), but picking the right library isn’t trivial at all.
This survey gives you up-to-date information on language identification libraries usable in production: accuracy, language coverage, speed, and memory consumption. Everything you need as an ML engineer to pick a library quickly. Oh, and a bonus: it’s reproducible and extensible, so you can run it on your own dataset.
- Why another survey?
- Benchmarked libraries
- Reported metrics
- Datasets
- Benchmark results
- Conclusion
- Reproducing the benchmark
- On performance measurement methodology
Why another survey?
Choosing a tool for detecting text language can be quite tricky. There are good surveys out there, but picking a tool is still difficult after reading them. Here is why I’ve found it difficult:
- Different models support different sets of languages, which makes comparing their numbers very difficult.
- Running a model in production often means juggling accuracy against runtime speed and memory constraints, which is rarely discussed in academic papers.
- There is no standardized dataset for comparison, and papers often compare against only one or two competitors, so the numbers are hard to compare.
- Model performance differs by language, but papers often publish only aggregated accuracy.
- Many methods never get a developer-friendly Python package that is easy to install and use.
- Surveys go too broad.
- Very few papers actually provide the code to verify results.
I’ll try to fill that gap and make it easier to choose a language identification Python package for your use case. More specifically, this survey:
- Benchmarks only packages that are practical to install and use.
- Runs benchmarks on several different datasets, all downloadable.
- Reports correctness metrics both in aggregate and per language.
- Reports wall-clock latency, throughput, and memory consumption.
- Does all of this in a way you can extend and reproduce on your own hardware (see the GH repo).
Benchmarked libraries
Langdetect
Langdetect (pypi) is a Python port of Nakatani Shuyo’s language-detection library. By now it represents an “old” technique for language identification, created (in 2010) well before the rise of neural networks in NLP. Despite its age, it claimed 99%+ accuracy on 49 supported languages when it was published. There is no paper about it, but there is a presentation about it.
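A minimal usage sketch (illustrative only, not taken from the benchmark code):

```python
# pip install langdetect
from langdetect import DetectorFactory, detect, detect_langs

# langdetect is non-deterministic by default; fixing the seed makes results stable.
DetectorFactory.seed = 0

print(detect("This is clearly an English sentence."))    # e.g. 'en'
print(detect_langs("Ceci est une phrase en français."))  # e.g. [fr:0.9999...]
```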
Langid
Langid (pypi) is a popular library coming from academia that nevertheless provides a really easy-to-use Python package. It’s designed to be fast (see results), to have minimal dependencies, and to be markup-agnostic, meaning it should perform reasonably well even when run on an entire XML or HTML document. It uses a multinomial naive Bayes learner. The details of the library are described in this paper.
The pretrained model supports 97 languages. Languages supported.
It’s possible, and easy, to train it on your own dataset.
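A minimal usage sketch (illustrative only, using the default pretrained model):

```python
# pip install langid
import langid

# classify() returns (language_code, score); by default the score is an
# unnormalized log-probability, so only its relative value is meaningful.
print(langid.classify("Questa è una frase in italiano."))  # e.g. ('it', -123.4)

# Optionally constrain predictions to a subset of languages.
langid.set_languages(["en", "it", "de"])
print(langid.classify("Das ist ein deutscher Satz."))
```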
pycld2
pycld2 (pypi) is a Python binding for Compact Language Detect 2 (CLD2), Google’s algorithm originally used in the Chrome browser. CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text, either plain text or HTML/XML. The design target is web pages of at least 200 characters (about two sentences); CLD2 is not designed to do well on very short text, lists of proper names, part numbers, etc. It’s highly optimized for space and speed.
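A minimal usage sketch (illustrative only, not taken from the benchmark code):

```python
# pip install pycld2
import pycld2 as cld2

text = "Este es un ejemplo de texto escrito en español."
is_reliable, bytes_found, details = cld2.detect(text)

# `details` holds up to three (language_name, language_code, percent, score) guesses.
print(is_reliable, details[0])
```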
Gcld3
gcld3 (pypi) is a Python binding to CLD3, Google’s neural network model for language identification. It evolved from, and replaced, the CLD2 algorithm in the Chrome browser. It supports 107 languages.
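A minimal usage sketch (illustrative only, not taken from the benchmark code):

```python
# pip install gcld3
import gcld3

# min_num_bytes / max_num_bytes bound how much of the input the detector inspects.
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)

result = detector.FindLanguage(text="Ceci est une phrase en français.")
print(result.language, result.probability, result.is_reliable)
```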
Fasttext
Fasttext (pypi) is a library by Facebook for efficient learning of word representations and sentence classification. It’s developed for production use cases, so runtime speed and memory constraints are first-class concerns. The library doesn’t ship a language identification model out of the box, but the project page offers two pretrained models for this task:
- lid.176.bin, which is faster and slightly more accurate, but has a file size of 126MB.
- lid.176.ftz, which is the compressed version of the model, with a file size of 917kB.
Languages supported. It’s possible to train the model on your own data.
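A minimal usage sketch for language identification with fasttext (illustrative only; the model file has to be downloaded from the fastText website first):

```python
# pip install fasttext
import fasttext

# Use lid.176.ftz instead for the compressed model.
model = fasttext.load_model("lid.176.bin")

# predict() returns a tuple of labels and probabilities; k controls how many guesses you get.
labels, probs = model.predict("Dette er en sætning på dansk.", k=1)
print(labels[0], probs[0])  # e.g. '__label__da', 0.99
```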
Reported metrics
For each model, I report aggregated accuracy over the whole dataset, as well as per-language accuracy, precision, recall, and F1 score.
Given that no classifier supports all languages in the first dataset, the aggregated accuracy metric is calculated only on the subset of the dataset written in languages the classifier supports. That affects only the first dataset, since the second and third datasets contain only languages supported by all classifiers.
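A minimal sketch of this kind of evaluation (illustrative only, not the benchmark’s exact code; it assumes lists of ISO language codes and uses scikit-learn):

```python
from sklearn.metrics import accuracy_score, classification_report

# Toy ground truth and predictions; in the benchmark these come from a dataset
# and from a library's predictions, respectively.
y_true = ["eng", "deu", "fra", "eng", "ita"]
y_pred = ["eng", "deu", "eng", "eng", "ita"]

# Restrict the evaluation to the classifier's supported languages first.
supported = {"eng", "deu", "fra", "ita"}  # placeholder; use the library's real list
pairs = [(t, p) for t, p in zip(y_true, y_pred) if t in supported]
y_true_s, y_pred_s = zip(*pairs)

print(accuracy_score(y_true_s, y_pred_s))                   # aggregated accuracy
print(classification_report(y_true_s, y_pred_s, digits=4))  # per-language precision/recall/F1
```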
Datasets
Tatoeba-sentences-2021-06-05
This dataset is a dump of Tatoeba sentences made on the 5th of June 2021.
Quick specs:
- Consists of 9,640,185 sentences in 399 languages.
- Sentences are very short: 35 characters on average (counted with Python’s len function).
- The top 4 languages are English, Russian, Italian, and Turkish. You can see more details about the languages in the dataset here (including a per-language breakdown).
Stats per language | Download script.
Tatoeba-sentences-2021-06-05-common-48
Given that comparing library accuracies on the previous dataset can be difficult, I’ve created a subset of it, limited to 48 languages supported by all libraries. That allows a direct comparison between libraries; a sketch of how such a subset can be built follows the links below.
Stats per language | Download script.
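A hypothetical sketch of how such a subset can be built (the file layout, column names, and language codes below are assumptions, not the repo’s actual code):

```python
import pandas as pd

# Per-library supported-language sets; placeholders here, use each library's real list.
supported = {
    "fasttext": {"eng", "rus", "ita", "tur"},
    "langdetect": {"eng", "rus", "tur"},
    "langid": {"eng", "rus", "ita", "tur"},
}
common_languages = set.intersection(*supported.values())

# Assumed layout: tab-separated rows of sentence id, language code, and text.
sentences = pd.read_csv("sentences.csv", sep="\t", names=["id", "lang", "text"])
subset = sentences[sentences["lang"].isin(common_languages)]
subset.to_csv("tatoeba-sentences-common-subset.csv", sep="\t", index=False)
```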
Open-subtitles-v2018-100k-per-lang
This dataset is a subset of the OpenSubtitles v2018 dataset, limited to the top 100k conversations per language and to languages supported by all libraries.
The dataset is used to make sure the fasttext library doesn’t have an unfair advantage, given that it was trained on a snapshot of the Tatoeba dataset.
Keep in mind that this dataset is quite messy. It has many short sequences (<20 characters) and sequences made of very few or no words.
Stats per language | Download script.
Benchmark results
On Tatoeba-sentences-2021-06-05
These are the results on the biggest dataset, containing 300+ languages. Be aware that aggregated accuracy is calculated only on each classifier’s supported subset, which means that a direct comparison between two classifiers can be comparing apples and oranges. Per-language metrics are linked for deeper analysis.
Library | Supported languages | # sentences (% in supported languages) | Aggregated accuracy | Per-language metrics |
---|---|---|---|---|
fasttext | 176 | 9,640,185 (87.64%) | 98.27% | See metrics |
fasttext-compressed | 176 | 9,640,185 (87.64%) | 96.81% | See metrics |
gcld3 | 107 | 9,640,185 (85.70%) | 87.11% | See metrics |
langdetect | 55 | 9,640,185 (77.40%) | 92.45% | See metrics |
langid | 97 | 9,640,185 (86.08%) | 89.00% | See metrics |
pycld2 | 83 | 9,640,185 (78.52%) | 86.95% | See metrics |
On Tatoeba-sentences-2021-06-05-common-48
Library | Supported languages | # sentences (% in supported languages) | Aggregated accuracy | Per-language metrics |
---|---|---|---|---|
fasttext | 176 | 7,461,627 (100.00%) | 98.94% | See metrics |
fasttext-compressed | 176 | 7,461,627 (100.00%) | 97.90% | See metrics |
gcld3 | 107 | 7,461,627 (100.00%) | 86.98% | See metrics |
langdetect | 55 | 7,461,627 (100.00%) | 92.47% | See metrics |
langid | 97 | 7,461,627 (100.00%) | 90.15% | See metrics |
pycld2 | 83 | 7,461,627 (100.00%) | 87.12% | See metrics |
On open-subtitles-v2018-100k-per-lang
Library | Supported languages | # sentences (% in supported languages) | Aggregated accuracy | Per-language metrics |
---|---|---|---|---|
fasttext | 176 | 4,236,418 (100.00%) | 80.16% | See metrics |
fasttext-compressed | 176 | 4,236,418 (100.00%) | 75.21% | See metrics |
gcld3 | 107 | 4,236,418 (100.00%) | 73.08% | See metrics |
langdetect | 55 | 4,236,418 (100.00%) | 79.48% | See metrics |
langid | 97 | 4,236,418 (100.00%) | 74.19% | See metrics |
pycld2 | 83 | 4,236,418 (100.00%) | 68.41% | See metrics |
Runtime speed / memory consumption
Running an ML model in production brings more concerns than just accuracy. Speed and memory consumption can move a library from the “it could work” category to a total no-go for production deployment. Therefore, I also present latency and throughput measurements, as well as memory consumption.
I’ve measured these on a MacBook Pro (M1 chip), representing a dev environment, and an EC2 instance (c5.xlarge), representing a production environment. You can read more about the exact methodology in the performance measurement methodology appendix.
Results for the tatoeba-sentences-2021-06-05 dataset:
Library | Machine | Mean latency (ms/sentence) | Latency stddev (ms) | Throughput (sentences/s) | Memory usage |
---|---|---|---|---|---|
langdetect | MacBook Pro | 4.2669 | 7.4051 | 234 | 69MB |
langdetect | c5.xlarge | 4.1710 | 4.5386 | 239 | |
langid | MacBook Pro | 0.7882 | 0.4780 | 1269 | 36MB |
langid | c5.xlarge | 1.1150 | 0.5163 | 897 | |
pycld2 | MacBook Pro | 0.0038 | 0.0042 | 258366 | 0.24MB |
pycld2 | c5.xlarge | 0.0048 | 0.0046 | 208037 | |
gcld3 | MacBook Pro | 0.0572 | 0.0254 | 17494 | 1.52MB |
gcld3 | c5.xlarge | 0.0747 | 0.0357 | 13372 | |
fasttext (lid.176.bin) | MacBook Pro | 0.0089 | 0.0043 | 112223 | 136MB |
fasttext (lid.176.bin) | c5.xlarge | 0.0095 | 0.0058 | 105253 | |
fasttext-compressed (lid.176.ftz) | MacBook Pro | 0.0107 | 0.0064 | 93406 | 3.53MB |
fasttext-compressed (lid.176.ftz) | c5.xlarge | 0.0131 | 0.0096 | 76042 | |
Conclusion
The benchmark shows that different libraries offer different tradeoffs in terms of language coverage, accuracy, speed, and memory footprint.
Langdetect, the oldest model in the benchmark, performs surprisingly well in terms of accuracy, but it is so slow (~1100x slower than pycld2) that it serves better as an accuracy baseline than as a tool for production use.
Pycld2 is the fastest one with a tiny memory footprint, but also performs at the lower end of accuracy.
Fasttext sits in the middle. It’s very fast (~110k sentences/s on the M1 MacBook Pro), covers the largest number of languages, and has the highest accuracy on all three datasets. It also provides two different models, so you can trade a tiny accuracy hit for a much smaller memory footprint.
Overall, fasttext seems to be a great default choice for the language identification task whenever you don’t have the time or a labeled dataset to run your own benchmark.
Reproducing the benchmark
If you want to run the benchmark on your own hardware or confirm that it performs similarly to the numbers shown here, go to the GH repo. It contains all the instructions. If you find it difficult to run, please open an issue there.
Appendix A: On performance measurement methodology
Measuring latency and throughput can mean different things in different contexts, and it’s often not specified precisely enough to be useful. The closest thing to the truth is the benchmarking code that recorded the numbers, but here is the intent behind that code:
- Latency is measured as the duration of a single library call plus storing the result in a temporary variable. It does not include iterating over the collection of texts. Why? Different collections (Python lists, DataFrames, NumPy arrays) have very different element access times. See the sketch after this list.
- Throughput is not measured directly; it is calculated as 1 / (mean latency).
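A simplified sketch of that intent (not the repo’s actual benchmarking code):

```python
import statistics
import time

def benchmark(texts, predict):
    """Time only the predict(text) call itself, never the iteration over texts."""
    latencies = []
    for text in texts:
        start = time.perf_counter()
        _ = predict(text)  # the only thing being timed
        latencies.append(time.perf_counter() - start)
    mean_latency = statistics.mean(latencies)
    return {
        "mean_latency_ms": mean_latency * 1000,
        "latency_stddev_ms": statistics.stdev(latencies) * 1000,
        "throughput_per_s": 1.0 / mean_latency,  # derived, not measured directly
    }
```

For example, benchmark(sentences, langid.classify) would produce the kind of numbers reported in the table above.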
Results on the MacBook Pro were captured while no other user apps were running and the laptop was connected to power, using Python 3.8.9.
Results on EC2 were captured on a c5.xlarge instance running Amazon Linux, also using Python 3.8.9.
Memory footprint is measured as the difference in RSS memory from before the library was loaded to after one inference request finished.
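An illustrative sketch of that RSS-difference approach, using psutil (an assumption about tooling, not necessarily what the benchmark repo uses):

```python
import os
import psutil

process = psutil.Process(os.getpid())
rss_before = process.memory_info().rss

import langid                                   # load the library under test
langid.classify("a warm-up inference request")  # force lazy model initialization

rss_after = process.memory_info().rss
print(f"Approximate memory footprint: {(rss_after - rss_before) / 1024 ** 2:.2f} MB")
```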