model.predict (https://modelpredict.com/feed.xml)

Nuance is difficult to sell on the Internet. Including hybrid work (2023-06-17, https://modelpredict.com/nuance-is-difficult-to-sell-and-hybrid-work)

<p>I’ve had an inkling for a while that it’s difficult to explain nuance and make it appealing on the internet at the same time. Explaining it may not be hard, but selling it seems really difficult.</p>
<p>In the endless stream of content, yours has to hook people to click (now!) and read it. The best way is to take a really strong position, and unfortunately, it has to fit in a sentence or two.</p>
<p>I’ve recently read DHH’s <a href="https://world.hey.com/dhh/hybrid-combines-the-worst-of-office-and-remote-work-d3174e50">Hybrid combines the worst of office and remote work</a>, sent it immediately to my good friend and felt “aha, this is really true, it’s combining the worst of both worlds. The way we work is shit!”</p>
<p>But then I stopped for a few minutes and realised there was more to it than was explained. There’s more than one side of the argument.</p>
<p>The situation is more complex, but it’d get far fewer clicks if the author tried to explain it in a short description that fits a tweet or LinkedIn post, and the main point would dissolve.</p>
<p>There would be no punchy post, just a discussion of the pros and cons of hybrid work. Far less sexy.</p>
<h2 id="on-hybrid-work">On Hybrid work</h2>
<p><em>Before I go into this, let’s just say that I admire DHH. He’s a very prolific engineer, has given enormous value to the world by building Rails, and he’s a good racing driver. I’ve learnt a lot from him.</em></p>
<p>He took only one side of the argument. It’s true that hybrid work combines the worst of office and remote work, but that’s not the whole picture.</p>
<p>It combines the best parts of it, too.</p>
<p>You can live further from the office.
You choose when you come to the office, and when you don’t.
You meet people face to face and get a chance to hash something out quickly.
You can randomly bump into people and arrange a beer.
And you stay at home most of the time, carving out focus time the way you want.
You can walk a dog or go for a run whenever you want.</p>
<p>There are two sides to the coin, but it’s more difficult to get a punchy post when both are presented.</p>
<p>Another punchy argument that doesn’t hold true is that it’s bad managers who are pushing the agenda of returning to the office, temporarily masquerading it as hybrid work. I’m sure that’s happening in some companies, but I don’t see it happening around me.</p>
<p>When the pandemic hit, Intercom started working from home. Some people (including me) moved, often to cities where Intercom doesn’t have an office.</p>
<p>Those who stayed are now back in the office. But guess what: most of them are not in the office five days a week. Some come in two days, some five, some don’t come at all. They use the flexibility the company gives them to arrange their days as they see fit.</p>
<p>It’s not bad managers who forced them back to the office. They go to the office by choice, when they want. And yes, when they go there, they have to call me on Google Hangouts, but they also have to call the other teammate in Dublin, because they decided to save some commute time and stay at home that day.</p>
<p>The pandemic changed the baseline expectations people have around flexibility, and yes, it has a cost: we have to use video calls more.</p>
<h2 id="how-to-recognise-the-missing-nuance">How to recognise the missing nuance?</h2>
<p>The question that comes to mind after thinking about that post is: how do I recognise a similar thing next time?</p>
<p>For me, it comes down to two things:</p>
<ul>
<li>A strong feeling of “aha, something is really good/bad”.</li>
<li>Trying to construct the opposite argument, and seeing how true or false it feels to me.</li>
</ul>
<p>I hope this helps you discover the nuance next time.</p>
<p>Learning how to do it made my world infinitely more wonderful (thanks to my team).</p>

How to make startup scripts for Jupyter kernels reliable? (2022-10-31, https://modelpredict.com/how-to-write-reliable-scripts-ipython)

<p>Running some code whenever your Jupyter notebook starts is handy and easy.</p>
<p>You put some code in <code>$(ipython locate)/profile_default/startup/00-mycode.py</code> and improve your workflow instantly. You can avoid writing <code>import numpy as np</code> in every notebook, check that you’re not running a notebook with production credentials, etc.</p>
<p>There’s a big catch, though — <strong>they fail silently. If there’s an error in the code from a startup script, the notebook will run as if nothing had happened.</strong></p>
<p>Having them fail silently is not a big deal if you’re using them to import numpy — you’ll figure it out the first time you reference the variable. But if you’re using them to set up security guardrails and they fail, you’re in trouble.</p>
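<p>For illustration, a guardrail in a startup file might look like the sketch below (the file name and the <code>PROD_DB_URL</code> environment variable are hypothetical). If this script errors out for any reason, the default behaviour is to carry on as if the check never existed:</p>

```python
# 00-guardrails.py - a hypothetical startup script.
# PROD_DB_URL is an assumed environment variable name, for illustration only.
import os


def check_not_production():
    """Refuse to start a notebook that points at production."""
    if "prod" in os.environ.get("PROD_DB_URL", ""):
        raise RuntimeError("Refusing to start: production credentials detected")


# Startup scripts run at kernel start, so the check fires immediately...
# unless the script itself fails to load, which is exactly the problem.
check_not_production()
```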
<h2 id="how-to-make-startup-scripts-fail-hard">How to make startup scripts fail hard?</h2>
<p>Unfortunately, there is no config flag you can just flip to make them fail, but there is a workaround — <strong>you can write your startup script as an <a href="https://ipython.readthedocs.io/en/stable/config/extensions/index.html">ipython extension</a>.</strong> I’ll explain how to do it, and then explain the anatomy of the solution.</p>
<p>I’ve created <a href="https://github.com/modelpredict/reliable-ipython-startup">a tiny repository</a> with everything needed. There are three steps:</p>
<ol>
<li>Clone the <a href="https://github.com/modelpredict/reliable-ipython-startup">repo</a>.</li>
<li>Copy your startup code to <em>profile_changes/startup_extensions/</em><a href="https://github.com/modelpredict/reliable-ipython-startup/blob/main/profile_changes/startup_extensions/extension_example.py">extension_example.py</a>. The <em>load_ipython_extension</em> function is not necessary unless you need a handle to <em>ipython</em>.</li>
<li>Copy both files to the default ipython profile with the following command:</li>
</ol>
<pre><code class="language-bash">cp -r profile_changes/* $(ipython locate)/profile_default
</code></pre>
<p>These steps make sure that your code is loaded every time a kernel starts, and that kernel startup fails loudly if your script fails.</p>
<aside>
<p>💡 If you change the script in the future, don’t forget to run the copy command again.</p>
</aside>
<aside>
<p>🤖 Extensions and your python code won’t share the variable namespace. <strong>Pre-importing packages like numpy in extensions won’t work by default.</strong></p>
<p>If you need that, read the post all the way to the end.</p>
</aside>
<h2 id="how-does-it-work">How does it work?</h2>
<p>The solution consists of two components.</p>
<p><a href="https://github.com/modelpredict/reliable-ipython-startup/blob/396dbba15cbac2d614a2789855cb1459a7f2623e/profile_changes/startup_extensions/extension_example.py"><strong>1. Very simple ipython extension</strong></a></p>
<p>You’ve seen the content of that one while you were creating the script. Instead of startup scripts, we use a slim ipython extension because we can configure ipython to fail if extension loading fails.</p>
<p>The code is located in <code>.ipython/profile_default/startup_extensions/extension_example.py</code>, which we later on instruct ipython to load on startup.</p>
<p><strong><a href="https://www.notion.so/How-to-make-startup-scripts-for-Jupyter-kernels-reliable-f26e3b4b7b6a4a208b3caf9bda3a623a">2. IPython config</a></strong></p>
<p>We change the config (e.g. <em>~/.ipython/profile_default/ipython_config.py</em>) to load your extension. The change looks something like:</p>
<pre><code class="language-python">import sys
import os
sys.path.append(os.path.join(os.path.dirname(__file__), "startup_extensions"))

c.InteractiveShellApp.reraise_ipython_extension_failures = True
c.InteractiveShellApp.extensions.append('extension_example')
</code></pre>
<p>It does three things:</p>
<ul>
<li>Makes ipython fail if loading any extension fails. Without that, extensions are no better than your startup scripts (line 5).</li>
<li>Makes sure that ipython can find your extensions (lines 1-3).</li>
<li>Loads your <strong>extension_example.py</strong> (line 6).</li>
</ul>
<h2 id="name-np-is-not-defined-or-how-to-make-sure-imported-packages-are-visible-in-notebooks">“name ‘np’ is not defined” or How to make sure imported packages are visible in notebooks?</h2>
<p>Variables created/imported in the extension won’t be visible in notebooks by default. That’s because extensions are loaded as python packages, and don’t share the namespace with your notebooks.</p>
<p>Your extensions can make these variables visible, but you have to do it explicitly by pushing them to <em>ipython.user_ns</em>. Here’s an example:</p>
<pre><code class="language-python">import numpy as np

def load_ipython_extension(ipython):
    # This will make np visible in user namespace (notebook)
    ipython.user_ns['np'] = np
</code></pre>

Comparison of language identification models (2021-08-10, https://modelpredict.com/language-identification-survey)

<p>Detecting the text language (often called language identification) is a common task when building machine learning systems. It takes text as input and predicts the language the given text is written in.</p>
<p>It’s very useful functionality to have (e.g. storing texts in a per-language ElasticSearch index), but picking the library to use isn’t trivial at all.</p>
<p>This survey will give you <strong>up-to-date info on text detection libraries usable in production</strong>. Accuracy, language coverage, speed and memory consumption. Everything you need as an ML engineer to pick a library quickly. Oh, and a bonus. It’s reproducible and extensible so you can run it on your dataset.</p>
<ul>
<li><a href="#why-another-survey">Why another survey?</a></li>
<li><a href="#benchmarked-libraries">Benchmarked libraries</a></li>
<li><a href="#reported-metrics">Reported metrics</a></li>
<li><a href="#datasets">Datasets</a></li>
<li><a href="#benchmark-results">Benchmark results</a></li>
<li><a href="#conclusion">Conclusion</a></li>
<li><a href="#reproducing-the-benchmark">Reproducing the benchmark</a></li>
<li><a href="#performance-measurement">On performance measurement methodology</a></li>
</ul>
<h2 id="why-another-survey">Why another survey?</h2>
<p>Choosing a tool for detecting text language can be quite tricky. There are <a href="https://arxiv.org/abs/1804.08186">good surveys out there</a>, but choosing the tool is difficult even after reading them. Reasons I’ve found it difficult:</p>
<ul>
<li>Different models support different sets of languages. Comparing numbers becomes very difficult.</li>
<li>Running some model in production often means juggling between accuracy and runtime speed/memory constraints, which is rarely discussed in academic papers.</li>
<li>There is no standardized dataset for comparison, and papers often compare with one or two competitors—it’s difficult to compare numbers.</li>
<li>Models’ performances differ by language, but papers often publish just aggregated accuracy.</li>
<li>Many methods never get a developer-friendly python package that’s easy to install and use.</li>
<li>Surveys go too broad.</li>
<li>Very few papers actually provide the code to verify results.</li>
</ul>
<p>I’ll try to fill that gap and make it easier to choose the language identification python package for your use case. More specifically, this survey:</p>
<ul>
<li>Benchmarks only packages that are practical to install and use.</li>
<li>Runs benchmarks on several different datasets, <a href="https://github.com/modelpredict/language-identification-survey/tree/main/datasets">all downloadable</a>.</li>
<li>Brings correctness metrics in aggregate, but also per language.</li>
<li>Brings wall time, latency + throughput, and memory consumption.</li>
<li>Does it all in a way so you can extend and reproduce it on your hardware (<a href="https://github.com/modelpredict/language-identification-survey">see GH repo</a>).</li>
</ul>
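<p>The latency/throughput side of the methodology can be sketched in a few lines. This is a minimal, library-agnostic illustration, not the survey’s actual harness; the <code>benchmark</code> function and the stand-in detector are hypothetical:</p>

```python
import time
import statistics


def benchmark(detect, sentences, repeats=3):
    """Return (mean per-sentence latency in seconds, sentences per second).

    detect: any callable text -> language code.
    """
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        for s in sentences:
            detect(s)
        # Average per-sentence wall time for this pass
        latencies.append((time.perf_counter() - start) / len(sentences))
    mean_latency = statistics.mean(latencies)
    return mean_latency, 1.0 / mean_latency


# Usage with a stand-in detector that always answers "en":
latency, throughput = benchmark(lambda s: "en", ["hello world"] * 1000)
```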
<h2 id="benchmarked-libraries">Benchmarked libraries</h2>
<h3 id="langdetect">Langdetect</h3>
<p><a href="https://pypi.org/project/langdetect/">Langdetect</a> (pypi) is a python port of Nakatani Shuyo’s <a href="https://github.com/shuyo/language-detection">language-detection</a> library. By this point it definitely represents an “old technique” for identifying language, created in 2010, well before the rise of neural networks in NLP. Despite its age, when published it claimed to reach 99%+ accuracy on 49 supported languages. There is no paper published about it, but there is a <a href="https://www.slideshare.net/shuyo/language-detection-library-for-java">presentation about it</a>.</p>
<p><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/langdetect/classification_performance.md#supported-languages-55">Languages supported.</a></p>
<h3 id="langid">Langid</h3>
<p><a href="https://pypi.org/project/langid/">Langid</a> (pypi) is a popular library coming from academia that provides a really easy-to-use python package. It’s designed to be fast (see results), with minimal dependencies, and to be markup-agnostic, meaning it should perform reasonably well even when run on an entire XML or HTML document. It uses a multinomial naive Bayes learner. The details of the library are described in <a href="https://aclanthology.org/I11-1062/">this paper</a>.</p>
<p>The pretrained model supports 97 languages. <a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/langid/classification_performance.md#supported-languages-97">Languages supported.</a></p>
<p>It’s possible and easy to train it on your dataset.</p>
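<p>To give a feel for the model family langid belongs to, here is a toy multinomial naive Bayes classifier over character trigrams. This is an illustration only; langid’s actual features, feature selection and training data are far more sophisticated:</p>

```python
import math
from collections import Counter, defaultdict


def trigrams(text):
    return [text[i:i + 3] for i in range(len(text) - 2)]


class ToyNaiveBayesLangID:
    """Multinomial naive Bayes over character trigrams (Laplace-smoothed)."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # lang -> trigram counts
        self.totals = Counter()             # lang -> total trigram count

    def fit(self, samples):
        # samples: iterable of (text, language_code) pairs
        for text, lang in samples:
            grams = trigrams(text)
            self.counts[lang].update(grams)
            self.totals[lang] += len(grams)

    def predict(self, text):
        vocab_size = len({g for c in self.counts.values() for g in c})

        def log_likelihood(lang):
            denom = self.totals[lang] + vocab_size
            return sum(
                math.log((self.counts[lang][g] + 1) / denom)  # +1: Laplace smoothing
                for g in trigrams(text)
            )

        return max(self.counts, key=log_likelihood)
```

<p>With a couple of training sentences per language, the classifier already separates, say, English from German text, because the per-language trigram distributions barely overlap.</p>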
<h3 id="pycld2">pycld2</h3>
<p><a href="https://pypi.org/project/pycld2/">pycld2 (pypi)</a> is a python-binding for Compact Language Detect 2, Google’s algorithm originally used in Chrome browser. CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text, either plain text or HTML/XML. The design target is web pages of at least 200 characters (about two sentences); CLD2 is not designed to do well on very short text, lists of proper names, part numbers, etc. It’s highly optimized for space and speed.</p>
<p><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/pycld2/classification_performance.md">Languages supported</a></p>
<h3 id="gcld3">Gcld3</h3>
<p><a href="https://pypi.org/project/gcld3/">gcld3 (pypi)</a> is a python binding to <a href="https://github.com/google/cld3">CLD3</a>, Google’s neural network model for language identification. It’s an evolution of the CLD2 algorithm and replaced it in the Chrome browser. It supports 107 languages.</p>
<p><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/gcld3/classification_performance.md#supported-languages-107">Languages supported</a></p>
<h3 id="fasttext">Fasttext</h3>
<p><a href="https://pypi.org/project/fasttext/">Fasttext (pypi)</a> is Facebook’s library for efficient learning of word representations and sentence classification. It’s developed for production use cases, so runtime and memory constraints are important concerns for this library. The library doesn’t come with language classification out of the box, but the library page offers two models for this task:</p>
<ul>
<li><a href="https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin">lid.176.bin</a>, which is faster and slightly more accurate, but has a file size of 126MB.</li>
<li><a href="https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz">lid.176.ftz</a>, which is the compressed version of the model, with a file size of 917kB.</li>
</ul>
<p><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/fasttext/classification_performance.md#supported-languages-176">Languages supported</a>. It’s possible to train the model on your own data.</p>
<h2 id="reported-metrics">Reported metrics</h2>
<p>I report aggregated accuracy for the whole dataset, but also accuracy, precision, recall and f1 score per language, for each of the models.</p>
<p>Given that no classifier supports all languages in the first dataset, the aggregated accuracy metric is calculated only on the subset of the original dataset containing supported languages. That affects only the first dataset, given that the second and third datasets contain only languages supported by all classifiers.</p>
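<p>The aggregation rule above can be sketched in a few lines; <code>supported_subset_accuracy</code> is a hypothetical helper name, not code from the benchmark repo:</p>

```python
def supported_subset_accuracy(pairs, supported):
    """pairs: (true_lang, predicted_lang) tuples; supported: set of language codes.

    Sentences whose true language the library does not support are excluded
    before accuracy is computed, so each library is only scored on languages
    it can possibly get right.
    """
    scored = [(t, p) for t, p in pairs if t in supported]
    if not scored:
        return None
    return sum(t == p for t, p in scored) / len(scored)
```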
<h2 id="datasets">Datasets</h2>
<h3 id="tatoeba-sentences-2021-06-05">Tatoeba-sentences-2021-06-05</h3>
<p>This dataset is a dump of <a href="https://tatoeba.org/en/">Tatoeba sentences</a> made on the 5th of June 2021.</p>
<p>Quick specs:</p>
<ul>
<li>Consists of 9,640,185 sentences in 399 languages.</li>
<li>Sentences are very short. They consist of 35 characters on average (counted with python’s len function).</li>
<li>Top 4 languages are English, Russian, Italian, Turkish. You can see more details about different languages in the dataset <a href="https://github.com/modelpredict/language-identification-survey/blob/main/datasets/tatoeba-sentences-2021-06-05/stats.md">here</a> (including breakdown per language).</li>
</ul>
<p><a href="https://github.com/modelpredict/language-identification-survey/blob/main/datasets/tatoeba-sentences-2021-06-05/stats.md">Stats per language</a> | <a href="https://github.com/modelpredict/language-identification-survey/blob/main/datasets/tatoeba-sentences-2021-06-05/download">Download script</a>.</p>
<h3 id="tatoeba-sentences-2021-06-05-common-48">Tatoeba-sentences-2021-06-05-common-48</h3>
<p>Given that comparing libraries’ accuracies can be difficult on the previous dataset, I’ve created a subset of it limited to the 48 common languages. That allows direct comparison between libraries.</p>
<p><a href="https://github.com/modelpredict/language-identification-survey/blob/main/datasets/tatoeba-sentences-2021-06-05-common-48/stats.md">Stats per language</a> | <a href="https://github.com/modelpredict/language-identification-survey/blob/main/datasets/tatoeba-sentences-2021-06-05-common-48/download">Download script</a>.</p>
<h3 id="open-subtitles-v2018-100k-per-lang">Open-subtitles-v2018-100k-per-lang</h3>
<p>This dataset is a subset of the <a href="https://opus.nlpl.eu/OpenSubtitles-v2018.php">OpenSubtitles v2018</a> dataset, limited to the top 100k conversations per language, for languages supported by all libraries.</p>
<p>The dataset is used to make sure the fasttext library doesn’t have an unfair advantage, given that it was trained on a snapshot of the Tatoeba dataset.</p>
<p>Keep in mind that this dataset is quite messy. It has many short sequences (&lt;20 characters) and sequences made of very few or no words.</p>
<p><a href="https://github.com/modelpredict/language-identification-survey/blob/main/datasets/open-subtitles-v2018-100k-per-lang/stats.md">Stats per language</a> | <a href="https://github.com/modelpredict/language-identification-survey/blob/main/datasets/open-subtitles-v2018-100k-per-lang/download">Download script</a>.</p>
<h2 id="benchmark-results">Benchmark results</h2>
<h3 id="on-tatoeba-sentences-2021-06-05">On Tatoeba-sentences-2021-06-05</h3>
<p>These are results on the biggest dataset, containing 300+ languages. Be aware that aggregated accuracy is calculated only on the supported subset, which means that directly comparing two classifiers might mean comparing apples and oranges. Per-language metrics are linked and might be useful for deeper analysis.</p>
<table>
<thead>
<tr>
<th style="text-align: left">Library</th>
<th style="text-align: left">Supported languages</th>
<th style="text-align: left"># sentences supported</th>
<th style="text-align: left">Aggregated accuracy</th>
<th style="text-align: left">Per language metrics</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">fasttext</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/fasttext/classification_performance.md#supported-languages">176</a></td>
<td style="text-align: left">9,640,185 (87.64%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/fasttext/classification_performance.md">98.27%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/fasttext/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
<tr>
<td style="text-align: left">fasttext-compressed</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/fasttext-compressed/classification_performance.md#supported-languages">176</a></td>
<td style="text-align: left">9,640,185 (87.64%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/fasttext-compressed/classification_performance.md">96.81%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/fasttext-compressed/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
<tr>
<td style="text-align: left">gcld3</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/gcld3/classification_performance.md#supported-languages">107</a></td>
<td style="text-align: left">9,640,185 (85.70%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/gcld3/classification_performance.md">87.11%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/gcld3/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
<tr>
<td style="text-align: left">langdetect</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/langdetect/classification_performance.md#supported-languages">55</a></td>
<td style="text-align: left">9,640,185 (77.40%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/langdetect/classification_performance.md">92.45%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/langdetect/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
<tr>
<td style="text-align: left">langid</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/langid/classification_performance.md#supported-languages">97</a></td>
<td style="text-align: left">9,640,185 (86.08%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/langid/classification_performance.md">89.00%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/langid/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
<tr>
<td style="text-align: left">pycld2</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/pycld2/classification_performance.md#supported-languages">83</a></td>
<td style="text-align: left">9,640,185 (78.52%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/pycld2/classification_performance.md">86.95%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05/pycld2/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
</tbody>
</table>
<h3 id="on-tatoeba-sentences-2021-06-05-common-48">On Tatoeba-sentences-2021-06-05-common-48</h3>
<table>
<thead>
<tr>
<th style="text-align: left">Library</th>
<th style="text-align: left">Supported languages</th>
<th style="text-align: left"># sentences supported</th>
<th style="text-align: left">Aggregated accuracy</th>
<th style="text-align: left">Per language metrics</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">fasttext</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/fasttext/classification_performance.md#supported-languages">176</a></td>
<td style="text-align: left">7,461,627 (100.00%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/fasttext/classification_performance.md">98.94%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/fasttext/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
<tr>
<td style="text-align: left">fasttext-compressed</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/fasttext-compressed/classification_performance.md#supported-languages">176</a></td>
<td style="text-align: left">7,461,627 (100.00%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/fasttext-compressed/classification_performance.md">97.90%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/fasttext-compressed/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
<tr>
<td style="text-align: left">gcld3</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/gcld3/classification_performance.md#supported-languages">107</a></td>
<td style="text-align: left">7,461,627 (100.00%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/gcld3/classification_performance.md">86.98%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/gcld3/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
<tr>
<td style="text-align: left">langdetect</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/langdetect/classification_performance.md#supported-languages">55</a></td>
<td style="text-align: left">7,461,627 (100.00%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/langdetect/classification_performance.md">92.47%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/langdetect/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
<tr>
<td style="text-align: left">langid</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/langid/classification_performance.md#supported-languages">97</a></td>
<td style="text-align: left">7,461,627 (100.00%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/langid/classification_performance.md">90.15%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/langid/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
<tr>
<td style="text-align: left">pycld2</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/pycld2/classification_performance.md#supported-languages">83</a></td>
<td style="text-align: left">7,461,627 (100.00%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/pycld2/classification_performance.md">87.12%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/tatoeba-sentences-2021-06-05-common-48/pycld2/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
</tbody>
</table>
<h3 id="on-open-subtitles-v2018-100k-per-lang">On open-subtitles-v2018-100k-per-lang</h3>
<table>
<thead>
<tr>
<th style="text-align: left">Library</th>
<th style="text-align: left">Supported languages</th>
<th style="text-align: left"># sentences supported</th>
<th style="text-align: left">Aggregated accuracy</th>
<th style="text-align: left">Per language metrics</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">fasttext</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/fasttext/classification_performance.md#supported-languages">176</a></td>
<td style="text-align: left">4,236,418 (100.00%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/fasttext/classification_performance.md">80.16%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/fasttext/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
<tr>
<td style="text-align: left">fasttext-compressed</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/fasttext-compressed/classification_performance.md#supported-languages">176</a></td>
<td style="text-align: left">4,236,418 (100.00%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/fasttext-compressed/classification_performance.md">75.21%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/fasttext-compressed/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
<tr>
<td style="text-align: left">gcld3</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/gcld3/classification_performance.md#supported-languages">107</a></td>
<td style="text-align: left">4,236,418 (100.00%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/gcld3/classification_performance.md">73.08%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/gcld3/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
<tr>
<td style="text-align: left">langdetect</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/langdetect/classification_performance.md#supported-languages">55</a></td>
<td style="text-align: left">4,236,418 (100.00%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/langdetect/classification_performance.md">79.48%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/langdetect/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
<tr>
<td style="text-align: left">langid</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/langid/classification_performance.md#supported-languages">97</a></td>
<td style="text-align: left">4,236,418 (100.00%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/langid/classification_performance.md">74.19%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/langid/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
<tr>
<td style="text-align: left">pycld2</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/pycld2/classification_performance.md#supported-languages">83</a></td>
<td style="text-align: left">4,236,418 (100.00%)</td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/pycld2/classification_performance.md">68.41%</a></td>
<td style="text-align: left"><a href="https://github.com/modelpredict/language-identification-survey/blob/main/results/open-subtitles-v2018-100k-per-lang/pycld2/classification_performance.md#metrics-per-language">See metrics</a></td>
</tr>
</tbody>
</table>
<h3 id="runtime-speed--memory-consumption">Runtime speed / memory consumption</h3>
<p>Running an ML model in production brings more concerns than accuracy alone. Speed and memory consumption can move a library from the “it could work” category to a total no-go for production deployments. Therefore, I also present latency/throughput measurements and memory consumption.</p>
<p>I’ve measured it on a MacBook Pro (M1 chip), representing the dev environment, and an EC2 machine (c5.xlarge), representing the production environment. You can read more on the exact <a href="#performance-measurement">measurement methodology</a> in the performance methodology section.</p>
<p>Results for the tatoeba-sentences-2021-06-05 dataset:</p>
<table>
<tr>
<td>Library</td>
<td></td>
<td>Mean Latency (ms/sentence)</td>
<td>Latency stddev</td>
<td>Throughput (sentence/s)</td>
<td>Memory usage</td>
</tr>
<tr>
<td rowspan="2">langdetect</td>
<td>MacBook Pro</td>
<td>4.2669</td>
<td>7.4051</td>
<td>234</td>
<td rowspan="2">69MB</td>
</tr>
<tr>
<td>c5.xlarge</td>
<td>4.1710</td>
<td>4.5386</td>
<td>239</td>
</tr>
<tr>
<td rowspan="2">langid</td>
<td>MacBook Pro</td>
<td>0.7882</td>
<td>0.4780</td>
<td>1269</td>
<td rowspan="2">36MB</td>
</tr>
<tr>
<td>c5.xlarge</td>
<td>1.1150</td>
<td>0.5163</td>
<td>897</td>
</tr>
<tr>
<td rowspan="2">pycld2</td>
<td>MacBook Pro</td>
<td><b>0.0038</b></td>
<td><b>0.0042</b></td>
<td><b>258366</b></td>
<td rowspan="2"><b>0.24MB</b></td>
</tr>
<tr>
<td>c5.xlarge</td>
<td>0.0048</td>
<td>0.0046</td>
<td>208037</td>
</tr>
<tr>
<td rowspan="2">gcld3</td>
<td>MacBook Pro</td>
<td>0.0572</td>
<td>0.0254</td>
<td>17494</td>
<td rowspan="2">1.52MB</td>
</tr>
<tr>
<td>c5.xlarge</td>
<td>0.0747</td>
<td>0.0357</td>
<td>13372</td>
</tr>
<tr>
<td rowspan="2">
<p>fasttext (<a href="https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin">lid.176.bin</a>)</p>
</td>
<td>MacBook Pro</td>
<td>0.0089</td>
<td>0.0043</td>
<td>112223</td>
<td rowspan="2">136MB</td>
</tr>
<tr>
<td>c5.xlarge</td>
<td>0.0095</td>
<td>0.0058</td>
<td>105253</td>
</tr>
<tr>
<td rowspan="2">
<p>fasttext-compressed (<a href="https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz">lid.176.ftz</a>)</p>
</td>
<td>MacBook Pro</td>
<td>0.0107</td>
<td>0.0064</td>
<td>93406</td>
<td rowspan="2">3.53MB</td>
</tr>
<tr>
<td>c5.xlarge</td>
<td>0.0131</td>
<td>0.0096</td>
<td>76042</td>
</tr>
</table>
<h2 id="conclusion">Conclusion</h2>
<p>The benchmark shows that different libraries offer different tradeoffs in terms of language coverage, accuracy, speed, and memory footprint.</p>
<p>Langdetect, the oldest model in the benchmark, performs surprisingly well in terms of accuracy, but it is so slow (~1100x slower than pycld2) that it serves better as an accuracy baseline than as a tool used in production.</p>
<p>Pycld2 is the fastest one with a tiny memory footprint, but also performs at the lower end of accuracy.</p>
<p>Fasttext sits in the middle. It’s very fast (~120k sentences/s on an MBP M1), it covers the largest number of languages, and it has the highest accuracy on all three datasets. It also ships two models, so you can trade a tiny accuracy hit for a much smaller memory footprint.</p>
<p>Overall, <strong>fasttext seems to be a great default choice for the language identification task</strong> whenever you don’t have time or a labeled dataset to benchmark it on.</p>
<h2 id="reproducing-the-benchmark">Reproducing the benchmark</h2>
<p>If you want to run the benchmark on your own hardware or confirm that it performs similarly to the numbers seen here, head to the <a href="https://github.com/modelpredict/language-identification-survey">GitHub repo</a>. It contains all the instructions. If you find it difficult to run, please open an issue there.</p>
<h2 id="performance-measurement">Appendix A: On performance measurement methodology</h2>
<p>Measuring latency and throughput can mean different things in different contexts, and the methodology is often not specified precisely enough to be useful. The closest thing to the truth is the <a href="https://github.com/modelpredict/language-identification-survey/blob/9044f6883fff8be9e47a23de43d5e02675f0e7a1/models/gcld3.py#L16-L18">benchmarking code</a> that recorded the numbers, but I’ll explain the intent behind it:</p>
<ul>
<li>Latency is measured as the duration of a single library call, including storing the result in a temporary variable. It does not include iterating over the array of texts. Why? Different collections (python lists, DataFrames, numpy arrays) have very different access times to specific elements.</li>
<li>Throughput is not measured, but calculated as 1/(mean latency).</li>
</ul>
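<p>The intent can be sketched in a few lines of python (function and variable names here are illustrative, not the actual benchmark code):</p>

```python
import time
from statistics import mean, stdev

def measure(predict, texts):
    """Time each library call individually; iteration overhead is excluded."""
    latencies = []
    for text in texts:
        start = time.perf_counter()
        result = predict(text)            # the call itself, stored in a temporary
        latencies.append(time.perf_counter() - start)
    mean_s = mean(latencies)
    return {
        "mean_latency_ms": mean_s * 1000,
        "latency_stddev_ms": stdev(latencies) * 1000,
        "throughput_per_s": 1 / mean_s,   # throughput = 1 / (mean latency)
    }

# Example run against a trivial stand-in for a language-id model:
stats = measure(str.lower, ["Hello world"] * 1000)
```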
<p>Results on the MacBook Pro were captured while no other user apps were running and the laptop was connected to power. I used python 3.8.9.</p>
<p>Results on the EC2 machine were captured on a c5.xlarge instance running Amazon Linux. I used python 3.8.9.</p>
<p>Memory footprint is measured as the difference in RSS memory before and after the library was loaded and one inference request completed.</p>Up-to-date info on language identification libraries usable in production. Accuracy, language coverage, speed and memory consumption. Everything you need as an ML engineer to pick a library quickly.Github Actions: using python version from .python-version file (pyenv)2021-04-08T00:00:00+00:002021-04-08T00:00:00+00:00https://modelpredict.com/pyenv-version-in-github-action<p>I was recently creating a CI pipeline for a toy ML project. Its job was to make sure that our training accuracies stayed within certain thresholds as we changed the code. We were hosting the project on GitHub, so GitHub Actions seemed like a great fit to run our CI.</p>
<p>I’d found GitHub’s <a href="https://github.com/actions/setup-python">actions/setup-python@v2</a> to set up a specific python version, but it wasn’t clear how to use the version set by <a href="https://github.com/pyenv/pyenv">pyenv</a> (a great tool for managing multiple python versions, btw), i.e. the one written in the <code>.python-version</code> file.</p>
<p>Turns out combining <a href="https://docs.github.com/en/actions/reference/context-and-expression-syntax-for-github-actions">action contexts</a> and <a href="https://docs.github.com/en/actions/reference/environment-variables">environment variables</a> does the trick.</p>
<ol>
<li>One step reads the <code>.python-version</code> file and writes its contents to an environment variable</li>
<li><a href="https://github.com/actions/setup-python">actions/setup-python@v2</a> installs the version from the environment variable.</li>
</ol>
<p>Here’s the snippet defining the job steps</p>
<pre><code class="language-yaml">steps:
  - uses: actions/checkout@v2
  - name: Get python version
    run: |
      python_version=$(cat .python-version)
      echo "python_version=${python_version}" >> $GITHUB_ENV
  - name: Set up Python ${{ env.python_version }}
    uses: actions/setup-python@v2
    with:
      python-version: ${{ env.python_version }}
</code></pre>Github Actions: how to use pyenv's python version - one from .python-version file.Why requirements.txt isn’t enough2020-06-17T00:00:00+00:002020-06-17T00:00:00+00:00https://modelpredict.com/why-is-not-requirements-txt-enough<p>How are you maintaining your requirements.txt file? Are you adding and removing your dependencies manually, or are you just running <code>pip freeze > requirements.txt</code>?</p>
<p>Whichever of these two ways you use, you’re doing it wrong. <code>requirements.txt</code> alone is not enough to build reproducible environments that run the same wherever you put them. That’s obviously a problem: you want your production environment to be tightly defined.</p>
<h2 id="scenario-1-manually-editing-requirementstxt">Scenario #1: manually editing requirements.txt</h2>
<p>This is how everybody in python land starts. You create a <code>requirements.txt</code> file and start listing the dependencies your app needs. After editing the file, you run <code>pip install -r requirements.txt</code> to install all the dependencies into your virtual environment.</p>
<p>But here is the problem: your requirements.txt contains just the first-degree dependencies and their versions. Your dependencies also have dependencies (2nd+ degree), and those versions are not necessarily locked down.</p>
<p>Not having these versions locked down means that running <code>pip install -r requirements.txt</code> on different systems or at different points of time will resolve to different sets of package versions. It opens a space for security issues and your app breaking completely.</p>
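<p>To make the failure mode concrete, here’s a hypothetical illustration (package names and versions are examples only):</p>

```text
# requirements.txt — only first-degree pins
requests==2.24.0

# Transitive dependencies installed alongside it, NOT pinned anywhere:
#   urllib3, certifi, idna, chardet — versions resolved at install time,
#   so two installs a month apart can produce different environments.
```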
<h2 id="scenario-2-pip-freeze--requirementstxt">Scenario #2: pip freeze > requirements.txt</h2>
<p>Once you’re aware of the problem above, the solution is simple, you just run <code>pip freeze > requirements.txt</code>.</p>
<p>That certainly solves the 2nd+ degree dependency versions, but it brings a new problem – once you want to delete a dependency, how do you know you’ve also deleted all of its dependencies?</p>
<p>There is no easy way to know. If you spend a little more time with the problem, you’ll probably figure out that you need two files - one that defines the direct dependencies of your app and a second one that locks down all transitive dependencies and their versions (a lockfile). That’s a standard solution in other communities (JavaScript, ruby, Rust), but pip brings neither conventions nor a solution for this.</p>
<p>This is where my anxiety started kicking in – I can easily create these two files myself, but there are no standard names for them, and I’d have to teach everybody on the team how to use a setup I came up with. Then I found pip-tools.</p>
<h2 id="solution-use-pip-compile-from-pip-tools">Solution: use pip-compile (from pip-tools)</h2>
<p>Pip-tools is a set of two tools – pip-compile and pip-sync. Pip-compile solves exactly the problems I’ve described above. It brings a workflow (read: convention) and a tool to maintain both files.</p>
<h3 id="how-to-use-pip-compile">How to use pip-compile?</h3>
<p>Create a <code>requirements.in</code> file and list just the direct dependencies of your app. The same way you’d do with <code>requirements.txt</code> in Scenario #1. Then run <code>pip-compile</code> (or <code>./venv/bin/pip-compile</code> if not installed globally) and it will create <code>requirements.txt</code>, with all the dependencies listed and all the versions locked.</p>
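<p>Here’s roughly what the two files end up looking like (package names and version numbers are illustrative):</p>

```text
# requirements.in — edited by hand, direct dependencies only
django

# requirements.txt — generated by pip-compile, every version locked
asgiref==3.2.10       # via django
django==3.0.7
pytz==2020.1          # via django
sqlparse==0.3.1       # via django
```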
<p><img src="/images/when-to-use-pip-tools/pip-compile.gif" alt="" title="pip-compile evaluates all the dependencies from requirements.in and locks their version into requirements.txt" /></p>
<p>If you’ve used requirements.txt, notice that you can just drop in pip-compile and the rest of your system does not have to change. Whatever was building your app can still use <code>requirements.txt</code>, whoever was just hitting <code>pip install -r requirements.txt</code> can continue doing that. The only thing that has to change is how you add or remove dependencies.</p>
<h2 id="wrapping-up">Wrapping up</h2>
<p>Pip-compile is a simple tool for locking down the versions of your dependencies. It’s widely used, and it brings a sense of standardization so your team doesn’t have to learn some bespoke setup. Its architecture follows the unix philosophy – it solves one specific problem and can be dropped into your project without changing the other systems interacting with your app.</p>
<p>If you don’t like the unix philosophy, <a href="https://modelpredict.com/python-dependency-management-tools">Poetry and Pipenv</a> can be used as an all-in-one solution, tackling the versioning problem, too.</p>If you're using only requirements.txt to manage your dependencies, you're in trouble.The minimal conda cheatsheet2020-06-10T00:00:00+00:002020-06-10T00:00:00+00:00https://modelpredict.com/minimal-conda-cheatsheet<p>If you’re not using conda regularly, it’s almost certain you forget how to use it from usage to usage. I’ve compiled a list of commands I use most often.</p>
<p>You can <a href="https://conda-cheatsheet.modelpredict.com/">download the PDF or Google doc version</a> (fits on one page!).</p>
<table class="conda-cheatsheet">
<thead>
<tr>
<td>Environment</td>
<td>Package</td>
</tr>
</thead>
<tbody>
<tr>
<td class="command-category">Activate/switch</td>
<td class="command-category">Install</td>
</tr>
<tr>
<td>
<span class="definition">conda activate [env_name]</span><br />
<span class="example">conda activate pytorch_py36</span>
</td>
<td>
<span class="definition">conda install [package_spec ...]</span><br />
<span class="example">conda install numpy scikit-learn=0.23.1</span>
</td>
</tr>
<tr>
<td class="command-category">Create</td>
<td class="command-category">from specific channel</td>
</tr>
<tr>
<td>
<span class="definition">conda create --name [env_name] [package_spec …]</span><br />
<span class="example">conda create --name math_py36 python=3.6 numpy</span>
</td>
<td>
<span class="definition">conda install -c [channel] [package_spec …]</span><br />
<span class="example">conda install -c conda-forge numpy</span>
</td>
</tr>
<tr>
<td class="command-category">Create from file</td>
<td class="command-category">Update package</td>
</tr>
<tr>
<td rowspan="3">
<span class="definition">conda create --file [path]</span><br />
<span class="example">conda create --file exported_env.txt</span>
</td>
<td>
<span class="definition">conda update [package_spec …]</span><br />
<span class="example">conda update numpy scikit-learn=0.23.1</span>
</td>
</tr>
<tr>
<td class="command-category">Delete package</td>
</tr>
<tr>
<td>
<span class="definition">conda remove [package_name …]</span><br />
<span class="example">conda remove numpy scikit-learn</span>
</td>
</tr>
<tr class="cheatsheet-note">
<td colspan="2">
* package_spec - package_name or package_name=package_version
</td>
</tr>
</tbody>
</table>The smallest conda cheatsheet you'll find around.Overview of python dependency management tools2020-06-01T00:00:00+00:002020-06-01T00:00:00+00:00https://modelpredict.com/python-dependency-management-tools<p>Totally confused by all the tools for managing dependencies? Pip, venv, Docker, conda, virtualenvwrapper, pipenv, … Which one should you use? Why do we even have all these different tools? Can they work together?</p>
<p>No wonder. The world of Python dependency management is a mess, but once you understand the tools and why they exist, it’s going to be easier to choose the one you want and deal with the others in environments where you can’t choose your favorite ones.</p>
<p>I’ll briefly describe each tool, why it was created, and the problems it tackles. At the end of the post, you’ll find a table summarizing all the information and the usual setups people use.</p>
<p>Jump to: <a href="#pip">pip</a> | <a href="#venv">venv</a> | <a href="#pip-tools">pip-tools</a> | <a href="#pyenv">pyenv</a> | <a href="#conda">conda</a> | <a href="#pipenv">pipenv</a> | <a href="#poetry">poetry</a> | <a href="#docker">Docker</a>.
<br />
<a href="#all-solutions-compared">All solutions compared</a> | <a href="#usual-setups">Usual setups</a></p>
<h2 id="pip">pip</h2>
<p><a href="https://pypi.org/project/pip/">Pip</a> (<strong>p</strong>ackage <strong>i</strong>nstaller for <strong>p</strong>ython) is the most basic package installer in the python land. It comes preinstalled with most python installations so it’s likely you never had to install it yourself.</p>
<p>Installing a package is as simple as running <code>pip install torch</code>. That command talks to PyPI (The Python Package Index), downloads the package, and makes it available to the current python installation.</p>
<p>It’s a very primitive tool. It knows nothing about different python versions or Jupyter kernels.</p>
<p><img src="images/python-dependency-management/pip_install_requests.png" alt="" title="Pip just installs the package in whatever _site-packages_ directory active python installation is pointed to. In this case, it's one activated by pyenv - ~/.pyenv/versions/3.6.3/lib/python3.6/site-packages." /></p>
<p>Problems pip solves:</p>
<ul>
<li>Installing python packages</li>
</ul>
<h2 id="venv">venv</h2>
<p><a href="https://docs.python.org/3/library/venv.html">venv</a> is a tool for creating lightweight virtual environments.</p>
<p>The most common use case is creating an environment per app. It makes sure apps don’t share packages between themselves and they don’t share packages with the system’s python installation. Each environment can use any version of the same package and they won’t collide.</p>
<p><img src="images/python-dependency-management/venv_install_requests_marked.png" alt="" title="Activating a virtual environment will make following pip installs belong to that virtual environment. You can also see that python run within virtual environment will look into site-packages within that virtual environment." /></p>
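<p>The workflow in the screenshot boils down to a few commands (the path is illustrative):</p>

```shell
# Create an isolated environment for one app
python3 -m venv /tmp/demo-venv

# Activate it: python and pip now resolve inside the environment
. /tmp/demo-venv/bin/activate
which pip          # points at /tmp/demo-venv/bin/pip, not the system pip

# Leave the environment again
deactivate
```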
<p>Problems venv solves:</p>
<ul>
<li>Isolating packages between apps</li>
</ul>
<h4 id="how-do-venv-and-pip-interact">How do venv and pip interact?</h4>
<p>They’re both part of the standard python tooling, they tackle very different problems, and they play together really well. You’re encouraged to use pip for installing packages within virtual environments.</p>
<h2 id="pip-tools">pip-tools</h2>
<p><a href="https://github.com/jazzband/pip-tools">pip-tools</a> is packaging two different tools: <strong>pip-compile</strong> and <strong>pip-sync</strong>.</p>
<p>They’re relatively simple, but tackle a very important problem - keeping your environment reproducible and consistent. People who care about it the most are production engineers.</p>
<p>The most common way to define dependencies is to create a <code>requirements.txt</code> file and list them with specific versions. These dependencies often have other dependencies not listed in <code>requirements.txt</code>.</p>
<p>Unlisted dependencies cause several problems. The most common one is running different package versions in different environments (1). The second most common is running different environments, depending on the time you’ve created them (2). Both eventually lead to the app breaking inconsistently (e.g. uninstalling a package will leave its dependencies in the existing environment, but they won’t exist if I create a new environment and install everything from requirements.txt).</p>
<p>Pip-compile takes a <code>setup.py</code>/<code>requirements.in</code> file and compiles a <code>requirements.txt</code> with all dependencies locked to specific versions. Some tools (npm, yarn, bundler) call this a lockfile. It solves the first problem I’ve described.</p>
<p>Pip-sync takes a compiled <code>requirements.txt</code> and ensures that your current environment contains exactly the packages and versions defined in it. It solves the second problem I’ve described.</p>
<p>Problems pip-tools solves:</p>
<ul>
<li>Environment reproducibility</li>
</ul>
<h4 id="how-does-pip-tools-interact-with-pip-and-other-tools">How does pip-tools interact with pip and other tools?</h4>
<p>pip-tools is a composable tool that solves one specific problem - environment reproducibility. It is designed to be installed inside a virtual environment (created by venv) and wraps pip to install defined packages.</p>
<h2 id="pyenv">pyenv</h2>
<p>Python became a wildly popular language and all major operating systems started building on top of it and bundling it out of the box. That’s why you can just type <code>python</code> in your terminal on a freshly installed Linux or Mac OS without installing it yourself.</p>
<p>But user applications are built in python, too. And they often need a different version of python! The combination of these two created the need to run different versions of python, depending on the application.</p>
<p><a href="https://github.com/pyenv/pyenv">Pyenv</a> was created to solve the problem of installing and switching between different versions of python on the same machine.</p>
<p>It’s a handy tool on developer machines because it keeps the system version of python (needed for the OS to run properly), but can install and switch between different versions for different applications (based on current path, user etc.).</p>
<p>Here’s an example of switching between the system version and 3.6.3. Running <code>pyenv local 3.6.3</code> makes pyenv activate version 3.6.3 the next time you navigate to that directory.</p>
<p><img src="images/python-dependency-management/pyenv_versions.png" alt="" title="Pyenv allows setting the python version for specific directories. That way, you don't have to change it every time you come back to the project." /></p>
<p>Problems pyenv solves:</p>
<ul>
<li>Installing different python versions</li>
<li>Using different python versions in different contexts</li>
</ul>
<h4 id="how-do-pyenv-and-pip-interact">How do pyenv and pip interact?</h4>
<p>Pyenv and pip complement each other. You can think of pyenv as a container/shell for pip. Pip installs packages for the current python version, whatever pyenv sets it to. In fact, the <code>pip</code> commands from two environments are different binaries and do not know about each other.</p>
<p><img src="images/python-dependency-management/pyenv_which_pip.png" alt="" title="Different python versions resolve pip3 differently." /></p>
<h2 id="conda">Conda</h2>
<p><em>You may know this tool under different names - Anaconda or miniconda.</em></p>
<p>Once the scientific community started using python seriously, the requirements for package management tools in python land increased. More specifically, python became too slow for some purely computational workloads, so numpy and scipy were born. These libraries are not really written in python – they are written in C and just wrapped as python libraries.</p>
<p>Compiling such libraries brings a set of challenges<sup id="fnref:conda-compiling-challenges" role="doc-noteref"><a href="#fn:conda-compiling-challenges" class="footnote" rel="footnote">1</a></sup> since they (more or less) have to be compiled on your machine for maximum performance and proper linking with libraries like glibc.</p>
<p><a href="https://docs.conda.io/en/latest/">Conda</a> was introduced as an all-in-one solution to manage python environments for the scientific community.</p>
<p>It took a different approach. Instead of using a fragile process of compiling libraries on your machine, libraries are precompiled and just downloaded when you request them. Unfortunately, the solution comes with a caveat - conda does not use PyPI, the most popular index of python packages.</p>
<p>Conda has its own package index with multiple channels (the <a href="https://anaconda.org/anaconda/repo">anaconda channel</a> is maintained by the creators of conda and is the most reliable one). The anaconda channel isn’t as complete as PyPI, and packages that exist in both places are often a few versions behind PyPI. Other channels update packages faster, but I strongly suggest checking who maintains the respective packages (often not the library authors!).</p>
<p><img src="images/python-dependency-management/conda_list.png" alt="" title="Conda environments encapsulate python, non-python binary (openssl), and python (werkzeug) packages. You can see that activating different environments can swap all of these." /></p>
<p>Altogether, Conda is tackling these problems:</p>
<ul>
<li>Managing different python versions</li>
<li>Managing different environments</li>
<li>Installing python packages</li>
<li>Compiling and installing non-python packages (think OpenSSL, CUDA drivers, etc.)</li>
</ul>
<h4 id="what-are-anaconda-and-miniconda">What are anaconda and miniconda?</h4>
<p>Anaconda and miniconda are different distributions of the conda tools. Miniconda aims to be as minimal as possible – it installs just python and the conda tool. Anaconda additionally installs 160+ packages often used in data science workflows.</p>
<p>If you want a tight control of the environment you run, I suggest installing miniconda and building the environment with a bottom-up approach.</p>
<h4 id="how-does-conda-interact-with-pip-and-other-tools">How does conda interact with pip and other tools?</h4>
<p>Conda is a very powerful tool. It tackles many problems, so it often overlaps with other tools along some axes. It is possible to make conda work with other tools (<a href="https://stackoverflow.com/questions/50546339/pipenv-with-conda">with pipenv, for example</a>), but it requires a deeper understanding of both tools and of python package loading, and it is not done very often.</p>
<p>There are two conda setups I’ve found reliable:</p>
<ul>
<li>Conda as all-in-one solution</li>
<li>Conda for environment management and installing binary packages, plus pip for python packages (<a href="https://www.anaconda.com/blog/using-pip-in-a-conda-environment">best practices for conda + pip</a>)</li>
</ul>
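<p>The second setup is usually expressed in an <code>environment.yml</code> along these lines (names and versions are illustrative):</p>

```yaml
# environment.yml — conda manages python + binary packages, pip the rest
name: myapp
channels:
  - conda-forge
dependencies:
  - python=3.8
  - cudatoolkit=10.2      # binary dependency, precompiled by conda
  - pip
  - pip:
      - -r requirements.txt   # pure-python deps from PyPI
```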
<h2 id="pipenv">Pipenv</h2>
<p><a href="https://github.com/pypa/pipenv">Pipenv</a> is a dev workflow tool, created by the author of the popular requests package. Apart from making the common workflows slick and managing the requirements file (Pipfile), pipenv tackles the following problems:</p>
<ul>
<li>Managing different python versions (through pyenv, if installed)</li>
<li>Managing different environments</li>
<li>Installing python packages</li>
<li>Environment reproducibility</li>
</ul>
<p>It loads packages from PyPI so it does not suffer from the same problem as Conda does.</p>
<p><img src="images/python-dependency-management/pipenv_first_install.png" alt="" title="Pipenv is really easy to use. First time you run pipenv install, it will create a virtual environment and set everything up for you. It knows which environment to use next time by the directory path." /></p>
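<p>A minimal <code>Pipfile</code>, maintained for you by <code>pipenv install</code>, looks roughly like this (contents are illustrative):</p>

```toml
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
requests = "*"

[requires]
python_version = "3.8"
```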
<h4 id="how-does-pipenv-work-with-pip-and-other-tools">How does pipenv work with pip and other tools?</h4>
<p>Pipenv is a wrapper around pip and several other tools, meant to bring all the jobs under one umbrella. Installing packages with pip within a pipenv environment will work, but it will not automatically record them in the Pipfile and Pipfile.lock.</p>
<h2 id="poetry">Poetry</h2>
<p><a href="https://python-poetry.org/">Poetry</a> - “python packaging and dependency management made easy”. Poetry is most similar to pipenv and they often compete for users. Main problems poetry is tackling are:</p>
<ul>
<li>Managing different environments</li>
<li>Installing python packages</li>
<li>Environment reproducibility</li>
<li>Packaging and publishing python packages</li>
</ul>
<p>You can see that it’s not that different from Pipenv. It’s recommended to <a href="https://python-poetry.org/docs/managing-environments/">use it with pyenv</a>. Once you do that, it tackles all the problems pipenv does, but also helps with creating python packages and publishing them to PyPI.</p>
<p><img src="images/python-dependency-management/poetry_new.png" alt="" title="Poetry is more opinionated than pipenv. E.g., `poetry new` will create a minimal project structure. After that point, they are very similar." /></p>
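<p>The single file poetry maintains is <code>pyproject.toml</code>; a freshly generated one looks roughly like this (contents are illustrative):</p>

```toml
[tool.poetry]
name = "myapp"
version = "0.1.0"
description = ""
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.8"
requests = "^2.24"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```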
<h4 id="how-does-poetry-interact-with-other-tools">How does poetry interact with other tools?</h4>
<p>Poetry complements pyenv, and together they form a complete solution for managing your workflows. As with pipenv, it uses PyPI for installing packages, so there is no need to use pip directly once you start using poetry.</p>
<h4 id="pipenv-or-poetry">Pipenv or poetry?</h4>
<p>If you wonder why there are two very similar tools, you’re not alone. The main technical difference is the way they resolve packages. Dependency resolution is actually a very difficult problem, and poetry is superior in that dimension. When you install a new package, it figures out faster what exactly it has to do, and it generally handles complex dependency graphs more gracefully.</p>
<p>My general advice is that you’ll be fine with either, just pick one if someone hasn’t done that already for the project you’re working on.</p>
<h2 id="docker">Docker</h2>
<p><a href="https://www.docker.com/">Docker</a> has nothing to do with python dependency management, but people often talk about it in the same context so it’s definitely worth exploring what it does.</p>
<p>Docker is a tool to create, run, and manage containers. You can think of containers as very lightweight virtual machines. There is no hardware virtualization, but they are well isolated from the rest of your operating system. It was created as a general solution to package production software and run it in a reproducible, isolated way in the cloud.</p>
<p>You can run any of the tools I’ve explained in a Docker container. The nice thing about Docker is that the isolation it gives you dodges several problems. For example, the usual setup is that you run each app in a different container. That means you can install different python versions in them and they won’t know about each other. Also, there’s no need for any virtual environment management since apps are isolated by design.</p>
<p>Docker is a great innovation that happened to the way we run software in production, but I don’t recommend it as the solution for python dependency management problems on dev machines.</p>
<p>There are several problems people struggle with when using Docker for dev environments:</p>
<ul>
<li>It takes a significant performance hit on Windows and Mac OS</li>
<li>There is far more to learn than just basic conda/pipenv/poetry commands</li>
<li>Setting up IDEs to discover and debug app dependencies in Docker containers is often not trivial, which makes development more difficult</li>
<li>Installing libraries that deeply link with the underlying system (like CUDA drivers) can become quite tricky</li>
</ul>
<p><img src="images/python-dependency-management/dockerfile.png" alt="" title="Docker is completely agnostic of python or package management tools. This is an example of a Dockerfile starting with base Python 3.6.3 image. Inside docker container, you can really use any of the solutions above. People often use just pip to install packages." /></p>
<h2 id="all-solutions-compared">All solutions compared</h2>
<div class="python-dependency-management-table">
<table>
<tbody>
<tr>
<td> </td>
<td>Installing python packages</td>
<td>Installing non-python packages</td>
<td>Managing python versions</td>
<td>Managing virtual environments</td>
<td>Environment reproducibility</td>
</tr>
<tr>
<td>pip</td>
<td>✅</td>
<td><span class="red">✖</span>*</td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>venv</td>
<td> </td>
<td> </td>
<td> </td>
<td>✅</td>
<td> </td>
</tr>
<tr>
<td>piptools</td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td>✅</td>
</tr>
<tr>
<td>pyenv</td>
<td> </td>
<td> </td>
<td>✅</td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>conda</td>
<td>✅</td>
<td>✅</td>
<td>✅*</td>
<td>✅</td>
<td> </td>
</tr>
<tr>
<td>pipenv (+pyenv)</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
</tr>
<tr>
<td>poetry (+pyenv)</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
<td>✅</td>
</tr>
<tr>
<td>Docker</td>
<td>*</td>
<td>*</td>
<td>*</td>
<td>*</td>
<td>✅</td>
</tr>
</tbody>
</table>
<div class="table-notes">
<p>* Pip note: pip doesn’t handle installing non-python packages, but <a href="https://pythonwheels.com/">pip wheels</a> removed the need to compile packages locally for most libraries, on most architectures</p>
<p>* Conda note: Even though conda handles installing non-python packages, it can’t replace your system package manager (yum, apt-get) completely. Running your software on platforms like EC2 will still require installing some packages outside of conda.</p>
<p>* Docker note: since Docker is very much agnostic of Python, you need some other tool inside your container to do these jobs.</p>
</div>
</div>
<h2 id="usual-setups">Usual setups</h2>
<h4 id="unix-style-pyenv--pip--venv--pip-tools">Unix style: pyenv + pip + venv + pip-tools</h4>
<p>Composable set of tools, each tool solving one problem. I highly recommend this setup for two main reasons:</p>
<ul>
<li>It is composable. You can start from plain requirements.txt and add tools as you decide to solve other problems from the table above;</li>
<li>It is based on pip, which is installed everywhere and is the standard for installing packages.</li>
</ul>
<h4 id="pipenv--pyenv">Pipenv (+ pyenv)</h4>
<p>Easy to learn, all-in-one setup for managing main problems around dependency management.</p>
<h4 id="poetry--pyenv">Poetry (+ pyenv)</h4>
<p>Same as pipenv, it brings a lot to the table with no major drawbacks.</p>
<h4 id="conda-alone">Conda alone</h4>
<p>Some people use conda alone. The main problem with this setup is that some libraries are not available in conda channels, so you have to resort to using conda + pip.</p>
<h4 id="conda--pip">Conda + pip</h4>
<p>Common setup, using conda for python version management, virtual environment management, and installing binary dependencies. Pip is used for installing python packages. Unfortunately, as I’ve mentioned, it has its own problems, and conda in general is a very bulky tool.</p>
<p>This is often used because conda integrates very well with Jupyter through the <a href="https://github.com/Anaconda-Platform/nb_conda_kernels">nb_conda_kernels</a> extension. I use it only when I have to use conda in an environment somebody else has set up (like SageMaker).</p>
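<p>As a sketch of this setup, a conda <code>environment.yml</code> can pin the python version and binary dependencies through conda and list pure-python packages under the <code>pip:</code> section. All names and versions below are made up for illustration:</p>

```yaml
name: my-project
channels:
  - defaults
dependencies:
  - python=3.8          # conda manages the python version
  - cudatoolkit=10.1    # binary, non-python dependency from conda
  - pip
  - pip:                # python packages installed by pip
      - pandas==1.0.3
      - some-pypi-only-package==0.4.2
```

<p>Recreating the environment with <code>conda env create -f environment.yml</code> then installs the conda and pip parts in one step.</p>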
<div class="convertkit-form">
<script async="" data-uid="af03941f9e" src="https://prodigious-builder-7392.ck.page/af03941f9e/index.js"></script>
</div>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:conda-compiling-challenges" role="doc-endnote">
<p>It’s 2020, but reliably compiling software from source, on different computer setups, is still an unsolved problem. There are no good ways to manage different versions of compilers, different versions of libraries needed to compile the main program etc. <a href="#fnref:conda-compiling-challenges" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
I briefly describe each tool (pip, venv, pip-tools, pyenv, conda, pipenv, poetry and Docker), why it’s created and the problems it’s tackling. You can find a table summarizing all the information and the usual setups people use.

Start structuring your code like a software engineer
2020-05-24T00:00:00+00:00
https://modelpredict.com/start-structuring-code-the-right-way

<p>You’ve entered the data scientist role and nobody told you that you actually have to write code like a software engineer? It’s classic. CS and data engineers complain that your code isn’t written “the right way”<sup id="fnref:therightway" role="doc-noteref"><a href="#fn:therightway" class="footnote" rel="footnote">1</a></sup>. Worst of all, you know they’re onto something, but nobody can help you apart from saying you’ll have to learn software engineering.</p>
<p>I’ll show you a design technique that will help you connect your presentational skills with creating better code structure. Reading a lot of code will help you get better, but you don’t have to get another degree to start structuring your code better.</p>
<p>I’ll also show you how to use that technique on real code examples I’ve found in production.</p>
<h2 id="progressive-disclosure">Progressive disclosure</h2>
<p>Progressive disclosure has nothing to do with software engineering. It’s coming from interaction design. Here is the best explanation I’ve found.</p>
<p><em>“Progressive disclosure is an interaction design pattern that sequences information and actions across several screens (e.g., a step-by-step signup flow). The purpose is to lower the chances that users will feel overwhelmed by what they encounter. By disclosing information progressively, interaction designers reveal only the essentials, and help users manage the complexity of feature-rich websites or applications.”
(<a href="https://www.interaction-design.org/literature/topics/progressive-disclosure">Interaction Design Foundation</a>)</em></p>
<p>It doesn’t say anything about code, but keeping this technique in mind while structuring your code will help you end up with more readable code. It will help you write code that:</p>
<ul>
<li>Reveals information from abstract to specific. It does not overwhelm the reader.</li>
<li>Reads nicely, like a story, allowing you to jump into specific parts if you’re just interested in details.</li>
</ul>
<p>The nicely structured code reads like a good reference book. You read titles and descriptions of all chapters and you know what the book is about. If you want to know more about a specific topic, you read the chapter on that specific topic. Or part of the chapter. You don’t have to read the chapter about RedShift’s execution engine if you just want to know which windowing functions are supported.</p>
<p>It’s not a completely new skill for you, though. You’re already doing this when presenting your analyses.</p>
<p>Imagine a presentation structured the same way a lot of data science code is structured. A linear report on the process.</p>
<ul>
<li>Slide 1: For this analysis, we’re using RedShift. It has a lot of tables.</li>
<li>Slide 2: Unfortunately, we don’t have a record of sent invoices, but here is the SQL query I’ve used to approximate that.</li>
<li>Slide 3: Here is the second page of the SQL query. Notice that I had to use 4 joins to do that. It took a while to get the data.</li>
<li>Slide 4: Here are results shown in the table.</li>
<li>Slide 5: Here is the summary of our churn in one specific segment of our customers.</li>
<li>Slide 6: Here is the chart of our net new revenue.</li>
</ul>
<p>It just doesn’t make sense. No one structures their presentations like that.</p>
<p>Instead, you start from the top, and then drill into specific areas. If you don’t have enough time, you can skip sections and still deliver the main message.</p>
<p>A better structure:</p>
<ul>
<li>Slide 1: Here is our net new revenue chart</li>
<li>Slide 2: Let’s break it down by expansion, contraction, churn, and new customers.</li>
<li>Slide 3: Here is the breakdown of contraction by reason and customer segments.</li>
<li>…</li>
<li>Slide n: Thank you (and small link to technical implementation of analysis in case someone wants to drill further themselves).</li>
</ul>
<h2 id="structuring-the-data-science-code-with-progressive-disclosure-in-mind">Structuring the data science code with progressive disclosure in mind</h2>
<p>When the structure doesn’t naturally come out of the code I write, I ask myself a set of questions to help me find the structure. They are by no means comprehensive nor tackle completely different angles, but they help me remove common problems that make the code more difficult to read.</p>
<h3 id="q-how-can-i-describe-the-piece-of-code-to-somebody-else">Q: How can I describe the piece of code to somebody else?</h3>
<p>Let’s take a look at this piece of code. Try to get a general feeling of what it’s doing, not the details.</p>
<pre><code class="language-python">def predict(execution_date, templates_dict, **context):
    db = PostgresHook(postgres_conn_id='redshift')

    query_annual_deals = 'SELECT * FROM dw.ltv_annual_deals'
    annual_deals = db.get_pandas_df(query_annual_deals)
    logging.info('Got %d results for annual deals', len(annual_deals))
    annual_deals.sort_values(by=['app_id'], inplace=True)
    annual_deals.reset_index(drop=True, inplace=True)
    annual_deals.set_index('app_id', inplace=True)

    query_transactions = 'SELECT * FROM dw.ltv_invoices'
    transactions = db.get_pandas_df(query_transactions)
    logging.info('Got %d results', len(transactions))

    # sorts and indexes dataframe
    transactions.sort_values(by=['app_id', 'invoice_date'], inplace=True)
    transactions.reset_index(drop=True, inplace=True)

    # truncates 'cohort_date' on month into 'cohort'
    transactions['cohort'] = transactions['cohort_date'].apply(lambda d: '%04d-%02d' % (d.year, d.month))

    # calculates 'period', which is month difference between 'cohort_date' and 'invoice_date'
    transactions['period'] = (transactions['invoice_date'].dt.to_period('M') - transactions['cohort_date'].dt.to_period('M')).apply(lambda p: p.n)

    # checks if this 'period' has completed for all members of 'cohort'
    transactions['is_period_complete'] = transactions['invoice_date'].dt.to_period('M') < pd.to_datetime('now').to_period('M')

    # ...processing logic
</code></pre>
<p>One possible description of the code is “it loads annual deals, then sorts the data frame and sets the index. Then it loads all the invoices, then sorts loaded invoices and sets the index.”
<br />It’s a very robotic and unnatural one, but the structure is here to point something out.</p>
<p>“It loads annual deals, <strong>then</strong> sorts the data frame and sets the index. <strong>Then</strong> it loads all the invoices, <strong>then</strong> sorts loaded invoices and sets the index. <strong>Then</strong> it calculates the cohort and sets period information”.</p>
<p>Let’s rewrite the code to read like the sentence above.</p>
<pre><code class="language-python">def predict(execution_date, templates_dict, **context):
    annual_deals = load_annual_deals_from_redshift()
    sort_and_set_annual_deals_index(annual_deals)

    transactions = load_transactions_from_redshift()
    sort_and_set_transactions_index(transactions)
    set_cohort_and_period_information(transactions)

    # ...processing logic
</code></pre>
<p>This reads better already. If someone wants to understand how I load annual deals, they can read the function implementation. Same for sorting and setting indices.</p>
<p>Different explanations of what the code does will transform the same code into different shapes, but you’ll likely come up with a similar explanation.</p>
<h3 id="q-can-i-group-some-operations-together-is-it-a-detail-or-very-important-piece-of-the-story">Q: Can I group some operations together? Is it a detail or very important piece of the story?</h3>
<p>Another question that helps me is whether I can group things together. Same as designers would group actions operating on the same product area, we can group together the code that’s contributing to the same more abstract process.</p>
<p>For example, sorting annual deals and setting the index looks like a detail about the loaded dataframe. It can be done while loading from redshift.</p>
<p>The structure I’d use for that code is</p>
<pre><code class="language-python">def load_annual_deals_df():
    db = PostgresHook(postgres_conn_id='redshift')
    query_annual_deals = 'SELECT * FROM dw.ltv_annual_deals'
    annual_deals = db.get_pandas_df(query_annual_deals)
    annual_deals.sort_values(by=['app_id'], inplace=True)
    annual_deals.reset_index(drop=True, inplace=True)
    annual_deals.set_index('app_id', inplace=True)
    return annual_deals

def load_transaction_deals_df():
    db = PostgresHook(postgres_conn_id='redshift')
    query_transactions = 'SELECT * FROM dw.ltv_invoices'
    transactions = db.get_pandas_df(query_transactions)
    transactions.sort_values(by=['app_id', 'invoice_date'], inplace=True)
    transactions.reset_index(drop=True, inplace=True)
    return transactions

def predict(execution_date, templates_dict, **context):
    annual_deals = load_annual_deals_df()
    logging.info('Got %d results for annual deals', len(annual_deals))

    transactions = load_transaction_deals_df()
    logging.info('Got %d results for transactions', len(transactions))

    set_cohort_and_period_information(transactions)

    # ...processing logic
</code></pre>
<p>You can see that I don’t use <code>load_annual_deals_from_redshift</code> and <code>sort_and_set_annual_deals_index</code>. Why is that?</p>
<ul>
<li><code>load_annual_deals_df</code> and <code>load_transaction_deals_df</code> are still short and easy to read. They group together everything around loading the dataframe and preparing it for the rest of the “predict” function. You can explain it as “it loads the annual deals from redshift and returns it in a dataframe.”</li>
<li>There’s a cost of extracting many small functions. Machines won’t care that much, but your reader will have to jump between functions. I reckon this version doesn’t overwhelm the reader. If the code starts piling up in that function, I’ll look into breaking it up.</li>
</ul>
<h4 id="can-you-go-too-abstract">Can you go too abstract?</h4>
<p>You can apply this rule at different zoom levels and go abstract all the way. For example, you can structure it like</p>
<pre><code class="language-python">def predict():
annual_deals, transactions = load_data()
# ...processing
</code></pre>
<p>That’s still readable. We know that annual deals and transactions are results of loading the data.</p>
<p>What doesn’t bring much value is this:</p>
<pre><code class="language-python">def predict():
annual_deals, transactions = load_data()
process(annual_deals, transactions)
</code></pre>
<p>Process is too generic for a function name. It doesn’t mean anything. Programs are all about processing the data. It just added another step while exploring the code. Imagine you’ve clicked on a signup button and the first step of the flow was “This is a signup flow. You will have to type in your email address in the next step.”</p>
<h3 id="q-do-i-repeat-this-code-multiple-times">Q: Do I repeat this code multiple times?</h3>
<p>Another helpful question is whether the code is written multiple times. If it is, it’s possible that it represents something more generic that’s worth extracting.</p>
<p>An obvious example from the code I showed you is loading the dataframe from RedShift. Every time we load it, we write the following three lines:</p>
<pre><code class="language-python">db = PostgresHook(postgres_conn_id='redshift')
query_annual_deals = 'SELECT * FROM dw.ltv_annual_deals'
annual_deals = db.get_pandas_df(query_annual_deals)
</code></pre>
<p>We can extract it into a function:</p>
<pre><code class="language-python">def load_from_redshift(query):
    db = PostgresHook(postgres_conn_id='redshift')
    return db.get_pandas_df(query)

annual_deals = load_from_redshift('SELECT * FROM dw.ltv_annual_deals')
</code></pre>
<h3 id="q-does-this-belong-here-am-i-doing-something-similar-in-other-places">Q: Does this belong here? Am I doing something similar in other places?</h3>
<p>When I experiment in Jupyter notebooks, I don’t know what exactly I’ll need when I start writing the code. Here’s an example I’ve found:</p>
<pre><code class="language-python">events = load_from_redshift(f"SELECT * FROM events WHERE 1 = 1 AND {extra_conditions}")
events['metadata_parsed'] = events['metadata'].apply(json.loads)
events['suggestion_type'] = events['metadata_parsed'].apply(extract_type_from_metadata)
# some other logic that does not change events['metadata_parsed']
events['trigger_type'] = events['metadata_parsed'].apply(extract_trigger_type_from_metadata)
</code></pre>
<p>Extracting the trigger type and extracting the suggestion type are very similar operations. They’re both pulling something out of the metadata. In this case, I would not extract them into a function yet, but I’d definitely move them closer to each other.</p>
<pre><code class="language-python">events = load_from_redshift(f"SELECT * FROM events WHERE 1 = 1 AND {extra_conditions}")
events['metadata_parsed'] = events['metadata'].apply(json.loads)
events['suggestion_type'] = events['metadata_parsed'].apply(extract_type_from_metadata)
events['trigger_type'] = events['metadata_parsed'].apply(extract_trigger_type_from_metadata)
# some other logic that does not change events['metadata_parsed']
</code></pre>
<h3 id="q-is-this-one-liner-easy-to-read">Q: Is this one-liner easy to read?</h3>
<p>A common pattern I see in DS code is packing a lot of logic into <code>df[col_name].apply(lambda x: ...)</code>. It often forces the reader to read the implementation and then figure out the meaning from it.</p>
<p>Here’s an example from the code you’ve already seen:</p>
<pre><code class="language-python"># calculates 'period', which is month difference between 'cohort_date' and 'invoice_date'
transactions['period'] = (transactions['invoice_date'].dt.to_period('M') - transactions['cohort_date'].dt.to_period('M')).apply(lambda p: p.n)
</code></pre>
<p>Note that it comes with a comment. The comment is here to help you understand what’s happening. It allows you to skip reading the implementation of it, which is good (progressive disclosure!).</p>
<p>There is a better way to do the same, though</p>
<pre><code class="language-python">def months_between(df, col1, col2):
    return (df[col1].dt.to_period('M') - df[col2].dt.to_period('M')).apply(lambda p: p.n)

transactions['period'] = months_between(transactions, 'invoice_date', 'cohort_date')
</code></pre>
<p>There is no need for the comment anymore. If a reader wants to read the implementation, they will find the function and read it.</p>
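<p>If you want to convince yourself the extracted function behaves the way the comment promised, here is a small self-contained check. The sample dates are made up; it assumes pandas is available:</p>

```python
import pandas as pd

def months_between(df, col1, col2):
    # whole-month difference between two datetime columns
    return (df[col1].dt.to_period('M') - df[col2].dt.to_period('M')).apply(lambda p: p.n)

transactions = pd.DataFrame({
    'invoice_date': pd.to_datetime(['2020-03-15', '2020-05-01']),
    'cohort_date': pd.to_datetime(['2020-01-31', '2020-05-20']),
})
print(months_between(transactions, 'invoice_date', 'cohort_date').tolist())  # [2, 0]
```

<p>Note that truncating to monthly periods ignores the day of the month, which is exactly what a cohort analysis usually wants.</p>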
<hr />
<div class="convertkit-form">
<script async="" data-uid="af03941f9e" src="https://prodigious-builder-7392.ck.page/af03941f9e/index.js"></script>
</div>
<h2 id="where-to-go-from-here">Where to go from here?</h2>
<p>I’ve explained the main principle behind structuring the code to read better. Having that principle in mind, it’s a “rinse and repeat” exercise. Write the code, read it, and refactor it by asking these questions. Find the inspiration in code someone else has written. Try using the same patterns in your code.</p>
<p><a href="/jupyter-writing-production-code-step-one">Extract some functions into separate .py files</a> to lighten the notebook. Don’t force the reader to read all extracted methods before it comes to the main analysis code.</p>
<p>And don’t forget that we engineers are often opinionated about code. Sometimes we’ll argue because we prefer different styles, but we’ll have no good arguments to defend it. :)</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:therightway" role="doc-endnote">
<p>I often hear phrases like “the right way”. Such thing means nothing without the context. There are some properties that are good to maintain as long as the environment allows — code readability is one of them. <a href="#fnref:therightway" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
You've entered the data scientist role and nobody told you that you actually have to write code like a software engineer? It's classic. CS and data engineers complain that your code isn't written "the right way". Worst of all, you know they're onto something, but nobody can help you apart from saying you'll have to learn software engineering.

Transform exploratory Jupyter notebook into production friendly code: step one
2020-05-20T00:00:00+00:00
https://modelpredict.com/jupyter-start-writing-production-friendly-code

<p>Yes, you already know Jupyter notebooks enable bad code design. They often have some hidden state, you execute cells out of order so notebooks are often not even runnable from scratch. And engineers running production complain. <strong>They often throw your code away and rewrite everything from scratch because it’s not “production code.”</strong></p>
<p>Still, we both know it’s hard to give up on notebooks. They are too useful for exploratory analyses, they help us move faster. Isn’t that a strong enough argument?</p>
<p>I have bad news and good news for you.</p>
<p>Bad news first—they’re right. I know it because I belong to both camps. I’m often using Jupyter Notebooks to experiment on new features or as an interactive console on steroids, but also I have to put all the code in production and make it tick.</p>
<p>Relying on Jupyter notebooks to contain the code is difficult and fragile. The interactive, stateful notebook design breaks some fundamentals of how software is developed.</p>
<p>The good news is that the code you write in notebooks is not tainted. It’s just written without testability and modularity in mind.</p>
<p>Here is a simple way to gradually change that! I’ll also show you how I do it with VS Code and switch between the two very fast.</p>
<h3 id="1-recognize-pieces-of-code-that-can-be-isolated-as-a-separate-unit">1. Recognize pieces of code that can be isolated as a separate unit</h3>
<p>This is the most difficult part as it requires some experience and domain knowledge. My approach is to start from the bottom, looking at pieces of code that are used in multiple places. Another question that helps is which functions I wish I had implemented by somebody else or a library.</p>
<p>For example, I’ve recently been generating some models based on a dataframe with a bunch of URLs. While preprocessing the data, I wanted to “compress” URLs into URL groups and save the result to a new column. That seemed like a good piece to start with.</p>
<h3 id="2-isolate-that-piece-into-a-function">2. Isolate that piece into a function</h3>
<p>Once you know what to isolate, create a new function and extract the logic there. Make sure all places in your notebook call the function instead of inlining the logic.</p>
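<p>For the URL example above, the isolated function might look something like this. The grouping rule here is a made-up illustration, not the one from my actual project:</p>

```python
import re

def url_group(url):
    # Collapse numeric path segments so similar URLs fall into the same group,
    # e.g. /users/42/orders/7 and /users/9/orders/13 -> /users/<id>/orders/<id>
    return re.sub(r'/\d+', '/<id>', url)

urls = ['/users/42/orders/7', '/users/9/orders/13']
print(sorted({url_group(u) for u in urls}))  # ['/users/<id>/orders/<id>']
```

<p>Once the logic lives in a function, every cell that needs a URL group calls <code>url_group</code> instead of repeating the regex inline.</p>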
<h3 id="3-move-the-function-into-a-py-file">3. Move the function into a .py file</h3>
<p>Cut the function from the notebook and paste it into the .py file. Import the file and call the function from there.</p>
<h3 id="4-turn-on-autoreload-in-your-notebook">4. Turn on autoreload in your notebook</h3>
<p>You’re experimenting and it’s possible you’ll want to change this function. Turn on autoreload in your notebook.</p>
<pre><code class="language-python">%load_ext autoreload
%autoreload 2
</code></pre>
<p>Every time you change your .py file, the notebook will get the new code.</p>
<h3 id="5-bonus-points-write-simple-tests-for-your-function">5. Bonus points: write simple tests for your function</h3>
<p>Nothing really prevents you from adding simple tests in your notebook. The support from testing frameworks is not great, but what I often do is use plain “assert.”</p>
<p>Example:</p>
<pre><code class="language-python">def square(a):
    return a**2

assert square(2) == 4
assert square(3) == 9
</code></pre>
<p>There are two benefits from adding tests early, even in this crude format:</p>
<ul>
<li>It’s often easy to translate them into the testing framework you use.</li>
<li>It helps you structure your functions better.</li>
</ul>
<p>I don’t have a hard rule on whether I keep these tests in a notebook or in the .py file. I’m going to shape them to <a href="https://docs.python.org/3/library/unittest.html">unittest</a> framework anyway. Moving it to the .py file makes it easier to review the code, though.</p>
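<p>As an example of that translation, the plain asserts above map almost one-to-one onto a unittest test case:</p>

```python
import unittest

def square(a):
    return a**2

class TestSquare(unittest.TestCase):
    # the two plain asserts become assertEqual calls
    def test_square(self):
        self.assertEqual(square(2), 4)
        self.assertEqual(square(3), 9)
```

<p>Once this lives in a .py file, <code>python -m unittest</code> will discover and run it.</p>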
<h2 id="how-do-i-switch-between-jupyter-notebook-ui-and-vs-code">How do I switch between Jupyter Notebook UI and VS Code?</h2>
<p>This process would be very painful if switching between .py file and Jupyter notebook was slow.</p>
<p>I’ve found using a combination of VS Code for .py files and Jupyter Notebook UI in the browser very efficient. VS Code has support for Jupyter Notebooks, too, but I haven’t found myself equally productive in it yet. In VS Code, I open the directory that contains notebooks and other files (if on a remote machine, I use Remote Development). Here is a quick video of it (full screen for better quality).</p>
<iframe width="680" height="382" src="https://www.youtube.com/embed/6BEfGAxwjtA" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p>If you’re using PyCharm, I’ve heard it supports something similar.</p>
<h2 id="wrapping-it-up">Wrapping it up</h2>
<p>Writing the production ready code is not what you think of first when you’re experimenting and the code and requirements change a lot. But that does not mean you have to start from scratch after the experimentation phase ends.</p>
<p>Instead, you can use the best of both worlds and start slowly transitioning towards production friendly code as you’re leaving the experimentation phase. The first step in doing that is extracting functions into .py files.</p>
<div class="convertkit-form">
<script async="" data-uid="af03941f9e" src="https://prodigious-builder-7392.ck.page/af03941f9e/index.js"></script>
</div>
Yes, you already know Jupyter notebooks enable bad code design. They often have some hidden state, you execute cells out of order so notebooks are often not even runnable from scratch. And engineers running production complain. They often throw your code away and rewrite everything from scratch because it’s not “production code.”

SageMaker: install Jupyter extensions in restart-proof way
2020-05-12T00:00:00+00:00
https://modelpredict.com/sagemaker-jupyter-extensions-restart-proof

<p><em>“Every time my notebook shuts down and restarts, I lose notebook extensions and have to reinstall them from the terminal”, my teammate said. Eventually, he gave up reinstalling them.</em></p>
<p>This might be fine if you’re using SageMaker occasionally, but if you’re using it every day like he and I do, it’s a bummer. It will slow you down and you won’t get to use the Jupyter setup you like.</p>
<p>This guide will show you how to <strong>install Jupyter</strong> (and JupyterLab) <strong>extensions and make them stay after notebook instance restarts</strong>. Like <a href="/tag/sagemaker">other guides on SageMaker</a>, it’ll take just a few minutes to set it up.</p>
<h2 id="setup">Setup</h2>
<p>The setup does not really make Jupyter extensions “stay” after notebook instance restarts. It reinstalls them every time your instance boots. The effect is the same, though: you get your favorite Jupyter extensions.</p>
<h3 id="preparing-lifecycle-configuration">Preparing lifecycle configuration</h3>
<p>To have a place to inject the installation commands, we first have to make sure there is a lifecycle configuration attached to the notebook instance. No worries, here are steps to help with that.</p>
<ol class="tutorial-steps">
<li>
Login to AWS console
</li>
<li>
Stop your notebook instance (and wait for the instance to stop)
</li>
<li>
Make sure you have <a href="https://stedolan.github.io/jq/download/">jq tool installed</a>
</li>
<li>
<p>Copy the following code into your terminal (on your computer, not SageMaker).</p>
<p>If you know there is one and you know its name, you can just fill the <code>CONFIGURATION_NAME</code> variable and skip to <a href="#configuring-the-extensions-to-install">configuring the extensions to install</a>.</p>
<div class="code-snippet line-numbers language-bash"><pre><code data-lang="bash"># fill in the instance name here
INSTANCE_NAME="team-ml-mario"</code></pre></div>
<div class="code-snippet line-numbers language-bash"><pre><code data-lang="bash">CONFIGURATION_NAME=$(aws sagemaker describe-notebook-instance --notebook-instance-name "${INSTANCE_NAME}" | jq -e '.NotebookInstanceLifecycleConfigName | select (.!=null)' | tr -d '"')
echo "Configuration \"$CONFIGURATION_NAME\" attached to notebook instance $INSTANCE_NAME"</code></pre></div>
<div class="code-snippet line-numbers language-bash"><pre><code data-lang="bash">if [[ -z "$CONFIGURATION_NAME" ]]; then
# there is no attached configuration name, create a new one
CONFIGURATION_NAME="better-sagemaker"
echo "Creating new configuration $CONFIGURATION_NAME..."
aws sagemaker create-notebook-instance-lifecycle-config \
--notebook-instance-lifecycle-config-name "$CONFIGURATION_NAME" \
--on-start Content=$(echo '#!/usr/bin/env bash'| base64) \
--on-create Content=$(echo '#!/usr/bin/env bash' | base64)
# attaching lifecycle configuration to the notebook instance
echo "Attaching configuration $CONFIGURATION_NAME to ${INSTANCE_NAME}..."
aws sagemaker update-notebook-instance \
--notebook-instance-name "$INSTANCE_NAME" \
--lifecycle-config-name "$CONFIGURATION_NAME"
fi</code></pre></div>
</li>
</ol>
<p>That’s it, we just have to define which extensions to install.</p>
<h3 id="configuring-the-extensions-to-install">Configuring the extensions to install</h3>
<p>Now we have to define the code that installs your extensions on every start.</p>
<ol class="tutorial-steps">
<li>
<p>Copy the code and fill the <code>EXTENSION_NAME</code> variable with the name of the Jupyter extension and <code>PIP_PACKAGE_NAME</code> with the name of the pip package.</p>
<p>For example, git extension for JupyterLab is named “jupyterlab_git” and pip package name is “jupyterlab-git”.</p>
<div class="code-snippet line-numbers language-bash"><pre><code data-lang="bash">export PIP_PACKAGE_NAME="jupyterlab-git"
export EXTENSION_NAME="jupyterlab_git"</code></pre></div>
<div class="code-snippet line-numbers language-bash"><pre><code data-lang="bash">echo "Downloading on-start.sh..."
# save the existing on-start script into on-start.sh
aws sagemaker describe-notebook-instance-lifecycle-config --notebook-instance-lifecycle-config-name "$CONFIGURATION_NAME" | jq '.OnStart[0].Content' | tr -d '"' | base64 --decode > on-start.sh</code></pre></div>
<div class="code-snippet line-numbers language-bash"><pre><code data-lang="bash">echo "Adding extension install to on-start.sh..."
echo '' >> on-start.sh
echo '# install jupyter extension' >> on-start.sh
echo "export PIP_PACKAGE_NAME=\"${PIP_PACKAGE_NAME}\"" >> on-start.sh
echo "export EXTENSION_NAME=\"${EXTENSION_NAME}\"" >> on-start.sh
echo 'curl https://raw.githubusercontent.com/mariokostelac/sagemaker-setup/master/scripts/install-server-extension/on-start.sh | bash' >> on-start.sh</code></pre></div>
<div class="code-snippet line-numbers language-bash"><pre><code data-lang="bash">echo "Uploading on-start.sh..."
# update the lifecycle configuration config with updated on-start.sh script
aws sagemaker update-notebook-instance-lifecycle-config \
--notebook-instance-lifecycle-config-name "$CONFIGURATION_NAME" \
--on-start Content="$(cat on-start.sh | base64)"</code></pre></div>
</li>
<li>Repeat step 1 for every extension you want to install.</li>
</ol>
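<p>The snippets above shuttle the on-start script through <code>base64</code> because the lifecycle configuration’s <code>Content</code> field must be base64-encoded. A minimal local sketch of that round trip (no AWS calls, purely illustrative):</p>

```shell
# Encode a lifecycle script the way the upload step does,
# then decode it back the way the download step does.
script='#!/usr/bin/env bash
echo "hello from on-start"'

encoded="$(printf '%s' "$script" | base64)"
decoded="$(printf '%s' "$encoded" | base64 --decode)"

[ "$decoded" = "$script" ] && echo "round-trip OK"
```

<p>If either direction mangled the script, the final comparison would fail, which is exactly the kind of silent breakage the quoting in the upload command is there to prevent.</p>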
<p>If you want to install nb extensions or JupyterLab extensions, use this <a href="/code_snippets/sagemaker/update_config_nb_extension.sh">code for nb extension</a> and this <a href="/code_snippets/sagemaker/update_config_nb_extension.sh">code for JupyterLab extensions</a>.</p>
<h2 id="why-do-my-jupyter-extensions-disappear-at-all">Why do my Jupyter extensions disappear at all?</h2>
<p>After all this effort, it’s fair to ask: why do these extensions disappear in the first place?</p>
<p>SageMaker persists only the files located in ~/SageMaker. Everything else is recreated from scratch every time your notebook instance boots.</p>
<p>Jupyter extensions don’t get installed there. They are installed into the JupyterSystemEnv conda environment, which lives outside that directory, so every restart wipes them.</p>
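<p>You can see the effect with a small local simulation (a sketch only; the directory names mimic SageMaker’s layout, where just the SageMaker directory survives a stop/start cycle):</p>

```shell
# Simulate a notebook instance restart: everything outside the
# persisted SageMaker/ directory is recreated from a fresh image.
root="$(mktemp -d)"
mkdir -p "$root/SageMaker" "$root/anaconda3/envs/JupyterSystemEnv"
echo "my work"   > "$root/SageMaker/notebook.ipynb"
echo "extension" > "$root/anaconda3/envs/JupyterSystemEnv/ext.py"

# "Restart": wipe everything except SageMaker/
find "$root" -mindepth 1 -maxdepth 1 ! -name SageMaker -exec rm -rf {} +

test -f "$root/SageMaker/notebook.ipynb" && echo "notebook survived"
test -f "$root/anaconda3/envs/JupyterSystemEnv/ext.py" || echo "extension gone"
```

<p>The notebook file survives the “restart”; the file inside the conda environment does not — which is why the install has to run in the on-start lifecycle script every time.</p>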
<hr />
<div class="convertkit-form">
<script async="" data-uid="af03941f9e" src="https://prodigious-builder-7392.ck.page/af03941f9e/index.js"></script>
</div>
<h2 id="troubleshooting">Troubleshooting</h2>
<h3 id="my-instance-is-failing-to-start-now">My instance is failing to start now</h3>
<p>It’s likely you’ve misspelled the pip package or the extension name. Follow the steps below to remove the installation code.</p>
<h3 id="uninstalling-extensions">Uninstalling extensions</h3>
<p>Follow these steps:</p>
<ol class="tutorial-steps">
<li>
<p>Run this to download the on-start.sh script:</p>
<div class="code-snippet line-numbers language-bash"><pre><code data-lang="bash">echo "Downloading on-start.sh..."
# save the existing on-start script into on-start.sh
aws sagemaker describe-notebook-instance-lifecycle-config --notebook-instance-lifecycle-config-name "$CONFIGURATION_NAME" | jq -r '.OnStart[0].Content' | base64 --decode > on-start.sh</code></pre></div>
</li>
<li>Open on-start.sh in your editor and remove the install lines for the extensions you do not want.</li>
<li>
<p>Run this to upload the changed on-start.sh:</p>
<div class="code-snippet line-numbers language-bash"><pre><code data-lang="bash">echo "Uploading on-start.sh..."
# update the lifecycle configuration config with updated on-start.sh script
aws sagemaker update-notebook-instance-lifecycle-config \
--notebook-instance-lifecycle-config-name "$CONFIGURATION_NAME" \
--on-start Content="$(cat on-start.sh | base64)"</code></pre></div>
</li>
</ol>
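<p>If you prefer not to edit the file by hand, a sed one-liner can strip the install block appended by the steps above (a sketch; it assumes the exact block layout this tutorial generates, and it removes every such block, not a specific one):</p>

```shell
# Recreate an on-start.sh with one install block appended,
# mirroring what the tutorial's update step produces.
cat > on-start.sh <<'EOF'
#!/usr/bin/env bash

# install jupyter extension
export PIP_PACKAGE_NAME="jupyterlab-git"
export EXTENSION_NAME="jupyterlab_git"
curl https://raw.githubusercontent.com/mariokostelac/sagemaker-setup/master/scripts/install-server-extension/on-start.sh | bash
EOF

# Delete everything from the marker comment through the curl line
sed -i.bak '/^# install jupyter extension$/,/install-server-extension/d' on-start.sh

grep -q jupyterlab-git on-start.sh || echo "install block removed"
```

<p>After the sed pass, re-upload on-start.sh with the same update-notebook-instance-lifecycle-config command as in step 3 above.</p>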