How can you trust a Python library?

As an IT professional with an open-source mindset, I use Python more than any other language in my daily work. Python is a natural choice for a broad spectrum of tasks, from system operations to machine learning. The reason is, obviously, the libraries: easy to install, well documented, community driven, and so on. But how can we trust a library? Is it enough to install it via pip, or should we read the entire source code ourselves? Should we run some black-box tests? Is it enough that it is the first result on Google, or that a couple of friends have sent a few PRs to the repository? We could add many more items to this checklist. Let’s find some answers.

First of all, how do you install libraries? I guess it depends heavily on your choice of OS. I have been using GNU/Linux distros for more than 10 years, and so far I have used the following methods to install Python libraries:
1. apt install python-library
2. pip install library
3. conda install -c channel library
4. git clone library / python setup.py install

Each of them represents another milestone of experience in my career. I learned not to install libraries system-wide on multi-user servers, to separate project dependencies, to build pipelines for one-click deployments, and much more.
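One way to separate project dependencies is a per-project virtual environment. The snippet below is a minimal sketch using the standard library’s venv module; the function name and the `.venv` directory are my own choices for illustration, not something the tools above prescribe.

```python
import venv
from pathlib import Path


def create_project_env(project_dir, with_pip=True):
    """Create an isolated environment in <project_dir>/.venv and return its path."""
    env_dir = Path(project_dir) / ".venv"
    # system_site_packages=False keeps system-wide libraries out of the environment.
    venv.create(env_dir, system_site_packages=False, with_pip=with_pip)
    return env_dir
```

Calling `create_project_env(".")` inside a project folder and then installing with that environment’s own pip keeps the project’s libraries away from the system-wide Python.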

Alongside teaching myself how to use Python libraries in the most reliable and most Pythonic way, I looked for answers to the main question in blogs, in forums, and on Reddit.

Almost one year ago, there was a serious incident involving the dateutil and jellyfish libraries.

Issue on GitHub: https://github.com/dateutil/dateutil/issues/984

A fake version of the dateutil library (python3-dateutil) on PyPI tried to import a fake version of the jellyfish library (jeIlyfish). The people behind this “hack” might be masterminds, but the hack itself is not sophisticated. They just replaced the first lowercase “l” in the name with an uppercase “I”, and the malicious code sent SSH and GPG keys to their server. The interesting part is that python3-dateutil is almost identical to dateutil, except that it tries to import jeIlyfish. The hack was discovered on the same day, and the malicious libraries were removed from PyPI. As you can see, if you intend to read the entire source code of a library, you need to do it for all of its dependencies as well.
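To get a feeling for how much code “all of its dependencies” really means, here is a rough sketch that walks the declared dependencies of an installed package with importlib.metadata. The helper names are mine, and the sketch ignores environment markers and extras, so it may list more packages than are actually installed.

```python
import re
from importlib import metadata

# Matches the distribution name at the start of a requirement string
# such as "requests>=2.0" or "six ; python_version < '3'".
_NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def requirement_name(requirement):
    """Extract a normalized package name from a requirement string."""
    match = _NAME_RE.match(requirement)
    if not match:
        return None
    # Treat runs of "-", "_" and "." as equivalent and ignore case.
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def installed_requires(name):
    """Return the declared requirements of an installed package, or []."""
    try:
        return metadata.requires(name) or []
    except metadata.PackageNotFoundError:
        return []


def transitive_dependencies(package, requires_fn=installed_requires):
    """Collect every package reachable through declared requirements."""
    root = requirement_name(package)
    seen = set()
    stack = [root]
    while stack:
        current = stack.pop()
        for req in requires_fn(current):
            dep = requirement_name(req)
            if dep and dep != root and dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return sorted(seen)
```

For example, `transitive_dependencies("requests")` would list everything you would have to read in addition to requests itself, assuming requests is installed in the current environment.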

The Python Package Index (PyPI) is a repository of software for the Python programming language. PyPI helps you find and install software developed and shared by the Python community.

What is PyPI? – https://pypi.org/

There have been many typosquatting incidents over the years. In 2018, a security researcher discovered 12 such libraries and published his analysis on his blog.
https://bertusk.medium.com/detecting-cyber-attacks-in-the-python-package-index-pypi-61ab2b585c67

Image: list of detected typosquatting libraries.

Typosquatting has also been discussed in the Python packaging group, which looked for solutions after these cases were detected.
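As a small illustration of what a name check could look like, here is a sketch that compares a package name against a list of names you already trust. It is not the method used by the researcher or by PyPI; the allow-list, the character folding table, and the 0.85 cutoff are all assumptions of mine.

```python
import difflib

# Hypothetical allow-list; in practice this would be your own vetted dependencies.
KNOWN_PACKAGES = ("python-dateutil", "jellyfish", "requests", "numpy")

# Visually confusable characters folded to one representative.
# Applied before lowercasing so that an uppercase "I" is caught.
CONFUSABLES = str.maketrans({"I": "l", "1": "l", "0": "o"})


def fold(name):
    """Reduce a package name to a form where look-alike names collide."""
    return name.translate(CONFUSABLES).lower().replace("_", "-")


def suspicious_match(name, known=KNOWN_PACKAGES, cutoff=0.85):
    """Return the trusted package that `name` seems to imitate, or None."""
    if name in known:
        return None
    folded_known = {fold(k): k for k in known}
    folded = fold(name)
    if folded in folded_known:
        return folded_known[folded]
    close = difflib.get_close_matches(folded, list(folded_known), n=1, cutoff=cutoff)
    return folded_known[close[0]] if close else None
```

With this allow-list, both `suspicious_match("jeIlyfish")` and `suspicious_match("python3-dateutil")` point back to the genuine package names, while an exact match such as `"jellyfish"` returns `None`.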

I think I have made my point about pip packages. Now, let’s dive into the conda / Anaconda environment.

Anaconda can be considered both a cathedral and a bazaar. It has an official repository called the Anaconda Repository, in which all the libraries are maintained by Anaconda Inc. itself. In addition, there are channels: individuals can create their own channel and store whichever versions they want there. Take opencv, the most used computer vision library, as an example. You can install it with conda from many different channels. You can check the list here: https://anaconda.org/search?q=opencv

This time we already know that those libraries differ in both source code and version. Is it enough for you to go ahead just because N people are already using one of them (the numbers on the left show the number of downloads)? Almost all documentation about image processing with Python recommends OpenCV, but it usually does not mention which source is safe and which is not.

Installing directly from GitHub repositories is worse than using third-party channels in conda. Similarly, using apt / pac / dpkg packages is no different from using the official Anaconda Repository.

It will be better to split this topic into multiple posts to cover all the details. This post introduced the problem and covered some example cases. In the next post, I’ll try to explain how the malicious code in those examples was detected and tested.
