Distribution Confusion in PyPI
A new way to distribute malicious Python packages.
This is my third blog post on PyPI, the others are:
The Python Package Index (PyPI) is the main package repository for Python. Anyone can create an account and upload packages to pypi.org. Packages can be downloaded and installed using package managers such as Pip and Poetry. When installing a package, there are several ways to get something other than what was intended, including Dependency Confusion and Manifest Confusion (more about that in the next blog post). Here we will explore a new technique I will call Distribution Confusion (let me know in the comments if there is already a term for this). I will refer to all of the ways to get something other than what was intended as Resolution Confusion (let me know in the comments if there is already a term for this).
In the last blog post I talked about how installing a specific version of a package, say version ‘1.26.0’ of ‘numpy’ via ‘pip install numpy==1.26.0’, can lead to different outcomes rather than the same one every time: either one of 31 different binary distributions would be installed, or it would install from the source distribution (which allows for arbitrary code execution). Distributions can also be added or removed over time, making consistent behavior even harder to achieve. Any of the distributions could have inconsistent or malicious behavior. In this blog post we’ll go a step further and see that we can also create more than one variant of the same distribution. This enables us to replace the behavior of all existing distributions or change the behavior in targeted ways.
Distribution Confusion and Pinning Bypass
Distribution Confusion refers to the ability to add multiple files that are all equivalent in terms of project, version and tags. Project publishing permissions are required to exploit this. Which of the Distribution Variants is actually resolved depends on which package manager is used and whether hash pinning is in use. While version pinning is considered by many to be a best practice, we will see that hash pinning is the only effective approach in Python.
The lexicographical ordering of the Distribution Variants matters. With no hash pinning and after sorting, Pip will take the last of the duplicate distributions, while Poetry will take the first. The order of the files is based on what is returned by Warehouse, which is sorted by the key (version, filename). Thus we can sandwich the original distribution between two malicious variants, one for Poetry and one for Pip:
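To make the sandwich concrete, here is a minimal sketch of the first-versus-last selection described above. It is not the resolver code of either tool: the function name `pick_candidate` and the strategy labels are made up for illustration, and the file order is simply taken as given, using the example filenames from this post.

```python
def pick_candidate(index_order, strategy):
    """Pick one file among duplicates of the same project and version.

    `index_order` is the list of equivalent filenames in the order the
    index returned them. `strategy` is a hypothetical label:
    "first" models the Poetry behavior described in this post,
    "last" models the Pip behavior described in this post.
    """
    if not index_order:
        raise ValueError("no candidates to choose from")
    if strategy == "first":
        return index_order[0]
    if strategy == "last":
        return index_order[-1]
    raise ValueError(f"unknown strategy: {strategy}")


# Order as described in this post (assumed, not fetched from PyPI):
# a variant sorting before the original, the original, a variant after it.
sandwich = [
    "Requests-2.31.0.tar.gz",   # variant aimed at Poetry
    "requests-2.31.0.tar.gz",   # original distribution
    "requests-02.31.0.tar.gz",  # variant aimed at Pip
]

poetry_like = pick_candidate(sandwich, "first")
pip_like = pick_candidate(sandwich, "last")
print(poetry_like)
print(pip_like)
```

In this model neither strategy ever lands on the original file in the middle, which is the point of the sandwich.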
One way to add a Distribution Variant before another is to change characters in the filename to uppercase, e.g. ‘Requests-2.31.0.tar.gz’ would sort before ‘requests-2.31.0.tar.gz’. One way to add a Distribution Variant after another is to add leading zeros to the version, e.g. ‘requests-02.31.0.tar.gz’ would sort after ‘requests-2.31.0.tar.gz’. If the project name contains non-alphanumeric characters, there are also several ways to write those characters that normalize into equivalent filenames.
Variants can be added in a number of different ways, each with a different effect. We will stack the variants to create different attack sandwiches. The first sandwich is the one we already looked at above:
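The equivalence of these filenames can be sketched in a few lines. This is a simplified illustration, not what Pip or Warehouse actually run: the name normalization follows the PEP 503 rule (lowercase, runs of ‘-’, ‘_’ and ‘.’ collapsed), while `normalize_version` is my own shortcut that only handles plain numeric versions instead of full PEP 440. The ‘my_pkg’ project is hypothetical.

```python
import re

SDIST_EXTENSIONS = (".tar.gz", ".zip")


def normalize_name(name):
    """PEP 503 style: lowercase, runs of '-', '_' and '.' become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def normalize_version(version):
    """Simplified: plain numeric versions only, so '02.31.0' -> (2, 31, 0)."""
    return tuple(int(part) for part in version.split("."))


def sdist_key(filename):
    """Reduce a source distribution filename to (project, version)."""
    for ext in SDIST_EXTENSIONS:
        if filename.endswith(ext):
            stem = filename[: -len(ext)]
            break
    else:
        raise ValueError(f"not a source distribution filename: {filename}")
    # The version is whatever follows the last '-' in the stem.
    name, version = stem.rsplit("-", 1)
    return normalize_name(name), normalize_version(version)


variants = [
    "Requests-2.31.0.tar.gz",
    "requests-2.31.0.tar.gz",
    "requests-02.31.0.tar.gz",
]
keys = {sdist_key(filename) for filename in variants}
print(keys)  # {('requests', (2, 31, 0))}

# Hypothetical project with non-alphanumeric characters in its name.
print(sdist_key("my_pkg-1.0.tar.gz") == sdist_key("My.Pkg-1.0.tar.gz"))  # True
```

All three ‘requests’ filenames collapse to a single key even though they are distinct files with distinct positions in a sorted listing.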
PEP 527 Bypass
There is no limit to how many binary distributions we can add to a version of a project. PEP 527 says that there can only be one source distribution for a given version, and there is a check for this in PyPI. In order to make malicious variants of source distributions this restriction must be bypassed:
The distribution type was not verified against the file extension, so it was possible to add ‘.tar.gz’ or ‘.zip’ files as binary distributions instead. The package managers only look at the filenames, so the incorrect metadata would not matter. This bypass was patched in #14243.
The version in a filename is not verified against the manifest version. Since Pip only cares about the filename, the duplicate source distributions can be added to another version. This has not been patched. For source distributions, implementing PEP 625 is planned and similar verification can be done for binary distributions (Wheel).
In general it’s not hard to detect Distribution Confusion by looking at all distribution filenames in PyPI (historical data can be a bit tricky, as covered in #14371).
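As a rough idea of what such a scan could look like, the sketch below groups a list of filenames by a canonical key and reports groups with more than one file. It is an assumption-laden illustration: `canonical_key` and `find_distribution_variants` are names I made up, version handling is simplified to dot-separated parts rather than full PEP 440, and the input is a hard-coded list rather than real PyPI data.

```python
import re
from collections import defaultdict


def canonical_key(filename):
    """Map a filename to (kind, project, version, tags). Simplified parsing."""
    lowered = filename.lower()
    if lowered.endswith(".whl"):
        # Wheel layout: name-version(-build)-python-abi-platform.whl
        name, version, *tags = filename[:-4].split("-")
        kind = "wheel"
    else:
        for ext in (".tar.gz", ".zip"):
            if lowered.endswith(ext):
                name, version = filename[: -len(ext)].rsplit("-", 1)
                tags, kind = [], "sdist"
                break
        else:
            raise ValueError(f"unrecognized distribution filename: {filename}")
    project = re.sub(r"[-_.]+", "-", name).lower()
    # Numeric parts are compared as integers so leading zeros disappear.
    release = tuple(int(p) if p.isdigit() else p.lower() for p in version.split("."))
    return kind, project, release, tuple(tags)


def find_distribution_variants(filenames):
    """Return only the groups where several files share one canonical key."""
    groups = defaultdict(list)
    for filename in filenames:
        groups[canonical_key(filename)].append(filename)
    return {key: names for key, names in groups.items() if len(names) > 1}


listing = [
    "Requests-2.31.0.tar.gz",
    "requests-2.31.0.tar.gz",
    "requests-02.31.0.tar.gz",
    "requests-2.31.0-py3-none-any.whl",
]
suspicious = find_distribution_variants(listing)
for key, names in suspicious.items():
    print(key, names)
```

The wheel is not flagged because nothing else shares its key, while the three source distributions end up in one group.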
Package managers could detect cases based on the local state and the state of PyPI when fetching the install candidates. Currently it’s only possible to detect new unknown hashes:
$ pip install -r requirements.txt -vvv
Checked 3 links for project 'requests' against 1 hashes (1 matches, 0 no digest): discarding 2 non-matches:
$ poetry install -vvv
Skipping Requests-2.31.0.tar.gz as sha256 checksum does not match expected value
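The effect of hash pinning seen in this output can be modeled roughly as follows. This is only a sketch of the idea, not how Pip or Poetry implement it: the function names are hypothetical, and the file contents are placeholder bytes standing in for real archives.

```python
import hashlib


def sha256_hex(data):
    """Hex digest of the SHA-256 hash of some file content."""
    return hashlib.sha256(data).hexdigest()


def filter_by_pinned_hashes(candidates, pinned):
    """Split candidates into files whose hash is pinned and files to discard.

    `candidates` maps filename -> file content (bytes);
    `pinned` is the set of trusted SHA-256 hex digests.
    """
    matches, discarded = [], []
    for filename, content in candidates.items():
        if sha256_hex(content) in pinned:
            matches.append(filename)
        else:
            discarded.append(filename)
    return matches, discarded


# Placeholder contents, not real archives.
candidates = {
    "Requests-2.31.0.tar.gz": b"malicious variant A",
    "requests-2.31.0.tar.gz": b"original release",
    "requests-02.31.0.tar.gz": b"malicious variant B",
}
# The hash that was reviewed and pinned when the dependency was first locked.
pinned = {sha256_hex(b"original release")}

matches, discarded = filter_by_pinned_hashes(candidates, pinned)
print(f"{len(matches)} matches, discarding {len(discarded)} non-matches")
# 1 matches, discarding 2 non-matches
```

With a pinned hash, the filename ordering no longer decides which file is installed; only the file with the reviewed content survives.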
This research was shared with the PyPI maintainers in June/July 2023, partially patched, and cleared for publication. Thanks to the PyPI maintainers for how they handled the reports. I would also like to thank Stig Palmquist and others at Hackeriet for inspiring and giving feedback on this work.
Developers should use hash pinning for dependencies and review updates to the set of trusted hashes.
PyPI should verify versions in filenames against the version in the metadata and should not allow duplicate distributions. This is the plan for source distributions with #12245, and for binary distributions with #14602.
Package managers can to a certain extent detect mutable releases and distribution confusion client side and could warn or abort accordingly.