Reproducibility in PyPI
Why creating consistent Python builds can be difficult.
This is my second blog post on PyPI.
The Python Package Index (PyPI) is the main package repository for Python. Anyone can create an account and upload packages to pypi.org. PyPI consists of projects like ‘requests’ and ‘numpy’. Each project has releases, and a release consists of one or more distributions associated with a given version. For example, ‘requests’ version ‘2.31.0’ consists of the binary distribution ‘requests-2.31.0-py3-none-any.whl’ and the source distribution ‘requests-2.31.0.tar.gz’.
You can specify the version when installing via Pip using ‘pip install requests==2.31.0’. Pip will default to downloading the binary distribution ‘requests-2.31.0-py3-none-any.whl’. The Wheel binary format specifies a set of tags, in this case the Python version ‘py3’, the ABI ‘none’ (no compiled extension tied to a specific Python ABI) and the platform ‘any’ (not specific to Linux, Mac or Windows). If none of the available Wheel tags match the install environment, or if ‘--no-binary :all:’ is used, the source distribution will be used instead. Installing from a source distribution can execute arbitrary code (to enable compiling binaries etc.).
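To make the tag layout concrete, here is a minimal sketch that splits a Wheel filename into its components using only the standard library. The function and class names are my own and purely illustrative; for real use, the ‘packaging’ library ships a more thorough parser in ‘packaging.utils.parse_wheel_filename’.

```python
from typing import NamedTuple, Optional


class WheelTags(NamedTuple):
    name: str
    version: str
    build: Optional[str]  # optional build number, absent in most wheels
    python: str
    abi: str
    platform: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel filename into its dash-separated components.

    Illustrative only: it does not validate or normalise any component.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python, abi, platform = parts
        build = None
    elif len(parts) == 6:
        # The optional build number sits between the version and the tags.
        name, version, build, python, abi, platform = parts
    else:
        raise ValueError(f"unexpected number of components: {filename}")
    return WheelTags(name, version, build, python, abi, platform)


tags = parse_wheel_filename("requests-2.31.0-py3-none-any.whl")
print(tags.python, tags.abi, tags.platform)
```

For the ‘requests’ wheel this yields the tags ‘py3’, ‘none’ and ‘any’, with no build number.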
It’s possible to specify which tags to match (‘--abi’, ‘--platform’, ‘--python-version’) to narrow down which distribution to use. While ‘requests’ has only two distributions per version, ‘numpy’ has 32 distributions for the release ‘1.26.0’.
The term reproducible builds is used to describe a build process that produces the same result every time. This is useful for consistency across different build environments (e.g. a developer laptop and a build server). It also makes it possible to audit the build process for security.
There are a number of things that could prevent builds with PyPI dependencies from being reproducible:
It’s up to the maintainer to ensure that distributions with the same version are functionally equivalent. (example with 32 distributions for one version)
It’s possible to add distributions to an existing release (example with an 8 year gap). This means that even if the expected distribution is currently being selected, the project maintainer can add distributions in the future that will be selected instead.
Binary distributions can carry a build number that distinguishes Wheels whose name, version and tags would otherwise be the same. It’s not possible to pin on the build number.
Distributions can be deleted, thus breaking the build outright or changing which distribution is selected.
Distributions can be mutated (more in the next blog post).
Projects can be deleted by the maintainer and picked up by someone else.
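One of the points above, distributions being added to an existing release long after the fact, can be checked from release metadata. The sketch below assumes a list of file entries shaped like the ‘urls’ array of PyPI’s JSON API (‘https://pypi.org/pypi/<project>/<version>/json’), using its ‘filename’ and ‘upload_time_iso_8601’ fields. The function name, the one-day tolerance and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def late_additions(files: list[dict], tolerance: timedelta = timedelta(days=1)) -> list[str]:
    """Return filenames uploaded more than `tolerance` after the first upload."""
    parsed = []
    for entry in files:
        # Older Python versions do not accept a trailing "Z" in fromisoformat().
        stamp = entry["upload_time_iso_8601"].replace("Z", "+00:00")
        parsed.append((datetime.fromisoformat(stamp), entry["filename"]))
    first_upload = min(when for when, _ in parsed)
    return sorted(name for when, name in parsed if when - first_upload > tolerance)


# Hypothetical release: the wheel was added eight years after the sdist.
example_files = [
    {"filename": "example-1.0.tar.gz", "upload_time_iso_8601": "2015-01-01T00:00:00.000000Z"},
    {"filename": "example-1.0-py3-none-any.whl", "upload_time_iso_8601": "2023-01-01T00:00:00.000000Z"},
]
print(late_additions(example_files))
```

A non-empty result does not mean anything malicious happened, only that the release changed after it was first published.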
What about hash pinning?
By default, pip-tools (commonly used with Pip) and Poetry add the hashes of all distributions for a given version. So while the version is pinned, any one of those distributions could end up being installed, including the source distribution. Hash pinning will prevent new or changed distributions from being installed as long as the hashes are not resolved again. ‘pip-tools’ appears not to add new hashes to the trusted set, while ‘poetry update’ does add the new hashes. Regardless, when you update to a new version, the trust placed in the existing hashes is not carried over to the new set of hashes, so rather than Trust On First Use (TOFU) we have Trust On Every Update (TOEU), which could lead to unexpected results. Let me know in the comments if there is already a term for what TOEU is trying to describe.
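As a rough illustration of what hash pinning checks, the sketch below compares the SHA-256 digest of a downloaded file against a set of trusted digests. This is not how Pip or Poetry implement it internally; the function names and the byte strings standing in for distribution files are made up for the example.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def is_trusted(data: bytes, trusted_hashes: set[str]) -> bool:
    """Accept a file only if its digest is in the pinned set."""
    return sha256_of(data) in trusted_hashes


# Stand-ins for the contents of a wheel and a source distribution.
wheel = b"contents of the wheel"
sdist = b"contents of the source distribution"

# Pinning the hashes of every distribution in the release means
# either file is accepted, not just the one that was reviewed.
trusted = {sha256_of(wheel), sha256_of(sdist)}

print(is_trusted(wheel, trusted))
print(is_trusted(sdist, trusted))
print(is_trusted(b"a distribution added later", trusted))
```

The first two checks pass and the third fails, which shows both sides of the trade-off: new or changed files are rejected, but every distribution that was present when the hashes were resolved is trusted equally.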
What about signatures?
In May 2023 it was announced that PGP signatures would be removed from PyPI. So there is no longer an official way to distribute signatures, and signed metadata via The Update Framework (TUF) is still a work in progress. For the time being there is no official alternative to trusting the hashes returned from PyPI.
I think immutable releases and reproducible builds through pinning are worthwhile goals. Today it’s up to the project maintainer to create such an experience since it’s not enforced by PyPI itself.
Developers should use hash pinning for dependencies. If a tool that updates hashes adds or removes hashes for an existing release (version), then the behavior of the release might have changed and warrants a closer look.
Package managers could be louder when unknown hashes are observed or when expected hashes are missing (meaning that distributions have been added to or removed from an existing release).
It would be nice to have immutable releases in PyPI, i.e. releases where distributions cannot be added or deleted after the initial release of a version.
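Until releases are immutable, the check suggested for developers above can be approximated outside the package manager. The sketch below compares the hashes recorded for each pinned release before and after a lock file update, and reports releases whose version stayed the same while the set of hashes changed. The data structure (a mapping from name and version to a set of hashes) and the short placeholder hashes are assumptions for illustration; parsing a real lock file is left out.

```python
Release = tuple[str, str]  # (project name, version)


def hash_changes(old: dict[Release, set[str]], new: dict[Release, set[str]]) -> dict:
    """Report hashes added to or removed from releases present in both locks."""
    changes = {}
    for release in old.keys() & new.keys():
        added = new[release] - old[release]
        removed = old[release] - new[release]
        if added or removed:
            changes[release] = {"added": added, "removed": removed}
    return changes


# Placeholder hashes, not real digests.
old_lock = {
    ("requests", "2.31.0"): {"sha256:aaa", "sha256:bbb"},
    ("numpy", "1.26.0"): {"sha256:ccc"},
}
new_lock = {
    ("requests", "2.31.0"): {"sha256:aaa", "sha256:ddd"},  # same version, different files
    ("numpy", "1.26.1"): {"sha256:eee"},  # version bump, not compared
}
print(hash_changes(old_lock, new_lock))
```

Here only ‘requests’ is flagged, since its version is unchanged but one hash was removed and another added. The ‘numpy’ upgrade is ignored because a new version is expected to have new hashes.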