January 29, 2024

Infield's Data Pipeline

Brian Boylen
Founding Engineer

As part of Infield’s mission to make upgrading open source dependencies easy and safe, we’re building a comprehensive dataset of open source packages and metadata about each package version. We use LLMs and human-in-the-loop changelog review to categorize breaking changes for each version, then enrich this data with upgrade experience data from the community. This post will go into more detail about how we’re gathering this data.

Many programming languages have a centralized database where information about packages is stored. For example, JavaScript has the npm registry and Ruby has the RubyGems database. We fetch our package information from these registries, but how we fetch it differs by ecosystem.

The RubyGems website provides a nightly SQL dump of all of its gem data. We stay current by fetching this dump every night and updating our own package and package version data accordingly. The npm registry doesn’t have an equivalent way to download information for all of its packages, but it does provide an API to fetch a single package’s information. Since there are over 2 million packages on the npm registry, we have decided not to fetch every package preemptively. Instead, when a user installs Infield for a JavaScript repository, we fetch any npm package that doesn’t already exist in our database. As with Ruby, we run a nightly task that refreshes every npm package we store, so that we always know about new versions and patches.
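A minimal sketch of the on-demand npm approach described above, for illustration only and not Infield's production code. It works out which dependencies in a package.json are not stored yet and builds the registry URL for each one. The function names and the `known` set (standing in for the database) are hypothetical. The sketch assumes the public registry serves a package's metadata at https://registry.npmjs.org/<name>, with the slash in scoped names percent-encoded.

```python
import json
from urllib.parse import quote

NPM_REGISTRY = "https://registry.npmjs.org"


def missing_packages(package_json_text, known_packages):
    """Return dependency names from a package.json that are not stored yet."""
    manifest = json.loads(package_json_text)
    names = set()
    for section in ("dependencies", "devDependencies"):
        names.update(manifest.get(section, {}))
    return sorted(names - set(known_packages))


def registry_url(name):
    """Build the metadata URL for one package (keep '@', encode '/')."""
    return f"{NPM_REGISTRY}/{quote(name, safe='@')}"


# Hypothetical input: a small manifest and the packages already stored.
manifest_text = json.dumps({
    "dependencies": {"react": "^18.2.0", "@babel/core": "^7.23.0"},
    "devDependencies": {"jest": "^29.7.0"},
})
known = {"react"}

to_fetch = missing_packages(manifest_text, known)
urls = [registry_url(name) for name in to_fetch]
```

Only the packages missing from `known` produce a URL, which keeps the number of registry requests proportional to what a given repository actually uses.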

Once we have package-level information, we can grab the changelogs so that we can analyze breaking changes. We consider a package version fully researched once we have classified every change in that version. The primary classification is whether the change is “Breaking” or “Non-Breaking”. We also give each change a type classification, such as “Added”, “Removed”, “Changed”, “Deprecated” or “Fixed”. At this point, Infield is able to recommend upgrades that include this package version, since we can show users the exact breaking changes, if any, they need to worry about.
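One way to picture these classifications is as a small data model. The sketch below is an assumption about how such records could be shaped, not a description of Infield's actual schema; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ChangeType(Enum):
    ADDED = "Added"
    REMOVED = "Removed"
    CHANGED = "Changed"
    DEPRECATED = "Deprecated"
    FIXED = "Fixed"


@dataclass
class Change:
    summary: str
    breaking: Optional[bool] = None            # None means not yet classified
    change_type: Optional[ChangeType] = None   # None means not yet classified


@dataclass
class PackageVersion:
    name: str
    version: str
    changes: List[Change] = field(default_factory=list)

    def fully_researched(self):
        """True when every change carries both classifications.

        In this sketch a version with an empty change list counts as
        researched, which a real system may want to treat differently.
        """
        return all(
            c.breaking is not None and c.change_type is not None
            for c in self.changes
        )


release = PackageVersion("example-gem", "2.0.0", [
    Change("Removed the legacy client", True, ChangeType.REMOVED),
    Change("Fixed a retry bug", False, ChangeType.FIXED),
])
```

Here `release.fully_researched()` is true; appending a `Change` with only a summary would make it false until that change is classified.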

To do this research, Infield first needs to find the relevant changelogs for a package. This information is often stored in a file in the project repository named something like “CHANGELOG.MD” or “CHANGES.MD”. In these cases, we can make a request to the GitHub API (or whichever code hosting service the project uses) to fetch the file. While this is straightforward for the average changelog that follows common standards like the Keep a Changelog format, it becomes more complex for changelogs stored in idiosyncratic formats. As we come across different ways developers store or write their changelogs, we update our system to read these patterns automatically.
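As a rough sketch of the file-discovery step, the function below picks a likely changelog out of a list of file names from a repository's root directory. The set of base names and extensions in the pattern is an assumption chosen for illustration; it would grow as new conventions turn up. Fetching the file list from a hosting API is left out.

```python
import re

# Hypothetical pattern: common base names, optional extension, any case.
CHANGELOG_PATTERN = re.compile(
    r"^(changelog|changes|history|news|releases)(\.(md|markdown|rst|txt))?$",
    re.IGNORECASE,
)


def find_changelog(filenames):
    """Return the first file name that looks like a changelog, or None."""
    for candidate in filenames:
        if CHANGELOG_PATTERN.match(candidate):
            return candidate
    return None


root_files = ["README.md", "LICENSE", "CHANGES.MD", "package.json"]
changelog_name = find_changelog(root_files)
```

Returning `None` when nothing matches gives the caller a clear signal that the package needs one of the less common lookup strategies.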

After fetching the changelog we need to read it programmatically. This is where we use large language models (LLMs), currently OpenAI’s GPT-4. To get the changelog ready for classification by the LLM, we first parse the changelog text file and split it into chunks, one for each version of the package. We then feed each version-specific chunk into the LLM, along with a prompt that explains how it should classify each change and how we want the output structured. For every change within the chunk, the LLM outputs a summary of the change and the “Breaking / Non-Breaking” classification, among other things. If we do this for every version of a package, then we will be able to confidently suggest any upgrade for that package, letting users know which breaking changes, if any, they need to worry about.
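The splitting step can be sketched as follows for a changelog in the Keep a Changelog style, where each version has a level-two heading such as "## [1.2.0] - 2024-01-15". The regular expression and the prompt wording are assumptions made for this example, not the ones Infield uses, and the call to the LLM itself is omitted.

```python
import re

# Matches whole heading lines such as "## [1.2.0] - 2024-01-15" or "## v1.1.0".
VERSION_HEADING = re.compile(
    r"^##\s+\[?v?(\d+\.\d+\.\d+[^\]\s]*)\]?.*$",
    re.MULTILINE,
)


def split_by_version(changelog_text):
    """Map each version string to the text found under its heading."""
    matches = list(VERSION_HEADING.finditer(changelog_text))
    chunks = {}
    for i, match in enumerate(matches):
        if i + 1 < len(matches):
            end = matches[i + 1].start()
        else:
            end = len(changelog_text)
        chunks[match.group(1)] = changelog_text[match.end():end].strip()
    return chunks


def build_prompt(version, chunk):
    """Assemble a hypothetical classification prompt for one version."""
    return (
        f"Classify each change in version {version} as Breaking or "
        "Non-Breaking, and give it a type (Added, Removed, Changed, "
        "Deprecated or Fixed). Return one JSON object per change.\n\n"
        f"{chunk}"
    )


sample = """# Changelog

## [Unreleased]

## [1.2.0] - 2024-01-15
### Removed
- Dropped support for Ruby 2.6

## [1.1.0] - 2023-11-02
### Fixed
- Corrected timezone handling
"""

chunks = split_by_version(sample)
prompt = build_prompt("1.2.0", chunks["1.2.0"])
```

Headings without a version number, such as "## [Unreleased]", do not match the pattern and are skipped in this sketch.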

In the interest of strengthening the open source ecosystem, we freely display all of this package research on our public package pages. For any package we have research on, you can select a version range and get a comprehensive list of every breaking change between those two versions. If you are planning on doing an upgrade, you can check the public package page on Infield to aid you in researching and planning your upgrade.
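A version-range lookup of this kind could be sketched as below. The data layout is hypothetical, and the sketch assumes plain numeric versions like "1.2.0" and a range that excludes the current version and includes the target. Real version schemes with pre-release tags need a proper version parser.

```python
def parse_version(version):
    """Turn '1.2.0' into (1, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def breaking_changes_between(changes_by_version, current, target):
    """List (version, summary) for breaking changes after current, up to target."""
    low, high = parse_version(current), parse_version(target)
    result = []
    for version in sorted(changes_by_version, key=parse_version):
        if low < parse_version(version) <= high:
            result.extend(
                (version, summary)
                for summary, breaking in changes_by_version[version]
                if breaking
            )
    return result


# Hypothetical research data: version -> list of (summary, is_breaking).
research = {
    "1.0.0": [("Initial release", False)],
    "1.1.0": [("Removed legacy API", True), ("Added retry option", False)],
    "2.0.0": [("Renamed config key", True)],
    "2.1.0": [("Dropped an old runtime", True)],
}

plan = breaking_changes_between(research, "1.0.0", "2.0.0")
```

For this data, `plan` contains the breaking changes from 1.1.0 and 2.0.0 only, in version order.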

Beyond data in the changelog itself, we also collect undocumented incompatibilities between packages, which we gather from customer experiences and publicly available writing. While most maintainers strive to have complete documentation, there can be incompatibilities that the maintainer isn’t aware of or hasn’t yet documented. Every time an Infield user runs into an undocumented incompatibility, we add it to our database for future users to benefit from.

For example, we encountered an issue for a user upgrading to Rails 6 where the app would not launch due to an error with ActiveRecord and the Arel gem. Rails enthusiasts may remember that in v6.0 of ActiveRecord, the Arel gem was merged into ActiveRecord, instead of being required as an external dependency as it had been previously. This caused an issue with the upgrade we were doing, since one of the other dependencies required the standalone version of Arel, which caused a conflict with the version of Arel that now lived within ActiveRecord. To fix this we had to remove the dependency that required Arel. We then stored this incompatibility between the external Arel gem and ActiveRecord 6.0, so that any future users upgrading to Rails 6.0 will be aware of this issue and will know which packages they need to upgrade or remove to fix it.
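The stored incompatibility from this example could be modeled roughly as follows. The record shape and the function name are hypothetical, chosen only to show how such a record might be checked against a planned upgrade.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incompatibility:
    package: str          # package being upgraded
    from_version: tuple   # first version where the conflict applies
    conflicts_with: str   # package that can no longer be installed alongside
    note: str


KNOWN = [
    Incompatibility(
        "activerecord", (6, 0), "arel",
        "Arel ships inside ActiveRecord from 6.0; remove or upgrade "
        "dependencies that require the standalone gem.",
    ),
]


def conflicts_for_upgrade(package, target_version, installed, known=KNOWN):
    """Return known incompatibilities triggered by upgrading one package."""
    target = tuple(int(part) for part in target_version.split("."))
    return [
        record for record in known
        if record.package == package
        and target >= record.from_version
        and record.conflicts_with in installed
    ]


installed_gems = {"activerecord", "arel", "rack"}
warnings = conflicts_for_upgrade("activerecord", "6.0.0", installed_gems)
```

An upgrade to a 5.x version, or a project that does not install the standalone Arel gem, yields no warnings in this sketch.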

While we think the existing data we gather and surface to users is tremendously useful, this is just the beginning. If you have any suggestions or feedback, send us a note at hello@infield.ai or find us on Twitter/X at @infieldai.