Begot: a Go dependency manager and build tool

We’ve recently been using Go to build many of our new systems at Solano Labs. Like most Go users, we were attracted by the language’s simplicity, reliable concurrency support, and relative maturity. And like most Go users, we quickly discovered all the areas where Go is lacking today, the most prominent of which is build and dependency management.

We eventually solved our problems by doing what it seems like just about every serious user of Go has done: writing our own build and dependency management tool. We call ours begot (because dependency trees look a little like ancestry trees).

How we use Go

Before talking about our solution, I’d like to lay out some context and the problems we’re trying to solve. Consider one of our projects with this simple dependency structure:

[Figure: dependency graph]

  • solano_build_agent depends on solano_common and Gorilla’s mux library.
  • solano_common also depends on mux, as well as goamz.

Note that both repos solano_build_agent and solano_common are private git repos. solano_common contains code that’s shared among our various Go projects.

Imagine you’re a new developer working on solano_build_agent and you’d like to clone the repo and start hacking.

Here are a few of the problems you’ll run into if you try to use the standard go tool naïvely:

  • go get will not be able to automatically clone the solano_common repo, because it requires authenticated access to GitHub. This is true even if you have an ssh key that GitHub accepts loaded into your ssh-agent, because the go tool always constructs https git urls.
  • Once you clone solano_common manually and continue, you’ll get the latest versions of mux and goamz. But these versions might be different from the versions you got on another machine, and different from other developers on the team.

Alright, those problems are well-known, and some may argue that Godep-style vendoring can solve them completely (although we believe there are better solutions). What else?

Well, the dependency tree is actually a little more complicated. A while ago, goamz was missing some slightly specialized functionality that we needed, so to make progress quickly, we forked it and added some code. Forking a Go library isn’t so trivial, though! GitHub makes the first step simple, but the various packages in the goamz repo refer to each other with import paths that contain the full path to the original repo. So to actually use the fork, you then have to do a global search-and-replace on all the intra-repo import paths.

Besides being a task that would ideally be handled automatically, rewriting import paths also makes it harder to make a pull request to the upstream repo, because you have to prepare a branch with your changes but without the import path rewrite.

And that’s just for the forked repo. You also have to rewrite import paths in your code that depends on the forked repo in order to use it. And then when your changes are accepted upstream and you want to deprecate your fork, you have to rewrite them back again.

There’s one more related problem with forking: suppose we also depended on a package that depended on goamz, for example, https://github.com/ryansb/af3ro. If we don’t do anything more, af3ro would use the original goamz, which would lead to two copies of goamz’s s3 package in our binaries. This might work, or it might not, depending on how that package works, but either way it’s wasteful. To avoid duplicating that code, we’d have to fork af3ro also and rewrite its import paths to use our forked goamz.

This is starting to sound like a mess. How did what looks like such a simple scheme (using a URL as an import path) lead to all these problems? There are several fundamental reasons:

  • An import path like "github.com/goamz/goamz" only names a location of a repo, not a particular version of the code in that repo. The gb project has an excellent description of this problem. To enable repeatable builds, a specific version of the dependency must be recorded somewhere.
  • Repeating that import path in every file of a project that depends on goamz is highly redundant information. As with most instances of redundancy in software engineering, this makes things (project dependencies, in this case) brittle. There’s no need to repeat a full path hundreds of times over; it can be factored out into one location that’s more easily updated, say, to point to a fork.
  • A project contained in a single repository hosted on GitHub might be composed of multiple Go packages with dependencies in between them. These imports have to specify the fully-qualified path to the repo even though they’re referring to other packages within the same repo! This is like building a web site without relative links: it makes it hard to move code around or work with forks of the repo.

How begot works

The first thing we’d like to do is break the connection between import path and code location by letting import paths be arbitrary and chosen by the importing project. This makes lots of things better:

  • We can use short and meaningful import paths in the actual Go code.
  • We can document our dependencies in one place.
  • We can attach metadata to each dependency, such as branches, versions, and specific revision ids.
  • We can specify a git URL scheme so we can use SSH-based URLs for private repos.
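To make the list above concrete, here is a minimal Go sketch of what one dependency entry could carry once it’s read into memory. This is not begot’s actual data model: the Dep type, its field names, the cloneURL helper, and the example-org organization are all made up for illustration, and the fallback rule assumes a GitHub-style host/owner/repo layout.

```go
package main

import (
	"fmt"
	"strings"
)

// Dep is a hypothetical in-memory form of one dependency entry.
type Dep struct {
	ImportPath string // original repo-based path, e.g. "github.com/gorilla/mux"
	GitURL     string // optional override, e.g. an SSH URL for a private repo
	Ref        string // optional branch or tag to track; empty means the default branch
}

// cloneURL picks the URL to clone from. An explicit GitURL wins; otherwise
// we assume the repo root is the first three path segments (host/owner/repo)
// and fall back to https.
func cloneURL(d Dep) string {
	if d.GitURL != "" {
		return d.GitURL
	}
	parts := strings.Split(d.ImportPath, "/")
	if len(parts) > 3 {
		parts = parts[:3]
	}
	return "https://" + strings.Join(parts, "/")
}

func main() {
	// Keys are the short names used in Go code; example-org is a placeholder.
	deps := map[string]Dep{
		"gorilla/mux": {ImportPath: "github.com/gorilla/mux"},
		"solano/util": {
			ImportPath: "github.com/example-org/solano_common/util",
			GitURL:     "git@github.com:example-org/solano_common",
		},
	}
	for name, d := range deps {
		fmt.Printf("%s -> %s\n", name, cloneURL(d))
	}
}
```

The point of the sketch is only that the short name, the real location, and the clone URL are three separate pieces of data, so any one of them can change without touching the Go source.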

We’d like our import paths to use nice short names, but we do have to eventually resolve them to an SCM repo or code location of some kind. Other languages use central registries for this sort of thing, e.g. pip or npm. We think the Go designers were correct in steering away from that sort of system. Instead, we let every project define their own mapping.

That import name mapping, and all the other dependency metadata, goes in a file at the root of the project. We realize that some people might be turned off by this. Overall, we agree that metadata-free builds are a laudable ideal, but in practice, they seem to be unattainable. Builds need metadata, and the metadata has to live somewhere. The most sensible place for it is in a file within the repo.

This file, named Begotten, might look like this (following the example above):

The keys of the deps map are the import paths that we use in our Go code.

Once we have a file listing all our dependencies, we can create a tool to perform operations on them. The basic ones are:

  • begot update fetches the latest version of all (or some) of the project’s dependencies and writes their specific revisions to Begotten.lock.
  • begot fetch fetches all the project’s dependencies without changing any locked revision ids.
  • begot build, assuming all dependencies have been fetched, builds all the binaries in the project, ensuring that all dependencies point to the specific revision ids in Begotten.lock.

In addition to import paths using names in the Begotten file, import paths can refer to packages within the project repo, named with their path from the root. So task/task.go can refer to runner/runner.go with just import "runner".
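As a rough illustration of the two kinds of import path just described, the following Go sketch classifies an import as a named dependency, a project-local package, or something else (such as the standard library). The resolve function is hypothetical, and checking the dependency names before local packages is our assumption for the sketch, not a statement about begot’s precedence rules.

```go
package main

import "fmt"

// resolve classifies an import path seen in project code.
// deps maps short names from the dependency file to original repo paths;
// localPkgs is the set of package directories relative to the project root.
func resolve(imp string, deps map[string]string, localPkgs map[string]bool) string {
	if target, ok := deps[imp]; ok {
		return "dependency:" + target
	}
	if localPkgs[imp] {
		return "local:" + imp
	}
	return "other:" + imp // e.g. a standard library package
}

func main() {
	deps := map[string]string{"gorilla/mux": "github.com/gorilla/mux"}
	localPkgs := map[string]bool{"runner": true, "task": true}

	for _, imp := range []string{"gorilla/mux", "runner", "fmt"} {
		fmt.Println(imp, "=>", resolve(imp, deps, localPkgs))
	}
}
```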

Repo aliases

What about forking? What can we do to make that easier?

What we’re really looking for is a way to say that references to one package should be redirected to another one, across all the code used by this project, including all dependencies. So let’s say it:

When we add a key to repo_aliases, begot ensures that all references to goamz/goamz will actually end up pointing to our fork, including our explicit reference in the deps section, and all dependencies that use import paths like github.com/goamz/goamz/s3, even the code in our fork itself! So we don’t have to make any changes to the forked code, nor change the path by which we import goamz. We just need to add one line to our Begotten file.
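The redirection itself is conceptually a prefix substitution on import paths. Here is a small Go sketch of that idea; applyAliases is a hypothetical helper, example-org stands in for the organization hosting the fork, and picking the longest matching alias is our own choice to keep the result deterministic.

```go
package main

import (
	"fmt"
	"strings"
)

// applyAliases redirects path if it names an aliased repo or a package
// inside one. Matching is on whole path segments, so an alias for
// "github.com/goamz/goamz" does not touch "github.com/goamz/goamzextra".
func applyAliases(path string, aliases map[string]string) string {
	best := ""
	for from := range aliases {
		matches := path == from || strings.HasPrefix(path, from+"/")
		if matches && len(from) > len(best) {
			best = from
		}
	}
	if best == "" {
		return path
	}
	return aliases[best] + path[len(best):]
}

func main() {
	aliases := map[string]string{
		"github.com/goamz/goamz": "github.com/example-org/goamz",
	}
	fmt.Println(applyAliases("github.com/goamz/goamz/s3", aliases))
	fmt.Println(applyAliases("github.com/gorilla/mux", aliases))
}
```

Because the substitution is applied to every import in every dependency, a package like af3ro would end up on the fork too, without being forked itself.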

Updating dependencies

So what does it feel like to use begot day-to-day? Mostly it’s just a matter of switching from go install ./... to begot build, and go get ./... to begot fetch. The more interesting parts are when your dependencies are changing.

New code in a dependency

First, let’s suppose you want to add a new function to solano_common and use it in solano_build_agent. To start, you’d clone solano_common and run begot fetch to get its dependencies, then begot build just to check that it compiles. (In practice you’d probably already have a local clone.) Then you’d add your new code and a test and make sure it passes. Then push to your central git server. For now, let’s say that you push directly to the master branch.

Now, to use it in solano_build_agent, you’d run begot update solano/util in that project’s directory (the arguments to update are local import paths, or no args to update all dependencies). That will write out a new Begotten.lock file pointing to the revision of solano_common that you just pushed. You can now write code using the new function and commit it, either in the same commit as the dependency change or in a following one. (Your development team may have a policy about always making dependency changes in separate commits to make it easier to understand history and revert changes.)

Note that in this simple case, we’ve made as few as two commits in total, which is the same number we would need to make with a vendoring-based approach. Given that we’re adding code in one repo and using it in another, we can’t get away with fewer than two.

Now suppose the change to solano_common had to be made on a feature branch and a pull request submitted for review. In that case, while the pull request was pending, you could edit the Begotten file in solano_build_agent to point to that branch, and then run begot update as before. It would look like:

Most likely, the changes you’re working on to solano_build_agent would also be on a feature branch. Presumably, the changes to solano_common would get merged before the current changes, so you could remove the ref: add-my-feature line and begot update once more before merging. (And you could even squash the git history so the temporary change to Begotten doesn’t appear.)

Update of a dependency’s dependency

Second, suppose that you want to update solano_build_agent to a newer version of gorilla/mux. As mentioned above, solano_common also depends on mux. If you try to begot update gorilla/mux in solano_build_agent by itself, begot will complain that the requested revisions of mux now conflict: solano_common depends on one revision (an older one) and the current project depends on a newer revision.

In other words, if a dependency is using begot, begot will ensure that it’s fixed to a specific revision of each of its dependencies!

This is stricter than many other dependency management systems, but it provides the fewest surprises for the user: if you’ve built some code with begot build and tested it, you’ll always get exactly that version of all your dependencies, no matter whether you’re working on that project, or some other project that depends on it (even indirectly).
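The strict rule can be pictured as a simple check: collect every pinned revision of every repo, from the project and from each of its dependencies, and flag any repo that is pinned to more than one revision. The Go sketch below shows that check under our own assumptions. The Pin type, the findConflicts function, and the revision ids are invented for illustration and don’t reflect how begot stores or compares revisions.

```go
package main

import "fmt"

// Pin records that one project requires a repo at a specific revision.
type Pin struct {
	Repo       string // repo being depended on
	Rev        string // locked revision id
	RequiredBy string // project whose lock file contains this pin
}

// findConflicts returns, for each repo pinned to more than one distinct
// revision, all the pins that mention it.
func findConflicts(pins []Pin) map[string][]Pin {
	byRepo := make(map[string][]Pin)
	for _, p := range pins {
		byRepo[p.Repo] = append(byRepo[p.Repo], p)
	}
	conflicts := make(map[string][]Pin)
	for repo, ps := range byRepo {
		for _, p := range ps[1:] {
			if p.Rev != ps[0].Rev {
				conflicts[repo] = ps
				break
			}
		}
	}
	return conflicts
}

func main() {
	pins := []Pin{
		{Repo: "github.com/gorilla/mux", Rev: "newer", RequiredBy: "solano_build_agent"},
		{Repo: "github.com/gorilla/mux", Rev: "older", RequiredBy: "solano_common"},
		{Repo: "github.com/goamz/goamz", Rev: "abc", RequiredBy: "solano_common"},
	}
	for repo, ps := range findConflicts(pins) {
		fmt.Println("conflict on", repo)
		for _, p := range ps {
			fmt.Printf("  %s wants %s\n", p.RequiredBy, p.Rev)
		}
	}
}
```

Keeping the RequiredBy field around is what makes it possible to say where each conflicting requirement came from.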

In practice, we haven’t found this strictness to be too onerous. All you need to do in this case is go to solano_common and do begot update gorilla/mux there first, then test that that package works with the newer version of the dependency. Then commit and push that change. Now you can go back to solano_build_agent and begot update solano/util (or any other package from the solano_common repo). That will automatically update gorilla/mux as well.

Note that the minimum number of commits here is two again: one for the dependency and one for the main project. In this case, we believe that it’s important to have some sort of record in the dependency’s history that it was tested with a newer revision of its dependency and that is now the recommended revision to use it with, even though the original goal wasn’t related to that dependency at all. While a less strict resolution scheme might get away with one commit, the second one does provide some value.

Having said that, if you have a complicated web of dependencies that are all changing quickly, this strict approach will probably not scale. We have been thinking about ways to relax it and let a project override some of its dependencies’ dependencies, but haven’t implemented anything yet. If you have ideas along these lines, get in touch.

Implementation details

In practice, begot build is a wrapper for the go tool. In order to have complete control over the inputs to go, begot manages a pair of workspaces for each project, one workspace for dependencies and one for the project’s own code. The dependency workspace is actually composed of symlinks to a shared cache of dependencies. The sharing is so that common dependencies don’t have to be downloaded multiple times.

All the code in the dependency workspace has its imports scanned to identify implicit transitive dependencies, and then rewritten to point to names that begot has assigned to them. The import rewriting is automatic and is even cached to speed up builds when dependencies haven’t changed. The rewritten code is never seen by the user and never sent anywhere.
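For readers curious what scanning and rewriting imports can look like, here is a sketch using the standard go/parser and go/format packages. It is not begot’s implementation: rewriteImports and the rename callback are hypothetical, and the "renamed/" prefix in main is just a stand-in for whatever names a tool might assign.

```go
package main

import (
	"bytes"
	"fmt"
	"go/format"
	"go/parser"
	"go/token"
	"strconv"
)

// rewriteImports parses src, records every import path it finds, replaces
// each with rename(path), and returns the re-printed source.
func rewriteImports(src string, rename func(string) string) (string, []string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "src.go", src, parser.ParseComments)
	if err != nil {
		return "", nil, err
	}
	var seen []string
	for _, spec := range file.Imports {
		path, err := strconv.Unquote(spec.Path.Value)
		if err != nil {
			return "", nil, err
		}
		seen = append(seen, path)
		// Remember the old end position before the literal changes length.
		spec.EndPos = spec.End()
		spec.Path.Value = strconv.Quote(rename(path))
	}
	var buf bytes.Buffer
	if err := format.Node(&buf, fset, file); err != nil {
		return "", nil, err
	}
	return buf.String(), seen, nil
}

func main() {
	src := "package p\n\nimport \"github.com/goamz/goamz/s3\"\n\nvar _ = s3.New\n"
	out, seen, err := rewriteImports(src, func(p string) string { return "renamed/" + p })
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("imports found:", seen)
	fmt.Print(out)
}
```

The list of paths collected during the scan is what lets a tool discover implicit transitive dependencies, while the rewritten output is what would be handed to the go tool.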

There are a few more tricks involved that we plan to discuss in future blog posts.

How begot doesn’t work

Although begot has served our needs well for several months now, it’s far from perfect, and there are a bunch of areas that we plan on improving:

Code mirroring (i.e. safety without vendoring)

One advantage of vendoring-based approaches is that by having all the code embedded in the main project, you’re protected from situations where a third-party repository is unavailable, or a maintainer removes specific commits that you depend on. In the current climate of having everything hosted on maybe-not-as-reliable-as-you-wished GitHub and the popularity of depending directly on repositories owned by other organizations, this is a very valid concern.

However, vendoring dependencies also has many downsides, including repository bloat, discarding version control metadata (making it hard to follow the history of dependencies), and difficulty of temporarily forking dependencies.

We believe that a hybrid approach can solve this problem best: we’re planning to write a tool that’s separate from begot, but integrated with it, that will be responsible for mirroring third-party repos to a service under your control. That might mean your own GitHub organization, or a different git hosting service entirely, for more diversity of service providers. The begot update flow would change slightly: instead of pulling directly from origin repos, the mirror tool would first sync your mirror with the origin, then pull from the mirror to your local cache. A new directive in the Begotten file would specify the location of the mirror, and begot would prefer to use the mirror for most operations, but be able to fall back to the original origin repo if the mirror is unavailable.

More flexible transitive dependencies

As discussed above, begot’s handling of transitive dependencies is very strict. We probably want to be able to specify allowable versions in the form of ranges of commit ids or semantic-version-formatted tags. There are a lot of different options here, and the final result will depend on experience using begot on larger dependency graphs.

Concurrent operation

Some effort has been spent making begot able to run git operations and code rewriting concurrently, which greatly reduces the run time of typical begot fetch and begot update operations (begot build runtime is dominated by time spent in the go toolchain).
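As a generic illustration of running per-repo work concurrently, the sketch below fans out one goroutine per repo and caps how many run at once. It makes no claim about how begot schedules its git operations: fetchAll, the limit parameter, and the fake fetch function in main are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errUnavailable is a placeholder error used by the fake fetcher below.
var errUnavailable = errors.New("repository unavailable")

// fetchAll calls fetch once per repo, running at most limit calls at a time,
// and returns the error (or nil) produced for each repo.
func fetchAll(repos []string, limit int, fetch func(repo string) error) map[string]error {
	if limit < 1 {
		limit = 1
	}
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]error, len(repos))
		sem     = make(chan struct{}, limit) // counting semaphore
	)
	for _, repo := range repos {
		wg.Add(1)
		go func(repo string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			err := fetch(repo)
			mu.Lock()
			results[repo] = err
			mu.Unlock()
		}(repo)
	}
	wg.Wait()
	return results
}

func main() {
	repos := []string{"github.com/gorilla/mux", "github.com/goamz/goamz", "example.invalid/missing"}
	// A stand-in for a real git fetch.
	fake := func(repo string) error {
		if repo == "example.invalid/missing" {
			return errUnavailable
		}
		return nil
	}
	for repo, err := range fetchAll(repos, 2, fake) {
		fmt.Printf("%s: %v\n", repo, err)
	}
}
```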

Better error messages

Currently, begot’s version conflict error messages are not particularly useful when trying to figure out where a conflict is coming from. Some more bookkeeping to keep track of all paths to a particular dependency will help here.

Other SCM systems

Begot only works with git for now. It uses a few git tricks that make supporting other SCMs not quite trivial, but it should happen eventually.
