Anatomy of a Clean Open-Source Project

Mon Mar 18 11:54:47 EDT 2019

Tags: open-source

Over the years, initially thanks to Peter Tanner's diligent work as the OpenNTF IP Manager and now my own occupation of that seat, I've learned to really appreciate the virtues of dotting your "i"s and crossing your "t"s when it comes to making an open-source project legally clean and clear.

It's definitely something I underrated early on, though - caring about the specific differences between licenses and, in particular, maintaining things like per-file license/copyright headers felt like annoying busywork. For a project that only you will ever use, it technically is, but the hope of open source is that you'll get other people using your work and, ideally, contributing back in turn, and that's when it's important to make sure you have everything sorted out.

Why Bother?

Well, for one, you or your users could theoretically be sued or otherwise legally entangled if you don't keep track of this stuff. Admittedly, it's fairly unlikely, but the consequences of, for example, unknowingly including GPL software in your proprietary product are potentially significant.

It's for that sort of reason that some large corporations won't risk even looking at your project until everything is clean. IBM is particularly good about this because they were significantly burned in the past and came out of it with an extremely strict view, and that rubbed off on OpenNTF both culturally and through their gracious technical assistance along the way.

And, since large consumers require this sort of vetting, it's also important to know how to do it if you want to contribute code to an open-source organization like OpenNTF, Eclipse, or Apache.

It's also surprisingly satisfying once you get into the swing of it, I've found.

The Example Project

Since I've been spending a lot of time recently with the NSF ODP Tooling project, we'll look at its GitHub repository.

Common Files

There are a couple of common files that tend to show up, and which both people and tooling (like GitHub's license identifier) look for:

  • The LICENSE file, which is the most critical. This contains the text of the license you're using, as well as one of the declarations of the copyright year (though, admittedly, it's easy to forget to include that part). This is what declares the effective license for the code in the repository that you own the copyright to, and should be included right from the start if possible.

  • The NOTICE file, which is vital if you're including any code from any sources not covered under the main copyright. This file should list all of the third-party code you have included in the repository, its license type, and, if possible, where to acquire it. If your project's distributable form includes additional third-party code not included in the repository (such as Maven or npm dependencies), these should be enumerated here as well.

    • Writing this file has an important side effect in that it forces you to account for the licenses of your dependencies. More than once, I've run into a situation where I found that a common dependency had an incompatible license (such as the pure GPL). In some cases, this has meant abandoning the dependency outright, while in others it has meant finding a better-licensed alternative. Eclipse Orbit exists in large part for this purpose.
  • A legal directory containing any additional license/redistribution information not covered by the NOTICE. This can also sometimes take the form of files like NOTICE-Weld in the root of a project, and is useful for mass-including copyright/notice information from third parties in their original form.

In addition to including these files in the project repository, you should also make sure to include them in any binary distributions you make. In my projects, this takes the shape of inclusions in a Maven Assembly Plugin packaging file.
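
As a rough sketch of what such a packaging file can look like, here is a minimal Maven Assembly Plugin descriptor that copies the legal files into the root of a ZIP. The descriptor location, the assembly id, and the assumption that LICENSE, NOTICE, and a legal directory sit next to the module's pom.xml are all illustrative, not the exact layout of any particular project.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical src/main/assembly/dist.xml; names and paths are illustrative -->
<assembly xmlns="http://maven.apache.org/ASSEMBLY/2.0.0"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/ASSEMBLY/2.0.0 http://maven.apache.org/xsd/assembly-2.0.0.xsd">
	<id>dist</id>
	<formats>
		<format>zip</format>
	</formats>
	<includeBaseDirectory>false</includeBaseDirectory>

	<fileSets>
		<!-- Top-level license and notice files (assumed to sit beside pom.xml) -->
		<fileSet>
			<directory>${project.basedir}</directory>
			<outputDirectory>.</outputDirectory>
			<includes>
				<include>LICENSE</include>
				<include>NOTICE</include>
			</includes>
		</fileSet>
		<!-- Additional third-party license texts, kept in their original form -->
		<fileSet>
			<directory>${project.basedir}/legal</directory>
			<outputDirectory>legal</outputDirectory>
		</fileSet>
	</fileSets>
</assembly>
```

In a real build, the same descriptor would also list the project's actual artifacts; the point here is only that the legal files travel with the binary.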

File Headers

I originally chafed against the idea of per-file copyright/license headers. They're not strictly necessary when the files are included in the original repository, they're redundant, you end up with massive commits touching hundreds of files just to change a year, and they can dwarf the size of the actual code they're copyrighting.

However, I've really come around to the practice of including them, and the main reason is that it makes the files easier for others to copy and use legally. It's one thing when someone finds their way to the root of your repository or distribution package, but it's another when they find an individual class by doing a web search or hitting F3 in Eclipse. In those cases, they can find their way up to the license (assuming your source package includes it), but it's much easier if it's just declared right up front.

It's also easier to clearly distinguish the third-party code you're including. When each file has its copyright information clearly noted, you can easily tell the difference between a sui-generis project file and an included third-party file without having to parse through the NOTICE every time.

And, fortunately, it doesn't have to be a huge hassle to maintain. In each of my Maven projects, I include a license plugin configuration to declare copyright information, any special data types, and which files to not include. Then, whenever I add new files or make a change after the turn of a year, I can run mvn license:format and it'll keep everything tidy for me.
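
For illustration, a configuration along these lines might look like the sketch below. It assumes the com.mycila license-maven-plugin, which is one plugin that provides the license:format goal; the version, the header file name, the xsp mapping, and the exclusion patterns are placeholders chosen for the example rather than settings taken from any specific project.

```xml
<plugin>
	<!-- Assumption: the com.mycila plugin, which provides license:format -->
	<groupId>com.mycila</groupId>
	<artifactId>license-maven-plugin</artifactId>
	<version>3.0</version>
	<configuration>
		<!-- Process all modules from the parent project in one pass -->
		<aggregate>true</aggregate>
		<!-- Header template file; this path is hypothetical -->
		<header>license.txt</header>
		<!-- Values that the header template can reference, e.g. ${owner} -->
		<properties>
			<owner>Jesse Gallagher</owner>
		</properties>
		<!-- Special file types: treat an unusual extension as a known comment style -->
		<mapping>
			<xsp>XML_STYLE</xsp>
		</mapping>
		<!-- Files that should not receive a header (illustrative patterns) -->
		<excludes>
			<exclude>**/pom.xml</exclude>
			<exclude>**/README.md</exclude>
			<exclude>src/test/resources/**</exclude>
		</excludes>
	</configuration>
</plugin>
```

With something like this in the parent pom.xml, running mvn license:format from the project root rewrites any missing or outdated headers in place.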

pom.xml Configuration

Maven (and it's not alone in this) provides a lot of pom.xml-level elements to declare all sorts of metadata about your project, like its SCM repository, issue tracker, and, critically, license and developers. I like to declare the inception year, the license, and the <developers> block:

	<inceptionYear>2018</inceptionYear>

	<licenses>
		<license>
			<name>The Apache Software License, Version 2.0</name>
			<url>http://www.apache.org/licenses/LICENSE-2.0.txt</url>
		</license>
	</licenses>

	<developers>
		<developer>
			<name>Jesse Gallagher</name>
			<email>jesse@frostillic.us</email>
		</developer>
	</developers>

I use that <inceptionYear/> value as part of the license-header file to keep track of the copyright range, at least when there's contiguous multi-year development.

OSGi Stuff

Since most of my projects are still OSGi, I've been aiming to improve my licensing setup there too. The main place where this comes into play is in feature.xml files, which have required elements to specify the copyright and license. It's not terribly unusual for these to end up as their "[Enter Copyright Description here.]" defaults, but it's important to fill these in. They're included in the "accept the licenses" dialog when installing into Eclipse/Designer, and are available in the "Installed software" descriptions in the UI.
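
A filled-in feature.xml might look roughly like the following sketch. The feature id, label, provider, author name, and plugin id are all hypothetical placeholders; only the element structure is the point.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical feature; ids, names, and URLs are placeholders -->
<feature
	id="com.example.project.feature"
	label="Example Project Feature"
	version="1.0.0.qualifier"
	provider-name="Example Provider">

	<description url="https://example.com/project">
		Example feature description.
	</description>

	<!-- Replaces the "[Enter Copyright Description here.]" default -->
	<copyright>
		Copyright © 2019 Example Author
	</copyright>

	<!-- Shown in the license-acceptance dialog during installation -->
	<license url="http://www.apache.org/licenses/LICENSE-2.0">
		Licensed under the Apache License, Version 2.0. The full text
		is available at the URL for this license entry.
	</license>

	<plugin
		id="com.example.project"
		download-size="0"
		install-size="0"
		version="0.0.0"
		unpack="false"/>
</feature>
```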

But What License to Use to Begin With?

I'll finish off this post with what is actually the most important part of the process, but which can usually be answered simply. There are a lot of open-source licenses out there, and you could theoretically make up your own, but for our purposes the choice tends to follow some basic rules:

  • If you're contributing to an established open-source organization, use theirs - for example, if you're contributing to Eclipse, use the EPL.

  • If you want your code to be mixed with other open-source projects and (potentially) proprietary ones, pick Apache or something like it.

    • At OpenNTF, we have a preference for Apache over other similar licenses, because it's well-established and makes copyright handling clearer than the equivalents, something that is critical for large companies. Let past lawyers do your legwork on this one.
  • If you want to require that users of your code keep the code open source, consider the GPL.

    • Be extremely wary of this, however: the GPL is intentionally "infectious" and limits how the code can be used. Various projects carve out little exceptions to the GPL to allow use in otherwise-non-GPL products, but it's still something of a minefield.
    • The GPL is one of the approved licenses for OpenNTF, but we kind of discourage it except in cases where a project is GPL because it's derived from previously-GPL'd code.
  • If you don't want to be bothered too much by copyright and just want the code out there, consider Public Domain. In practice, it's usually best for you to retain copyright, but explicitly declaring Public Domain is certainly an effective way of allowing any use.

For projects in our community, the quick answer is "use Apache". It's permissive, covers copyright, and is known and trusted by pretty much everyone.

More Work Than It's Worth?

Both the earlier parts of this post and Betteridge's Law contribute to making it clear that my answer is "no, it's not more work than it's worth", but I can certainly see why it'd feel that way. The first couple times I submitted projects to OpenNTF and got a "here's some stuff to fix" email from Peter Tanner, part of me definitely chafed at the whole thing. That can be particularly the case for Notes-based contributions - sometimes, you just want to plunk an NTF on the project page and be done with it, and Notes certainly doesn't have a "wrap this NSF copy in a ZIP with LICENSE and NOTICE files" checkbox.

However, as I learned more about the legal importance of having licenses correct and got more practice at doing this stuff from the start, I started to appreciate the whole process. It also turned out to be really helpful to sort this stuff out on smaller projects before working on larger ones, especially ones with established teams and procedures.

In all, it's worth it both for letting you contribute to larger projects and, regardless of project size, for anyone consuming your code.
