Lots and Lots of Eclipse Bits

The Eclipse Development Process doesn’t make any specific requirements on what projects are supposed to distribute. Projects are required to operate in an open and transparent manner, so I guess we could say that projects are required to distribute their source code. And they do distribute source code via Git, SVN, and CVS. But how projects provide their code outside of the source code repositories is really left to the individual projects.

Most Eclipse projects distribute something from the download server. Some projects provide p2 repositories. Some provide archives (generally in the form of ZIP or tar.gz files) of project bundles. Some of these archives include bundles from other projects (including project bundles and third-party libraries). Eighty (80) projects, for example, distribute at least one bundle from EMF alongside their own bundles. Seventy-five (75) projects ship at least one bundle from the Eclipse Platform. Thirty-five (35) projects include bundles from ECF with their distribution.

I can’t remember the last time that I installed any Eclipse software from an archive file. If you excuse the Eclipse for RCP and RAP Developers package, that is. It’s literally been years since I’ve downloaded any other project’s archive file and installed it “the old-fashioned way”.

I posed the archive question on Twitter. @dougschaefer replied with “Off-line installs”. This makes sense when you consider users who are stuck behind a firewall and can’t use p2. @irbull suggested that building target environments is still problematic and it is oftentimes easier to just piece together a handful of archives. @njbartlett added that “offline installs are crucial for setting up training courses”. These are great reasons to keep these archives, so I’m willing to accept that having them is valuable (my own usage patterns notwithstanding).

But then, why do so many projects include bits from other projects? Some of the downloads are “all-in-one” packages: convenient distributions that pull all the necessary bits together into a single handy download. This is also good stuff, but it certainly adds a lot of weight to our download server when you think about multiple versions of “all-in-one” packages for multiple platforms. This can easily become gigabytes of data in very short order.

I started this discussion because disk space use continues to increase at an alarming rate. We are very concerned not only about the rising cost of maintaining our own servers and backups, but also about the cost to our many mirrors. Some mirrors are already selective with regard to what they rsync from our servers. And Denis has done a good job of identifying files that should not be replicated. Even after that pruning, however, there are a lot of bits left. As the volume increases, we run the risk of angering, and possibly losing, mirrors.

If we assert that all the different forms in which project code is distributed today are necessary and vital to the ongoing health of the projects and community, what should we say about retention? I know that many projects have a retention policy for the bits they provide for download. Some even document their policy. For most projects, however, it’s more ad hoc. How long before bits are moved from the download server to the archive server? How long before they’re just moved to the big bit-bin in the sky? What–if anything–do you keep forever? How long are nightly, integration, and milestone builds retained?

I really don’t want to create a formal policy for this. Projects have enough burden without adding still more to the pile. This is, however, the sort of thing that projects need to think about. At least a little. It’s another example of the tragedy of the commons: the longer we wait for others to step up and take care of the problem for us, the more likely it is that we’re all going to lose out.

It would be helpful, for example, if files that aren’t downloaded all that much (especially large ones) could be moved to the archive server, which is not mirrored. Assuming the download script is used, this move will be transparent to your community. If files start to become more popular, they can be moved back.
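
To make that concrete, here is a rough sketch of the sort of housekeeping script that could do the move. The stats file, paths, and thresholds are all hypothetical stand-ins for whatever download numbers your project actually tracks:

```python
import csv
import os
import shutil

# All paths, thresholds, and the stats file format here are hypothetical.
DOWNLOADS = "/home/data/httpd/download.eclipse.org/example"
ARCHIVE = "/home/data/httpd/archive.eclipse.org/example"
STATS = "download-counts.csv"  # rows of "relative/path,downloads last month"

SIZE_THRESHOLD = 100 * 1024 * 1024  # only bother with files over 100 MB
COUNT_THRESHOLD = 10                # fewer than 10 downloads last month

with open(STATS) as f:
    counts = {path: int(count) for path, count in csv.reader(f)}

for relpath, count in counts.items():
    source = os.path.join(DOWNLOADS, relpath)
    # Move big, rarely-downloaded files to the (unmirrored) archive server.
    if (os.path.isfile(source)
            and os.path.getsize(source) > SIZE_THRESHOLD
            and count < COUNT_THRESHOLD):
        target = os.path.join(ARCHIVE, relpath)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.move(source, target)
        print("archived:", relpath)
```

Moving a file back when demand picks up again is the same operation in reverse.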

I consider this the start of the discussion (or–more likely–the continuation of an ongoing discussion). What can the Eclipse Foundation do to help?

For some reason I’m reminded of the bit in This is Spinal Tap, “Are there any requests? … get off the what?”

2 Responses to Lots and Lots of Eclipse Bits

  1. Ian Bull says:

    All-In-Ones are fine; they make consumption easier. But for targets, I would prefer that projects didn’t ship pieces from other projects. To make things worse, projects sometimes ship different versions, and then I’m left scratching my head: did they *really* need that version?

  2. Konstantin Komissarchik says:

    > What can the Eclipse Foundation do to help?

    I think a considerable improvement can be made by increasing the sophistication of the downloads server infrastructure. I will give a few concrete suggestions…

    1. Instead of a manually-managed downloads/archives split, keep a single logical pool of downloadable artifacts and automatically assign a “demand class” based on actual download numbers. Use the demand class to control which server the artifact resides on, whether it is mirrored, how widely it is mirrored, etc. In anticipation of a big release, the demand class for certain artifacts could be manually set with an expiration date of, say, a week. After that time, it reverts to being based on actual download numbers. (See the first sketch after this list.)

    2. The requirement to provide a zipped repository is well documented, but there is no reason that both the online repository and the zipped repository need to be kept on the download server. The zip can be created on demand (and cached when necessary). See how Hudson does this with build artifacts, and the second sketch after this list.

    3. Managing downloads via SFTP and other system-level means is difficult and error-prone, especially for the many projects that do not have a dedicated or semi-dedicated releng person who works regularly with these tools. An online portal for managing builds published to the downloads server would go a long way toward making it more likely that projects clean up stuff that is no longer necessary. Such a portal could also make it possible to enter expiration rules, an initial demand class (see #1), etc.

    4. Consider making the downloads server exclusively for release builds and maybe milestone builds. The idea here is that by definition, the demand for dev builds is much lower and they can be served from the build server (Hudson) directly. Hudson, of course, already has pretty sophisticated means for monitoring disk usage and automatically controlling how builds are retained.
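
A minimal sketch of how the “demand class” suggested in #1 might be computed. The class names, boundaries, and placement rules here are invented for illustration; a real implementation would presumably be driven by the Foundation’s actual download statistics:

```python
from datetime import date, timedelta

# Hypothetical demand classes, highest demand first. The names, boundaries,
# and placement rules are invented for illustration.
CLASSES = [
    ("hot",  10000),  # main download server, replicated to all mirrors
    ("warm", 1000),   # main download server, selected mirrors only
    ("cool", 100),    # main download server, not mirrored
    ("cold", 0),      # archive server
]

def demand_class(downloads_last_30_days, override=None):
    """Assign a demand class from actual download numbers.

    `override` is an optional (class_name, expires) pair, set manually in
    anticipation of a big release; once the expiry date passes, the class
    reverts to being based on the numbers alone.
    """
    if override is not None:
        name, expires = override
        if date.today() <= expires:
            return name
    for name, threshold in CLASSES:
        if downloads_last_30_days >= threshold:
            return name
    return CLASSES[-1][0]

# A quiet artifact lands in "cool"; pinning it "hot" for a week overrides that.
print(demand_class(420))                                             # cool
print(demand_class(420, ("hot", date.today() + timedelta(days=7))))  # hot
```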
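
And a minimal sketch of the on-demand zip from #2, in the spirit of (but not taken from) what Hudson does with build artifacts; the paths are hypothetical:

```python
import os
import zipfile

# Hypothetical locations; the point is the technique, not the paths.
REPO_ROOT = "/home/data/httpd/download.eclipse.org/example/repository"
CACHE = "/tmp/example-repository.zip"

def zipped_repository():
    """Return a zip of the p2 repository, built on first request and
    reused until something in the repository changes."""
    newest = max(
        os.path.getmtime(os.path.join(root, name))
        for root, _, names in os.walk(REPO_ROOT) for name in names
    )
    # Rebuild only when the cached zip is missing or stale.
    if not os.path.exists(CACHE) or os.path.getmtime(CACHE) < newest:
        with zipfile.ZipFile(CACHE, "w", zipfile.ZIP_DEFLATED) as zf:
            for root, _, names in os.walk(REPO_ROOT):
                for name in names:
                    full = os.path.join(root, name)
                    zf.write(full, os.path.relpath(full, REPO_ROOT))
    return CACHE
```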
