The Six-Way Epic: Digging Further into FLOSS Repositories

Not too long ago, I announced the publishing of my first journal article co-authored with Andrea Capiluppi and Cornelia Boldyreff. My mother was very proud — even if she did not understand a single word of it. I will give a brief summary of the article in this post, and if I succeed in whetting your appetite then you can go over to those nice people at Elsevier and buy the article.

The work examines a number of FLOSS repositories to establish whether there are substantive differences between them in terms of a handful of evolutionary attributes. My previous work on this issue has already been discussed in earlier posts. The earlier work compared a Debian sample of projects to a SourceForge sample. The attributes compared were:

  • Size (in lines of code);
  • Duration (time between first and last commit to the version control system);
  • Number of developers (monthly average);
  • Number of commits (monthly average).
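To make the three time-based attributes concrete, here is a minimal sketch of how they could be derived from a commit log. It is not the tooling we used for the article: the function name, the `(date, author)` input format, and the choice to average over every calendar month in the project's lifespan (including idle months) are all assumptions made for illustration. Size is left out because it needs the source tree rather than the log.

```python
from collections import defaultdict
from datetime import date


def evolutionary_metrics(commits):
    """Summarise a commit log given as a non-empty list of (date, author) tuples.

    Hypothetical helper: the averaging rule below is an assumption,
    not necessarily the definition used in the article.
    """
    dates = [d for d, _ in commits]
    first, last = min(dates), max(dates)

    # Duration in calendar months, counting both the first and last month.
    months = (last.year - first.year) * 12 + (last.month - first.month) + 1

    # Distinct authors seen in each calendar month that had any activity.
    authors_by_month = defaultdict(set)
    for d, author in commits:
        authors_by_month[(d.year, d.month)].add(author)

    return {
        "duration_months": months,
        "commits_per_month": len(commits) / months,
        "developers_per_month": sum(len(a) for a in authors_by_month.values()) / months,
    }


# Made-up log: two developers active in January, one in March, none in February.
example_log = [
    (date(2008, 1, 5), "ann"),
    (date(2008, 1, 20), "bob"),
    (date(2008, 3, 2), "ann"),
    (date(2008, 3, 9), "ann"),
]
example = evolutionary_metrics(example_log)
print(example)
```

Dividing by the full lifespan means an idle month drags both averages down; dividing only by active months would be an equally defensible choice and would give higher figures.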

It was found that Debian projects were older, larger, and attracted more developers who achieved a greater rate of commits to the codebase, all to a significant degree.

For the journal article we once again used this approach, but this time we cast our net wider and examined six FLOSS repositories, then set out to answer some questions. Is each repository significantly different to all others? Based on the results, what is the nature of these differences? And are there any notable similarities among repositories — after all, some of the repositories are very similar on the surface, as you will see. The chosen repositories were:

  • Debian — a GNU/Linux distribution that hosts a large number of FLOSS projects;
  • GNOME — a desktop environment and development platform for the GNU/Linux operating system;
  • KDE — another desktop environment and development platform for the GNU/Linux operating system;
  • RubyForge — a development management system for projects programmed in the Ruby programming language;
  • Savannah — a central point for the development of many free software projects, particularly those associated with GNU;
  • SourceForge — another development management system for FLOSS projects, and a very popular one at that.

Once again we took a sample of projects from each repository and analysed each one to obtain the four metrics listed above. These values were aggregated together per repository. For an initial idea of the distribution of values we have these boxplots:

Boxplots of measured attributes per repository

The answer to the first question (is each repository different to all others?) is definitely no, although some differences are clearly hinted at by the boxplots. To ascertain more about these differences, and to answer the subsequent questions, we carried out pairwise comparisons between the repositories (6 repositories give 15 pairs, hence 15 comparisons). For each comparison the difference was tested to see whether it was statistically significant. The exact figures are printed in the article, but this is the summary of what was found.
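If the figure of 15 looks like it came out of thin air, it is just the number of ways of picking 2 repositories out of 6, that is 6 × 5 / 2. The short sketch below enumerates the pairs using Python's standard library. It only lists the pairs; the significance test applied to each one is described in the article and is not reproduced here.

```python
from itertools import combinations
from math import comb

REPOSITORIES = ["Debian", "GNOME", "KDE", "RubyForge", "Savannah", "SourceForge"]

# Every unordered pair of distinct repositories, e.g. ("Debian", "GNOME").
pairs = list(combinations(REPOSITORIES, 2))

# n choose 2 = n * (n - 1) / 2, which is 15 when n = 6.
assert len(pairs) == comb(len(REPOSITORIES), 2) == 15

for first, second in pairs:
    print(f"{first} vs {second}")
```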

  • Size: Debian was the clear winner. Projects in KDE and GNOME were of notably similar size, as were those in Savannah and SourceForge. Projects in the former group were smaller on average than those in the latter;
  • Duration: These results furnished perhaps the most striking evidence of a measurable divide between the chosen repositories (Debian, KDE and GNOME on one hand, and RubyForge, Savannah and SourceForge on the other), a divide that was also observable in some other attributes. We were also suspicious of the RubyForge results given the extreme youth of the projects;
  • Developers: Another divide between the two “groups” identified above;
  • Commits: As with the average number of developers, Debian, GNOME and KDE all manage a higher rate of commits, but the significance of the differences from the other three repositories is weaker. We also suspect the RubyForge commit rate to be artificially high. As already noted, the projects in the RubyForge sample tended to have a very low duration. After digging a little deeper, we suggest that the projects in our sample may have been “dumped” into the repository (which records a number of commits) and quickly ceased any development activity, thereby inflating the monthly rate.
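The RubyForge effect is easy to see with a little arithmetic. The numbers below are invented purely for illustration and are not taken from our samples: a project that is imported in a burst and then abandoned after a month can show a far higher monthly commit rate than a project that has been steadily maintained for years.

```python
def monthly_commit_rate(total_commits, duration_months):
    """Average commits per month over the project's recorded lifespan."""
    return total_commits / duration_months


# Hypothetical "dumped" project: 40 commits recorded during a single month.
dumped = monthly_commit_rate(total_commits=40, duration_months=1)

# Hypothetical long-lived project: 400 commits spread over four years.
long_lived = monthly_commit_rate(total_commits=400, duration_months=48)

print(f"dumped:     {dumped:.2f} commits/month")
print(f"long-lived: {long_lived:.2f} commits/month")
```

The long-lived project has ten times as many commits in total, yet the dumped one appears almost five times more active per month, which is why a short duration makes the rate hard to trust.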

As mentioned above, detailed figures, procedures and conclusions are available in the printed article. And it does not end there… later in the article we went further. The patterns we found among the repositories were formulated into a framework for organizing FLOSS repositories according to evolutionary characteristics. This may have an impact on individual projects existing in, and moving through, the ecosystem of repositories, which I hope will be of interest to researchers and developers alike.

What is Open Source?

Computer Floss is a video series that aims to enlighten a general audience about free/open source software. Here is a transcript of the first episode:

Welcome to Computer Floss, a series of videos all about the open source software movement. In this series I’ll be trying to enlighten and inform you about what open source actually is, how it works, why it matters and who’s doing it.

You might have heard this phrase “open source” before, but it may not mean much to you. So you may be thinking: What is it? Why should I care? What does this open source thingy matter to me?

Dilbert... a nerd.

To begin to answer these questions, I’ll have to lay out one or two fundamentals. As I’m sure you’re aware, programmers are spotty nerds that sit in front of a computer typing away all day long writing programs, but what are they actually doing when they write programs? They’re writing a collection of instructions, and these lay out exactly how a program behaves. These instructions are called source code, and what’s critical about source code is that it’s understandable by human beings…and programmers too.

Source code is *not* understandable by a computer, so before it can be run by a computer it has to be put through a special program called a compiler and turned into what’s called machine code, that archetypal binary sequence of 1s and 0s that only a computer can make sense of.

And so, at the end of a compilation, you have two things: you have the program that’s made up of machine code, and you still have the source code you began with. It’s essential to keep that source code, because if you ever want to extend your program, or fix it when something goes wrong (as it inevitably does), you need to amend the source code and run it through the compiler again to create an updated copy of the binary software.

With these fundamentals explained, we can use them to define “open source software”, so here goes: For computer software to be “open source”, the author must permit users of the software access to the source code, and grant them the right to change and redistribute that source code according to their needs. In recent years, this concept has become important to a great many organizations, to the extent that *you* might have heard of it and now prowl the internet looking for videos that explain what the hell it is.

Now we know what it is, we should know *why* it’s important. There are many reasons why, and later videos will explain them, but this is a good opportunity to quickly dip our little toe into history and tell the story of the spiritual father-figure of open source software, bearded computer god, Richard Stallman.

Richard Stallman... bearded

Stallman was a programmer in the 1970s, and up until that point hardware was king; computer companies cared only about selling computers, and software was just a boring sideshow. But this notion died away along with disco as the 1980s set in. Gradually, software became important, and lots of companies thought that they could make more money by selling only the binary code and keeping the source code a secret, rendering it proprietary. Stallman grew increasingly frustrated by this, until everything came to a head when his organization got a new printer. Unlike with their old one, the source code that controlled this new printer was kept secret and proprietary by the supplier, so Stallman was no longer able to fix the faults when the damned thing wouldn’t work properly. It would have been quite a simple job for Stallman to fix the faults, and even tailor the software to his organization’s particular needs. But the supplier denied Stallman’s request for a copy of the source code, citing it as a trade secret. Fed up with the increasing inability to alter the software that he had paid for, he quit and started the GNU Project, devoted to developing what he called “free software”, for which “open source software” is basically an alternative name.

The printer story is an important one because it illustrates what happens when source code is kept proprietary. Under these conditions, software essentially becomes a black box which closes off the insides to any amateur tinkering, like sealing up the engine of your car. In fact, worse than that, it welds the box shut in such a way that it’s impossible for *anyone* to get inside to make any changes whatsoever, other than the original manufacturer. When software is open source, it guarantees that you, or anyone you choose, can alter the software in whatever way you want. And that’s, well, good isn’t it? After all, even if you knew nothing about car mechanics, you’d still prefer that your engine wasn’t welded permanently shut.