The Six-Way Epic: Digging Further into FLOSS Repositories

Not too long ago, I announced the publication of my first journal article, co-authored with Andrea Capiluppi and Cornelia Boldyreff. My mother was very proud, even if she did not understand a single word of it. I will give a brief summary of the article in this post, and if I succeed in whetting your appetite then you can go over to those nice people at Elsevier and buy the article.

The work examines a number of FLOSS repositories to establish whether there are substantive differences between them in terms of a handful of evolutionary attributes. My previous work on this issue has already been discussed in earlier posts. The earlier work compared a Debian sample of projects to a SourceForge sample. The attributes compared were:

  • Size (in lines of code);
  • Duration (time between first and last commit to the version control system);
  • Number of developers (monthly average);
  • Number of commits (monthly average).

It was found that Debian projects were older, larger, and attracted more developers who achieved a greater rate of commits to the codebase, all to a significant degree.

For the journal article we once again used this approach, but this time we cast our net wider and examined six FLOSS repositories, then set out to answer some questions. Is each repository significantly different to all others? Based on the results, what is the nature of these differences? And are there any notable similarities among repositories — after all, some of the repositories are very similar on the surface, as you will see. The chosen repositories were:

  • Debian — a GNU/Linux distribution that hosts a large number of FLOSS projects;
  • GNOME — a desktop environment and development platform for the GNU/Linux operating system;
  • KDE — another desktop environment and development platform for the GNU/Linux operating system;
  • RubyForge — a development management system for projects programmed in the Ruby programming language;
  • Savannah — a central point for the development of many free software projects, particularly those associated with GNU;
  • SourceForge — another development management system for FLOSS projects, and a very popular one at that.

Once again we took a sample of projects from each repository and analysed each one to obtain the four metrics listed above. These values were aggregated together per repository. For an initial idea of the distribution of values we have these boxplots:

Boxplots of measured attributes per repository

The answer to the first question (is each repository significantly different to all others?) is definitely no, although some differences are clearly hinted at by the boxplots. To learn more about these differences, and to answer the subsequent questions, we carried out pairwise comparisons between the repositories (6 repositories give 15 pairs, hence 15 comparisons). For each comparison the difference was tested to see whether it was statistically significant. The exact figures are printed in the article, but this is the summary of what was found.

  • Size: Debian was the clear winner. Projects in KDE and GNOME were of notably similar size, as were those in Savannah and SourceForge. Projects in the former pair were smaller on average than those in the latter;
  • Duration: These results furnished perhaps the most striking evidence of a measurable divide between the attributes of the chosen repositories (Debian, KDE, and GNOME on one hand, and RubyForge, Savannah and SourceForge on the other), which was also observable in some other attributes. We were also suspicious of the RubyForge results given the extreme youth of the projects;
  • Developers: Another divide between the two “groups” identified above;
  • Commits: As with the average number of developers, Debian, GNOME and KDE all manage a higher rate of commits, but the significance of the differences from the other three repositories is weaker. We also suspect the RubyForge commit rate to be artificially high. As already noted, the projects in the RubyForge sample tended to have a very low duration. After a little deeper digging, we suggest that the projects in our sample may have been “dumped” into the repository (which records a number of commits) and quickly ceased any development activity, thereby inflating the monthly rate.
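
For readers who like to see the mechanics, here is a minimal sketch of the pairwise set-up described above. It enumerates the 15 repository pairs and, for any two samples, estimates a p-value with a simple permutation test on the difference in means. The permutation test is only a stand-in chosen because it needs nothing beyond the standard library; the article describes the procedure actually used. The function names, the number of rounds and the seed are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import fmean
import random

REPOSITORIES = ["Debian", "GNOME", "KDE", "RubyForge", "Savannah", "SourceForge"]


def repository_pairs(names):
    """Return every unordered pair of repositories."""
    return list(combinations(names, 2))


def permutation_p_value(a, b, rounds=2000, seed=0):
    """Estimate a two-sided p-value for the difference in means.

    Stand-in test (an assumption, not the article's method): shuffle the
    pooled values and count how often a random split is at least as
    extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(fmean(a) - fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(fmean(left) - fmean(right)) >= observed:
            hits += 1
    return hits / rounds


pairs = repository_pairs(REPOSITORIES)
print(len(pairs))  # 15
```

With real data you would call permutation_p_value once per pair and per attribute, passing in the two samples of project values.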

As mentioned above, detailed figures, procedures and conclusions are available in the printed article. And it does not end there: later in the article we went further. The patterns we found among the repositories were formulated into a framework for organizing FLOSS repositories according to evolutionary characteristics. This may have an impact on individual projects existing in, and moving through, the ecosystem of repositories, which I hope will be of interest to researchers and developers alike.

Going to the CSMR Conference

CSMR 2009

This year’s European Conference on Software Maintenance and Reengineering is in Kaiserslautern, Germany. I’ll be there, but my fellow researchers in the Centre of Research on Open Source Software won’t: we’ve been accepted to so many different conferences that dividing ourselves up is the only way we can afford to attend them all. So for the benefit of my colleagues (and you, dear reader, if you so wish) I’ll try and find time to blog about the more notable presentations I see there.

Debian vs. SourceForge – Round 3

The tour through the comparison between Debian and SourceForge comes to a close by questioning whether Debian acts as a catalyst to evolutionary activity when a project is inserted into the repository. It has already been strongly suggested that projects packaged in Debian are recipients of significantly greater rates of activity.

Of the 50 projects in the Debian sample, 22 had a known history of evolutionary activity (monthly averages of number of developers and number of commits) that pre-dated their insertion into Debian, providing us with a “before” and “after”. So we compared the before and after of each project.

Developers

In 18 out of 22 projects, the number of distinct developers increases after being added to Debian. The remaining 4 experience no change, and have only 1 or 2 known contributors.

Commits

All projects have a greater number of commits in the after period than in the before period. However, the rate of commits in each period (the total commits within that period divided by its duration) only increases for 10 of the 22.
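
The distinction between total commits and commit rate is easy to see with a small worked example. The sketch below uses the definition given above (total commits in a period divided by the period's duration); the Period class, the use of months as the unit and the figures are all hypothetical, picked only to show how the total can rise while the rate falls.

```python
from dataclasses import dataclass


@dataclass
class Period:
    commits: int   # total commits recorded in the period
    months: float  # duration of the period (months assumed as the unit)

    def rate(self) -> float:
        """Commits per month: total commits divided by duration."""
        return self.commits / self.months


def compare(before: Period, after: Period) -> dict:
    """Report whether the total and the rate increased after insertion."""
    return {
        "more_commits": after.commits > before.commits,
        "higher_rate": after.rate() > before.rate(),
    }


# Hypothetical project: 120 commits in the 12 months before insertion,
# 300 commits in the 60 months after it.
result = compare(Period(120, 12), Period(300, 60))
print(result)
```

Here the after period has more commits in total, but its rate (5 per month) is lower than the before rate (10 per month), simply because the after period is so much longer.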

Summary

In summary of this trilogy I can say, from an absolute standpoint, that our results suggest Debian projects tend to be older, larger, attract more developers and a greater amount of activity, and all to a very significant degree. Furthermore, from an evolutionary perspective, the “Debian effect” seems to cause the pool of developers contributing to a project to increase when it is packaged by Debian, along with a half-decent chance that activity increases also.

Debian vs. SourceForge – Round 2

And so, we revisit the posers put up in a previous post:

  1. Are Debian’s evolutionary characteristics significantly different to those of SourceForge?
  2. Does Debian act as a catalyst?

To answer these questions, we took a closer look at the software inside them. I’ll briefly explain the method here, but details of the steps will be part of later posts in the “Research Methods” strand.

We chose mutually exclusive samples of 50 packages from Debian and 50 projects from SourceForge. In both cases they were taken from the pool of “stable” projects only. They were all downloaded and each project’s activity was extracted from its version control system (using log commands) and recorded in a file. Then we delved into our little toolbox and used some nifty tools to extract the information we needed, that information being the project’s:

  • Age (time between first and last commit to the version control system)
  • Size (in lines of code)
  • Number of developers (monthly average)
  • Number of commits (monthly average)
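
As a rough illustration of how three of these attributes can be derived from a commit log, here is a minimal sketch. It assumes the log has already been reduced to (date, author) pairs, and it assumes that a monthly average is taken over every calendar month from the first commit to the last, inclusive; the tools and definitions actually used may differ. Size in lines of code is left out because it needs the source tree rather than the log. The function name and the sample log are hypothetical.

```python
from collections import defaultdict
from datetime import date


def evolution_metrics(commits):
    """Compute age and monthly averages from (date, author) pairs."""
    commits = sorted(commits)
    first, last = commits[0][0], commits[-1][0]
    # Calendar months spanned, counting both the first and the last month.
    months = (last.year - first.year) * 12 + (last.month - first.month) + 1

    # Distinct authors seen in each calendar month.
    authors_by_month = defaultdict(set)
    for day, author in commits:
        authors_by_month[(day.year, day.month)].add(author)

    active = sum(len(authors) for authors in authors_by_month.values())
    return {
        "age_days": (last - first).days,
        "commits_per_month": len(commits) / months,
        "developers_per_month": active / months,
    }


# Hypothetical log: four commits by two authors over three months.
log = [
    (date(2008, 1, 10), "alice"),
    (date(2008, 1, 20), "bob"),
    (date(2008, 2, 5), "alice"),
    (date(2008, 3, 15), "alice"),
]
metrics = evolution_metrics(log)
print(metrics)
```

Months with no commits still count towards the averages here, which is a design choice: it stops long quiet spells from being ignored.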

Each attribute can be aggregated from the 50 projects into a summary value for the repository. So, for example, we can take the ages of the 50 Debian projects and use them to get a mean or a median age. If we do the same thing for SourceForge we can compare them.
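
In code, that aggregation step might look like the sketch below. The ages are made-up numbers for five projects per repository (not the real samples of 50), and summarise is a hypothetical helper name.

```python
from statistics import mean, median


def summarise(values):
    """Collapse one attribute across a repository's projects."""
    return {"mean": mean(values), "median": median(values)}


# Hypothetical project ages in months, for illustration only.
debian_ages = [30, 48, 60, 72, 120]
sourceforge_ages = [6, 12, 18, 24, 90]

debian_summary = summarise(debian_ages)
sourceforge_summary = summarise(sourceforge_ages)
print(debian_summary, sourceforge_summary)
```

The median is often the safer summary for this kind of data, since a single very old or very large project can drag the mean a long way.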

And that’s just what we did.

And here’s just what we found:

Boxplots of measured attributes

Using statistical significance testing (again, I’ll cover this in a “Research Methods” post) we found that Debian projects had larger values for each attribute, i.e. they were older, larger, and attracted more developers who performed a greater amount of work, all to a significant degree.

This leads us to our second question: is Debian responsible? Is it somehow a driver for these larger values? Our answer to this question comes in round 3.

Debian vs. SourceForge – Round 1

We all know about SourceForge and Debian. Although they have different purposes, they both act as repositories of free software, and most practitioners will know that Debian hosts what are considered to be the best projects, judged most worthy by its army of package maintainers. Conversely, many (but by no means all) SourceForge projects languish in obscurity; these are, at best, of little interest outside of the developers who run them, or, at worst, have completely stalled. It is conventional wisdom then that Debian projects receive much more activity from developers than those on communities like SourceForge.

So today’s research question is: How true is this? How much more activity (if at all) do projects in Debian actually receive than their counterparts in SourceForge? To answer this query, two quantifiable and measurable questions are proposed:

  1. Are the evolutionary characteristics of Debian projects significantly different from those in SourceForge? (In other words, do Debian projects receive so much more activity that we cannot conclude that random statistical noise is responsible for the difference?)
  2. Does Debian act as a “catalyst”, so that when projects are entered into Debian’s repository, the activity around them increases?

To answer the questions, we need to measure proxies of evolutionary activity. We chose:

  • Project age
  • Project size
  • Number of developers
  • Number of commits

How these attributes were measured, and how they helped to answer the questions, will be addressed in the follow-up post.

In The Beginning…

Why write a blog?

Well, why not. It seems like everyone else is.

I’ve been racking my brains to decide what I have to blog, or rather what is interesting enough to share with people. My field is computers; specifically research. I’ve spent a few years researching free/open source software, and I think I’ve got into the stride of things enough now to start writing about it.

In this blog, most of the time I plan my entries to fall into one of three categories:

  1. Posts about my research: I’ll share my various little findings that might be of interest to people who want to understand more about free/open source. I’ll try and make them as easy to understand as possible; if you want the real technical treatment, I’ll point you to the technical paper.
  2. About approaches to research: I also want to pass on the methods and tools you can use to carry out research on software. I hope this will be of interest to practitioners as well as researchers.
  3. Videos: Another little pet project of mine (called Computer Floss) is to produce a series of videos for a general audience that explains all the various facets of open source. I’ve already begun, and you can see them over at:

http://youtube.com/user/directrod

Don’t ask why my username there is directrod.