Debian vs. SourceForge – Round 3

The tour through the comparison between Debian and SourceForge comes to a close by questioning whether Debian acts as a catalyst to evolutionary activity when a project is inserted into the repository. It has already been strongly suggested that projects packaged in Debian are recipients of significantly greater rates of activity.

Of the 50 projects in the Debian sample, 22 of them had a known history of evolutionary activity (monthly averages of number of developers and number of commits) that pre-dated its insertion into Debian, providing us with a “before” and “after”. So we compared the before and after of each project.

Developers

In 18 out of 22 projects, the distinct number of developers increases after being added to Debian. The remaining 4 experience no change, and have only 1 or 2 known contributors.

Commits

All projects have a greater number of commits in the after period than in the before period. However, the rate of commits in each period (the total commits within that period divided by its duration) only increases for 10 of the 22.

Summary

In summary of this trilogy I can say, from an absolute standpoint, that our results suggest Debian projects tend to be older, larger, attract more developers and a greater amount of activity, and all to a very significant degree. Furthermore, from an evolutionary perspective, the “Debian effect” seems to cause the pool of developers contributing to a project to increase when it is packaged by Debian, along with a half-decent chance that activity increases also.

Debian vs. SourceForge – Round 2

And so, we revisit the posers put up in a previous post:

  1. Are Debian’s evolutionary characteristics significantly different to those of SourceForge?
  2. Does Debian act as a catalyst?

To answer these questions, we took a closer look at the software inside them. I’ll briefly explain the method here, but details of the steps will be part of later posts in the “Research Methods” strand.

We chose a mutually exclusive sample of 50 packages from Debian and 50 projects from SourceForge.  In both cases they were taken from the pool of “stable” projects only. They were all downloaded and each project’s activity was extracted from their version control system (using log commands) and recorded in a file. Then we delved into our little toolbox and used some nifty tools to extract the information we needed, that information being the project’s:

  • Age (time between first and last commit to the version control system)
  • Size (in lines of code)
  • Number of developers (monthly average)
  • Number of commits (monthly average)

Each attribute can be aggregated from the 50 projects into a summary value for the repository. So, for example, we can take the ages of the 50 Debian projects and use them to get a mean or a median age. If we do the same thing for SourceForge we can compare them.

And that’s just what we did.

And here’s just what we found:

Boxplots of measured attributes
Boxplots of measured attributes

Using statistical significance testing (again, I’ll cover this in a “Research Methods” post) we found that Debian projects had larger values for each attribute, i.e. they were older, larger, and attracted more developers who peformed a greater amount of work, all to a significant degree.

This leads us to our second question, is Debian responsible? Is it somehow a driver for these larger values? Our answer to this question comes in round 3.

Debian vs. SourceForge – Round 1

We all know about SourceForge and Debian. Although they have different purposes, they both act as repositories of free software, and most of the practitioners will know that Debian hosts what is considered to be the best projects — judged most worthy by its army of package maintainers. Conversely, many (but by no means all) SourceForge projects languish in obscurity; these are, at best, of little interest outside of the developers who run them, or, at worst, have completely stalled. It is conventional wisdom then that Debian projects receive much more activity from developers than those on communities like SourceForge.

So today’s research question is: How true is this? How much more activity (if at all) do projects in Debian actually receive than their counterparts in SourceForge? To answer this query, two quantifiable and measurable questions are proposed:

  1. Are the evolutionary characteristics of Debian projects significantly different from those in SourceForge? (In other words, do Debian projects receive so much more activity that we cannot conclude that random statistical noise is responsible for the difference?)
  2. Does Debian act as a “catalyst”, so that when project are entered into Debian’s repository, the activity around the project increases?

To answer the questions, we need to measure proxies of evolutionary activity. We chose:

  • Project age
  • Project size
  • Number of developers
  • Number of commits

How these attributes were measured, and how they helped to answer the questions, will be addressed in the follow-up post.