We all know about SourceForge and Debian. Although they have different purposes, they both act as repositories of free software, and most practitioners will know that Debian hosts what are considered to be the best projects, those judged most worthy by its army of package maintainers. Conversely, many (but by no means all) SourceForge projects languish in obscurity; these are, at best, of little interest to anyone other than the developers who run them, or, at worst, have completely stalled. It is conventional wisdom, then, that Debian projects receive much more activity from developers than those on communities like SourceForge.
So today’s research question is: How true is this? How much more activity (if at all) do projects in Debian actually receive than their counterparts in SourceForge? To answer this query, two quantifiable and measurable questions are proposed:
- Are the evolutionary characteristics of Debian projects significantly different from those in SourceForge? (In other words, do Debian projects receive so much more activity that we cannot conclude that random statistical noise is responsible for the difference?)
- Does Debian act as a “catalyst”, so that when projects are entered into Debian’s repository, the activity around them increases?
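To make the "random statistical noise" idea in the first question concrete, here is a minimal sketch of a permutation test on the difference of means between two groups of projects. It is an illustration only, not the test used in the study; the function name `permutation_test`, its parameters, and the sample numbers in the usage note are all assumptions made up for this example.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Hypothetical helper for illustration; not the study's method.
    Returns an estimated p-value: the share of random relabellings
    whose mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values and split them into two
        # groups of the original sizes.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)
```

With made-up commit counts such as `permutation_test([120, 340, 95, 410], [12, 40, 7, 66])`, a small p-value would suggest that the gap between the two groups is unlikely to be noise alone. Real data on project activity is often heavily skewed, so a rank-based test may be a better fit in practice.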
To answer the questions, we need to measure proxies of evolutionary activity. We chose:
- Project age
- Project size
- Number of developers
- Number of commits
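
As a rough sketch of what these four proxies could look like in code, the snippet below derives them from a list of commit records. Everything here is an assumption for illustration: the `Commit` record, the `activity_proxies` function, the choice to treat age as the span between first and last commit, and the choice to treat size as net lines added. The measurements actually used in the study may be defined differently.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Commit:
    """Hypothetical commit record, e.g. parsed from a CVS or SVN log."""
    author: str
    day: date
    lines_added: int
    lines_removed: int


def activity_proxies(commits):
    """Compute the four proxies from a non-empty list of commits.

    Assumed definitions (not taken from the study):
    - age: days between the first and the last commit
    - size: net lines added over the whole history
    """
    if not commits:
        raise ValueError("need at least one commit")
    days = [c.day for c in commits]
    return {
        "age_days": (max(days) - min(days)).days,
        "size_lines": sum(c.lines_added - c.lines_removed for c in commits),
        "developers": len({c.author for c in commits}),
        "commits": len(commits),
    }
```

Counting distinct author strings is itself a simplification, since one developer can commit under several identities.
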
How these attributes were measured, and how they helped to answer the questions, will be addressed in the follow-up post.
Karl, what do you think of the more recent use of distributed revision control, where all participants have a full working local copy of the repository, e.g. GitHub and Gitorious? I woke up in the middle of the night the other day, thinking about this, and noted to myself: “Git could be Torvalds’s equivalent of the Linux kernel for collaboration. Provides a core tool set for many kinds of collaboration layers.” I then had a look around and saw I’m not the only one thinking this.
I also encourage you to measure concentration, i.e. do the existing people step forward and do more of the work, or do new people come online?
How do you intend to assess the statistical significance of any change? I looked into “interrupted time series” analysis for this; it’s largely an econometric technique which handles the endogeneity of the time series (which is a big problem for regression etc.).
Joss: Distributed version control repositories present a bit of a challenge to free software researchers. So far, I haven’t had much exposure to them (I’ve limited my research to CVS and SVN), but Git is rising in prominence, and I know there are researchers looking into it.
How researchers can deal with them when it comes to mining them depends on how they’re used. If they’re used in essentially the same “centralized repository” fashion as their predecessors, that shouldn’t be too much of a challenge. But when the repository is distributed over many locales, there are whole new challenges. Still, I suspect constructing a solution isn’t too difficult; after all, the developers themselves need to manage their repos in an organized way. As long as you know their modus operandi for this, you can know what you’re capable of doing with regard to mining.
James: This article is a brief description of some completed and published research. Perhaps your questions will be answered in the next post; or I can point you to a copy of the paper that features it.
Thanks, I’ll keep my eyes open. Feel free to send me the paper directly, too.
Cheers,
James
For the original paper this work appeared in: http://eceasst.cs.tu-berlin.de/index.php/eceasst/article/view/113/111