Moving on…

Joss Winn and the University of Lincoln have, since I left last year, kindly helped me maintain this blog on the Lincoln blog community. However, that requires the University to maintain my profile, even though I’ve officially left, and means people back in Lincoln have to keep acting on my behalf to do so.

Because of this, I’ve decided to move my blog. Not ideal (I can’t hack my blog like I can on Lincoln’s servers), but at least I will no longer pester people.

My thanks go to Joss and the University of Lincoln for providing me with the weblog and help maintaining it.

My new address is: All new content will go there.

Update your links accordingly!

An Introduction to Saros

I feel as though I am almost completely adjusted and acclimatised to my new surroundings, and furthermore I am now becoming fully immersed in my new work here at the Free University of Berlin (FU). Therefore, I think I can begin to introduce the new work I am now part of. My research concerns have now diversified a little: they continue to include FLOSS and empirical methods, but they also now encompass Agile Methods. Let me explain…

I am now part of Saros, an on-going project here at the Institute of Computer Science. Saros is a free/open source Eclipse plug-in that enables distributed collaborative software development within the Eclipse environment. Developers can work simultaneously on shared documents over a network, giving geographically dispersed programmers concurrent access. Changes are seen instantaneously by all participants, and Saros makes it clear who made which changes.

A screenshot of Saros

The plug-in is useful in a number of scenarios, such as collaborative development, joint code reviews, or knowledge transfer. Particularly cool is the support for distributed pair programming, which works by allocating the role of “driver” or “observer” to everyone involved. In time, more features will help Saros replicate the pair programming experience in a distributed environment.

It seems I have joined a project that is already thriving thanks to the hard work and dedication of both the researchers and the team of bachelor’s and master’s students who have contributed to the software in the previous months. Feedback from users is coming in steadily, the number of downloads is now measurable in thousands, and there are a few “testimonials” to Saros from outside parties (locatable on the FU website).

And there is plenty of life left in the project because a steady stream of students continues to sign up to work on new aspects. Together with the existing team they have some exciting new features in the works.

Why not go to the SourceForge page and try it out for yourself?

Techniques for Selection

Analysis steps

As we have seen before, this figure shows the stages of a typical approach to a post-hoc study of FLOSS, in which the researcher works like a digital archaeologist. The figure shows a series of stages, each of which includes some number of steps and yields some outcomes. Each outcome may or may not feed into the following stage. In this post, I will discuss the selection stage. Remember that this is the method I have preferred so far, and is the method that a number of my peers have used, in whole or in part. It is not the only method.


The point of selection is to choose the metrics that are going to be used to measure the attributes you are interested in, and also to compose a list of projects to study. On selecting metrics, I cannot be especially general. They are very closely tied to the goals of your study, and you need an effective way of measuring your success at achieving those goals. My personal preference is the “Goal-Question-Metric” (GQM) Approach. You can read about it elsewhere, but what attracts me to the GQM Approach is that it is a software engineering specific method that helps you come up with the right measures by forming the questions needed to achieve the goals you have set. These questions can also “suggest” the hypotheses needed in your study. It is not perfect, but “Goal-Question-Metric” is a useful parallel to “Research Question-Hypotheses-Measures”. I do think it important to do metric selection first; the reason will be apparent in what I shall say next.

Additionally, you will probably want to set some parameters for the investigation, usually to ensure your investigation remains valid. For example, if you are looking into some aspect of, say, forum activity then it probably makes no sense to include projects for which no forum activity exists. (At the same time, you should report what proportion of your initial sample is disqualified.) This may shrink the pool of projects you can choose from by eliminating some candidates, but it should not affect the metrics you choose: the investigation should be guided by what you want to measure, not by how easy something is to measure. Sometimes this is made quite simple for you. For instance, FLOSSMole is a service that provides you with meta-data about individual FLOSS projects in nicely-formatted lists. If you wish to prune such a list then it is easy to write a software tool to do it quickly for you, leaving only the “valid” candidates. Ask nicely and you could borrow mine.

So-called “filters” I have found myself testing for in past work have included:

  • Programming language
  • Version control system used
  • Product size
  • Development status

These factors can impact the validity of the study (e.g. can different programming languages be compared fairly), or be technical considerations (e.g. do I have tools that can analyse these programming languages). Both need careful thought.
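The pruning step described above can be sketched in a few lines of Python. This is only an illustration and not the tool mentioned earlier: the `Project` record, its field names, and the example values are all hypothetical, and a real list (from FLOSSMole, say) would need its own parsing step first. The sketch applies the four filters and also reports the proportion of the initial sample that was disqualified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    # Hypothetical record layout; real meta-data lists will differ.
    name: str
    language: str
    vcs: str
    size_loc: int
    status: str


def prune(projects, languages, vcs_systems, min_loc, statuses):
    """Keep only projects passing every filter.

    Returns the kept projects and the fraction that was disqualified.
    """
    kept = [
        p for p in projects
        if p.language in languages
        and p.vcs in vcs_systems
        and p.size_loc >= min_loc
        and p.status in statuses
    ]
    disqualified = 1 - len(kept) / len(projects) if projects else 0.0
    return kept, disqualified


# Made-up candidates, purely for illustration.
candidates = [
    Project("alpha", "C", "svn", 12000, "stable"),
    Project("beta", "Java", "cvs", 300, "planning"),
    Project("gamma", "C", "cvs", 45000, "stable"),
    Project("delta", "Perl", "svn", 8000, "beta"),
]

kept, dropped = prune(
    candidates,
    languages={"C", "Java"},
    vcs_systems={"svn", "cvs"},
    min_loc=1000,
    statuses={"beta", "stable"},
)
print([p.name for p in kept], f"{dropped:.0%} disqualified")
```

Note that the filters are passed in as parameters rather than hard-coded, so the same function can serve studies with different validity constraints.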

Further considerations include your selection method, i.e. how are you choosing the projects to study? If you are examining a very small number of projects, be sure your choice has some careful thought behind it. Generalizing from an analysis of just one or two projects can be tricky; a more focused comparative analysis, such as the work by Schach et al. comparing four different Unix-like operating system kernels, is probably more productive at that level. If you seek to generalize about FLOSS as a phenomenon, a number of studies (including my own) have now done so by analysing large samples of projects. In this case I think the consensus is that random selection from a filtered population of projects is the best approach.
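Random selection from a filtered population might look something like the sketch below. It assumes the population is simply a collection of project names (the names here are invented), and it fixes the random seed so that the same sample can be drawn again when the study is repeated; the function name and the seed value are my own choices for illustration.

```python
import random


def sample_projects(population, k, seed):
    """Draw k projects at random, without replacement, reproducibly."""
    # Sort first so the result does not depend on the input order.
    ordered = sorted(population)
    rng = random.Random(seed)
    return rng.sample(ordered, min(k, len(ordered)))


# Invented project names standing in for a filtered population.
filtered_population = [f"project-{i:03d}" for i in range(500)]
chosen = sample_projects(filtered_population, k=50, seed=2009)
print(len(chosen), chosen[:3])
```

Recording the seed alongside the filtered list is a cheap way of letting others reconstruct exactly which projects were analysed.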

And so the end of the selection process should be a list of projects you wish to analyse that feeds through to the next stage: retrieval.

A Life Update

I seem to have fallen off the blogosphere in the past few weeks. There is good reason for this, I assure you.

Firstly, I had been busy preparing for my viva voce, the oral defence of my PhD thesis and the last hurdle to completion. I am happy to say that I passed, I have my doctorate and I am now a free man!

Secondly, my time has also been taken up preparing for the impending move to my new position as a researcher at Freie Universität Berlin in Germany. All being well I should begin there next week. My work will continue to involve FLOSS, but will also focus on agile methods.

The University of Lincoln has very kindly permitted me to continue this blog. I will continue to write here on the topic of FLOSS, and hopefully I will be able to share my new work also.

Brayford Pool, University of Lincoln

I will miss Lincoln. It is a wonderful place: it is a very pretty city and the University has been good to me. I leave behind some great friends and associates, but I take many cherished memories with me.

The Ongoing Saga of Bletchley Park’s Survival

Bletchley Park, which was the codebreaking hub of the Allies during the Second World War and is now a sizeable and very entertaining museum, has had a rough time in recent years. Parts of it are in a dilapidated state and it seems to survive only on charitable donations and the careful devotion of those who work there. Recent news is encouraging: the Heritage Lottery Fund has given a provisional thumbs-up to a grant of around £500,000 (the application has now progressed to the next bidding stage), which is in addition to investment of around £1,000,000 from English Heritage and Milton Keynes Council. This is all good news and I hope more grants will be awarded. After all, the Park puts the cost of genuine renovation at around £10 million, meaning that these grants are still less than one-fifth of what is needed.

Whenever I am at Bletchley the word “potential” is always in my mind. The museum is already a great place to visit with plenty to see, and the whole site covers a sizeable area with much potential for suitable development. As well as the many exhibits from the computer stone age that make Bletchley a technophile’s dream, there are also many other exhibits about life during the Second World War (with stacks of contemporary artefacts) for those both at home and away at the front. I think Bletchley Park could be built up into a superb site, and a match for any heritage site in the country.

The Alan Turing Apology — Why?

Not so long ago, we celebrated the posthumous birthday of Alan Turing, who has achieved heroic status in the computing field. He made critical contributions to algorithms, computation, computer design and artificial intelligence, and famously played a role in the efforts to break the German codes during the Second World War. However, his story does not end well. Turing was gay at a time when homosexuality was illegal and considered an illness. He was forced to undergo humiliating “treatment” in the form of oestrogen injections, and had his security clearance revoked, leaving his life and career in ruins.

Recently, a petition has been raised on the Downing Street petition website demanding an apology from the UK government for this. I have not signed this petition, and I do not plan to, because I cannot see what it would achieve. (This seems to put me in opposition to many eminent people, not the least of whom is Prof. Richard Dawkins.) For one thing, who is the apology from? The gross mistreatment of Alan Turing occurred in 1952; a great many people in government today were not even born then, and anyone in authority at the time has long since retired or died. An apology would be curiously hollow.

The present government’s position on gay rights is quite clear from their actions over the past ten years, which far outweigh any apology or affirmative statement they could make; the government’s opinion on the injustice meted out to Turing can be easily deduced from those. And surely much better ways exist, as lead petitioner John Graham-Cumming puts it, for “people [to] hear about Alan Turing and realise his incredible impact on the modern world, and how terrible the impact of prejudice was on him.”

The Six-Way Epic: Digging Further into FLOSS Repositories

Not too long ago, I announced the publishing of my first journal article co-authored with Andrea Capiluppi and Cornelia Boldyreff. My mother was very proud — even if she did not understand a single word of it. I will give a brief summary of the article in this post, and if I succeed in whetting your appetite then you can go over to those nice people at Elsevier and buy the article.

The work examines a number of FLOSS repositories to establish whether there are substantive differences between them in terms of a handful of evolutionary attributes. My previous work on this issue has already been discussed in earlier posts. The earlier work compared a Debian sample of projects to a SourceForge sample. The attributes compared were:

  • Size (in lines of code);
  • Duration (time between first and last commit to the version control system);
  • Number of developers (monthly average);
  • Number of commits (monthly average).

It was found that Debian projects were older, larger, and attracted more developers who achieved a greater rate of commits to the codebase, all to a significant degree.
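To make the duration and monthly-average metrics listed above concrete, here is a minimal sketch of how they could be derived from a commit log. It is not the procedure from the article. It assumes the log has already been reduced to (date, author) pairs, and it assumes that averages are taken over every calendar month between the first and last commit, inactive months included; both the input format and that averaging choice are assumptions of mine.

```python
from datetime import date


def evolution_metrics(commits):
    """Compute duration and monthly averages from (date, author) pairs."""
    if not commits:
        raise ValueError("commit log is empty")
    dates = [d for d, _ in commits]
    first, last = min(dates), max(dates)
    # Calendar months spanned, counting both the first and last month.
    months = (last.year - first.year) * 12 + (last.month - first.month) + 1
    # Distinct authors active in each month that saw any commits.
    authors_by_month = {}
    for d, author in commits:
        authors_by_month.setdefault((d.year, d.month), set()).add(author)
    active_developers = sum(len(a) for a in authors_by_month.values())
    return {
        "duration_days": (last - first).days,
        "developers_per_month": active_developers / months,
        "commits_per_month": len(commits) / months,
    }


# Made-up logs, purely for illustration.
steady = [
    (date(2008, 1, 5), "ann"),
    (date(2008, 1, 20), "bob"),
    (date(2008, 2, 3), "ann"),
    (date(2008, 4, 1), "ann"),
]
imported_in_one_go = [(date(2008, 6, 2), "cat")] * 30

print(evolution_metrics(steady))
print(evolution_metrics(imported_in_one_go))
```

The second log shows a weakness of monthly averages: a project whose whole history lands in a single month gets a very high commit rate despite having no duration to speak of.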

For the journal article we once again used this approach, but this time we cast our net wider and examined six FLOSS repositories, then set out to answer some questions. Is each repository significantly different to all others? Based on the results, what is the nature of these differences? And are there any notable similarities among repositories — after all, some of the repositories are very similar on the surface, as you will see. The chosen repositories were:

  • Debian: a GNU/Linux distribution that hosts a large number of FLOSS projects;
  • GNOME: a desktop environment and development platform for the GNU/Linux operating system;
  • KDE: another desktop environment and development platform for the GNU/Linux operating system;
  • RubyForge: a development management system for projects programmed in the Ruby programming language;
  • Savannah: a central point for the development of many free software projects, particularly those associated with GNU;
  • SourceForge: another development management system for FLOSS projects, and a very popular one at that.

Once again we took a sample of projects from each repository and analysed each one to obtain the four metrics listed above. These values were aggregated together per repository. For an initial idea of the distribution of values we have these boxplots:

Boxplots of measured attributes per repository

The answer to the first question (is each repository significantly different to all others?) is definitely no, although some differences are clearly hinted at by the boxplots. To ascertain more about these differences, and answer the subsequent questions, we carried out pairwise comparisons of the repositories (with 6 repositories that gives 15 combinations, hence 15 comparisons). For each comparison the difference was tested to see whether it was statistically significant or not. The exact figures are printed in the article, but this is the summary of what was found.

  • Size: Debian was the clear winner. Projects in KDE and GNOME were of notably similar size, as were those in Savannah and SourceForge. Projects in the former group were smaller on average than those in the latter;
  • Duration: These results furnished perhaps the most striking evidence of a measurable divide between the attributes of the chosen repositories (Debian, KDE, and GNOME on one hand, and RubyForge, Savannah and SourceForge on the other), which was also observable in some other attributes. We were also suspicious of the RubyForge results given the extreme youth of the projects;
  • Developers: Another divide between the two “groups” identified above;
  • Commits: As with the average number of developers, Debian, GNOME and KDE all manage a higher rate of commits, but the significance of the differences from the other three repositories is weaker. We also suspect the RubyForge commit rate to be artificially high. As already noted, the projects in the RubyForge sample tended to have a very low duration. After a little deeper digging, we suggest that the projects in our sample may have been “dumped” into the repository (which records a number of commits) and quickly ceased any development activity, thereby inflating the monthly rate.
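The pairwise procedure described above (6 repositories, hence 15 comparisons) can be sketched as follows. The statistical test actually used is given in the article and is not reproduced here; as a stand-in, this sketch uses a simple permutation test on the difference of medians, which is my own substitution. The sample values are invented and carry no relation to the real results.

```python
import random
from itertools import combinations
from statistics import median


def permutation_p_value(a, b, rounds=2000, seed=0):
    """Two-sided permutation test on the absolute difference of medians."""
    rng = random.Random(seed)
    observed = abs(median(a) - median(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        diff = abs(median(pooled[:len(a)]) - median(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (rounds + 1)


def compare_all(samples, alpha=0.05):
    """Test every unordered pair of repositories for a significant difference."""
    results = {}
    for left, right in combinations(sorted(samples), 2):
        p = permutation_p_value(samples[left], samples[right])
        results[(left, right)] = (p, p < alpha)
    return results


# Invented per-project values for one attribute, purely for illustration.
samples = {
    "Debian": [90, 120, 150, 200, 260, 310],
    "GNOME": [40, 55, 60, 75, 80, 95],
    "KDE": [42, 50, 63, 70, 85, 90],
    "RubyForge": [2, 3, 5, 6, 8, 9],
    "Savannah": [10, 14, 18, 22, 25, 30],
    "SourceForge": [9, 12, 20, 21, 27, 33],
}

results = compare_all(samples)
print(len(results), "comparisons")
for pair, (p, significant) in results.items():
    print(pair, round(p, 3), "significant" if significant else "not significant")
```

Using `itertools.combinations` guarantees each unordered pair is tested exactly once, which is where the figure of 15 comes from.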

As mentioned above, detailed figures, procedures and conclusions are available in the printed article. And it does not end there… later in the article we went further. The patterns we found among the repositories were formulated into a framework for organizing FLOSS repositories according to evolutionary characteristics. This may have an impact on individual projects existing in, and moving through, the ecosystem of repositories, which should be of interest to researchers and developers alike, I hope.