Techniques for Selection

Analysis steps

As we have seen before, this figure shows the stages of a typical approach to a post-hoc study of FLOSS, conducted in the manner of a digital archaeologist. The figure shows a series of stages, each of which includes some number of steps and yields some outcomes. Each outcome may or may not feed into the following stage. In this post I will discuss the selection stage. Remember that this is the method I have preferred so far, and one that a number of my peers have used, in whole or in part. It is not the method; it is merely a method.

Selection

The point of selection is to choose the metrics that will be used to measure the attributes you are interested in, and also to compose a list of projects to study. On selecting metrics, I cannot be especially general: they are very closely tied to the goals of your study, and you need to find an effective way of measuring your success at achieving those goals. My personal preference is the “Goal-Question-Metric” (GQM) approach. You can read about it elsewhere, but what attracts me to GQM is that it is a software-engineering-specific method that helps you arrive at the right measures by forming the questions that need answering to achieve the goals you have set. These questions can also “suggest” the hypotheses needed in your study. It is not perfect, but “Goal-Question-Metric” is a useful parallel to “Research Question-Hypotheses-Measures”. I do think it important to do metric selection first; the reason will become apparent in what I say next.
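
To make the chain concrete with a small illustration of my own devising (not an example drawn from the GQM literature): a goal of “understand the maintainability of a project’s code” might prompt questions like “how complex are the modules?” and “how often do they change?”, which in turn suggest metrics such as cyclomatic complexity and commits per module. A hypothesis then falls out quite naturally, e.g. “the most complex modules change most frequently”.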

Additionally, you will probably want to set some parameters for the investigation, usually to ensure it remains valid. For example, if you are looking into some aspect of, say, forum activity, then it probably makes no sense to include projects for which no forum activity exists. (At the same time, you should report what proportion of your initial sample is disqualified.) This may shrink the pool of projects you can choose from by eliminating some candidates, but it should not impact the metrics you choose — the investigation should be guided by what you want to measure, not by how easy something is to measure. Sometimes this is made quite simple for you. For instance, FLOSSmole is a service that provides you with meta-data about individual FLOSS projects in nicely-formatted lists. If you wish to prune such a list, it is easy to write a software tool to do it quickly for you, leaving only the “valid” candidates. Ask nicely and you could borrow mine.
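
To illustrate the kind of pruning tool I mean, here is a minimal sketch in Python (not my actual tool; the file name and column names are my own invention) that filters a hypothetical CSV list of project meta-data of the sort FLOSSmole provides:

    import csv

    # Hypothetical filter criteria; the column names below are assumptions,
    # so check them against the actual data dump you download.
    WANTED_LANGUAGE = "C"
    WANTED_STATUS = "Production/Stable"

    kept = []
    dropped = 0
    with open("projects.csv", newline="") as f:
        for row in csv.DictReader(f):
            if (row["language"] == WANTED_LANGUAGE
                    and row["dev_status"] == WANTED_STATUS
                    and int(row["forum_posts"]) > 0):  # no forum activity: disqualify
                kept.append(row["project_name"])
            else:
                dropped += 1

    # Report what proportion of the initial sample was disqualified.
    total = len(kept) + dropped
    print(f"Kept {len(kept)} of {total} projects ({dropped} disqualified)")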

So-called “filters” I have found myself testing for in past work have included:

  • Programming language
  • Version control system used
  • Product size
  • Development status

These factors can bear on the validity of the study (e.g. can projects written in different programming languages be compared fairly?) or on technical feasibility (e.g. do I have tools that can analyse these languages?). Both need careful thought.

Further considerations include your selection method, i.e. how you choose the projects to study. If you are examining a very small number of projects, be sure your choice has some careful thought behind it. Generalising from an analysis of just one or a couple of projects can be tricky; a more focused comparative analysis, such as the work by Schach et al. comparing four different Unix-like operating system kernels, is probably more productive at that level. If you seek to generalise about FLOSS as a phenomenon from your analysis, a number of works have now been carried out (including my own) that do so by analysing large samples of projects. In that case, I think the consensus is that random selection from a filtered population of projects is the best approach.
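
For that final case, the sampling itself is simple once you have the filtered population; a minimal sketch (the list and sample size here are placeholders, to be replaced by your own filtered list and a size justified by your study design):

    import random

    # Stand-in for the list of projects that survived the filters.
    filtered_projects = ["project-%03d" % i for i in range(500)]

    random.seed(42)    # fixed seed, so the sample is reproducible in the write-up
    SAMPLE_SIZE = 50   # hypothetical figure; size it to suit your study
    sample = random.sample(filtered_projects, SAMPLE_SIZE)
    print(sample[:5])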

And so the end of the selection process should be a list of projects you wish to analyse that feeds through to the next stage: retrieval.

A FLOSS Research Toolbox

It is remarkable how, when I look through my box of FLOSS research tools, so many of them are pre-existing tools written by others. In the toolbox (or, more precisely, in the directory called “tools”) there are also many self-authored programs and bits of glue code, usually put together in a scripting language, but nevertheless the overall contents of the toolbox are the result of my own strategy: when seeking to put a new tool in the box, the first thing I do is try to borrow my neighbour’s instead. Searching on the Web or through research papers may reveal a program that is already capable of doing what you need.

Here is a handful of them that I have found useful so far (and which I will blog about in future):

SLOCCount

A splendid little utility for counting the lines of executable code in a project, with additional capabilities for distinguishing between an abundance of programming languages, counting by directory, and even estimating the cost of producing the project!
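
Driving it from a script is straightforward too; a minimal sketch of the sort of wrapper I use (assuming sloccount is installed and on your PATH, and with a placeholder project path):

    import subprocess

    # Run sloccount over a source tree and capture its textual report.
    result = subprocess.run(
        ["sloccount", "/path/to/project"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)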

Doxygen

A documentation system usable for a handful of popular programming languages. You can construct documentation from markup embedded in the code (akin to Javadoc), or extract the structure from undocumented code. The various output formats, and even the intermediate files meant exclusively for Doxygen’s usage, mean that there’s lots of mining to be done by the keen researcher.

StatCVS/StatSVN

Retrieves information from a CVS/Subversion repository and generates various tables and charts describing the project development. Formats output in HTML and XML for parsing at your pleasure.

FLOSSmole

Less a tool, more a database produced by a web-crawling tool. FLOSSmole provides downloadable raw data about FLOSS projects in multiple formats.

What is noticeable about these (with the probable exception of FLOSSmole) is that I am using them in ways different from their mainstream usage. I will not presume to claim knowledge of the authors’ original intents, but did Doxygen’s project manager, Dimitri van Heesch, ever imagine that someone would be using the “useless” intermediate files left over by Doxygen (which are normally deleted) to perform complex coupling analyses? No doubt he would not be displeased if he knew, but I am sure his mind is most occupied with how it is put to its advertised purposes. On the other hand, it is very difficult to imagine my self-authored tools put to any use other than the ones for which I created them, which is why they exist in the first place: their purpose is so specific that nobody else has yet needed their capabilities.

My FLOSS research toolbox (and, I would venture, those of other FLOSS researchers) has been opportunistically built up over time. I think this is necessarily so. When the capability you want already exists in another program, you take it, in finest FLOSS tradition, thereby enjoying the many fruits of collaborative work: passing around new features, sharing bug-fixes, and preventing duplication of effort. When no tool can suit my purpose (or be adapted to do so), I must fill the gap with my own creation. There is no slightly revising the research goals to suit a “near enough” program I have found — tool availability must never dictate the direction of the research — so I just have to grit my teeth and hack out some code.

Then I feel like a real programmer again.

On the Veracity of Sources

When I want to learn about something in free/open source software more generally, there are a number of different types of source to look towards, each with its own advantages and considerations, and each with an intended audience. Knowing about all of these types of source is a good indicator of where to start looking. Judging the most suitable place from which to obtain veracious information about FLOSS reminds me a little of the same problem in science.

If you want to learn about the latest from the world of quantum physics, where do you go? A research paper? That is surely guaranteed to bring fresh news from the physics trenches, but it will certainly assume some domain knowledge on the part of the reader and a grasp of some sophisticated mathematics. Failing that, you could wait for someone to write a more understandable treatment of the subject — a magazine article in New Scientist will probably not take long to become available; but alternatively a book will contain more detail if you can wait. The nuances within such an example do not differ wildly from those you might observe with the same question in FLOSS.

So what types of sources exist for FLOSS and how are they useful or problematic? This is my take on the taxonomy.

Research works

Not too long after “open source” was coined as a phrase in 1998, serious research institutes began to look into it, hence we now have a decade’s worth of peer-reviewed research carried out by universities, institutes and other organisations. In addition to the many stand-alone works, there have been a number of research projects devoted to FLOSS that routinely publish their findings, including CALIBRE, SQO-OSS and FLOSSMetrics as examples in Europe alone. Some academic conferences, particularly those dedicated to software engineering, now even explicitly refer to FLOSS as a topic of research. Such publications are the things to look through for the latest information, but they typically assume a rather high level of familiarity with the concepts, and occasionally a good knowledge of maths and programming. They are normally published in conference proceedings and journals, which might cost a pretty penny to access.

Technical reports

Many organisations, typically with some particular expertise within FLOSS, release technically-oriented reports on their research or experiences. These may be released as part of wider research in which the organisation is involved, and the Internet is often the distribution medium these days. They are very much like research papers, but it is likely they have not been subjected to a wider peer-review process.

Books

Now the field really opens up. Many, many books have been authored on FLOSS over the last ten years, aimed both at specialists and at a more general audience. You can purchase an entire book dedicated to the sed stream editor, or you can read Eric Raymond’s general thesis on the open source movement (which I think is readable by any interested layperson), so the understandability of information within books is much more wide-ranging than that of peer-reviewed research papers. Rather as with research works, choosing your sources can depend on reputation (of the publisher as well as the author); having said that, if you are looking for a book as a place to begin your quest for information, you probably have insufficient first-hand knowledge of reputations. In such a case it would be prudent to check your choice with someone more “in the know”.

Magazine articles

Like books, magazine articles on FLOSS may be intended for the specialist (such as those found in the IEEE publication “Computer”) or for a more general audience, and may or may not appear in a software-oriented publication. Once again it is necessary to be discerning, weighing a combination of the author’s credentials, the depth of information given and the rigour demonstrated (for example, a FLOSS article from BBC News, however well written, is unlikely to be of sufficient depth to do anything more than spark an interest in an otherwise unfamiliar topic). One must also acknowledge the probable brevity of articles appearing in magazines or newspapers. I would suggest an article is a useful way to whet a newly developing appetite, or a quick way to keep up with the Joneses.

Websites, blogs, etc.

There are numerous and diverse resources on the Internet addressing free software, with greatly varying quality and intent, and, for organisations or projects concerned with FLOSS, they may be the primary method of contact and dissemination. Their reliability may be judged by the identities of the authors or publishing organisation, or by the opportunity for corroboration. Some websites and blogs are the Internet presence of authors accessible via other media (for example, a number of published FLOSS researchers maintain blogs and websites, such as Diomidis Spinellis, Paul Adams, Martin Krafft, and many more), but in a number of other cases the authors remain anonymous, which presents the difficulty of establishing their credentials. Even then, this does not necessarily preclude a website from being useful for research — Groklaw is one of the best-known websites devoted mostly to discussing FLOSS legal issues, and whilst its authors are anonymous it provides copies of, or references to, actual legal proceedings.

Now, not to destroy my taxonomy having just built it, but this is clearly not the only useful way of looking at it. I doubt whether one single class of person uses a single type of source to the exclusion of all others, and the types of sources are certainly not mutually exclusive in terms of their veracity: A well-written blog post by an expert has the potential to be a more veracious source than a book or article that achieves only mediocrity.

In search of the ultimate idiom with which to conclude, I’ll borrow from my British heritage and say “It’s all swings and roundabouts.”

Digital Archaeology

Part of the use to which I’d like to put this blog is to disseminate information about research methods and tools. But before I start writing posts with involved details, it’s probably prudent to present some sort of overview of the whole thing. Of course, there is no single method used by all computer scientists, although each method usually tries to approximate the scientific method as closely as possible. Hence, what I have to talk about is not the method used by all researchers, but it is a common one in the sub-field of free/open source software and software evolution.

It was, I think, Daniel German who first suggested the role of a software evolutionist — a kind of palaeontologist, or private investigator, of software. Like a detective or an archaeologist, the software evolutionist arrives at the scene. Before her is a program listing, thousands of lines long. She doesn’t know how it got to be in the state she finds it, but clues may be available for her to piece its development together.

Linux kernel growth

Besides the code, there’s the support documentation (maybe that will tell her how the program is meant to function). Also open on the computer is a forum where all the developers communicate (perhaps this will shed some light on what the developers were assigned). And on the server is a version control system, a treasure trove of clues that shows exactly which developers did what, and when they did it.

Unlike the detective we’re not trying to find a murderer, of course, but we are trying to piece together how the program developed over time, i.e. how it evolved. An early example of this was done by Michael Godfrey and Qiang Tu: with nothing but a load of historical releases of the Linux kernel between 1994 and 1999, they showed that the kernel grew at a super-linear rate (its size increases by ever larger amounts as time goes by) and identified which parts of the kernel were responsible for this surprising growth. (Spoiler: the portion of the kernel that contains device drivers was the biggest driver of this growth.)
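
To make “super-linear” concrete, here is a sketch (with made-up numbers, not Godfrey and Tu’s data) of how one might compare a linear and a quadratic least-squares fit; a markedly better quadratic fit is evidence that growth is super-linear:

    import numpy as np

    # Illustrative, invented data: days since first release vs. total LOC.
    days = np.array([0, 180, 360, 540, 720, 900, 1080, 1260])
    loc = np.array([100, 160, 250, 370, 520, 700, 920, 1170]) * 1000

    for degree in (1, 2):
        coeffs = np.polyfit(days, loc, degree)
        residuals = loc - np.polyval(coeffs, days)
        print(f"degree {degree}: residual sum of squares = {residuals @ residuals:.3g}")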

So how do software evolutionists do it? As I said I can’t speak for them all, but I’ll try to articulate an abstract version of the steps that I and others go through, and assume it approximates the experience of the rest.

Roughly speaking the typical steps involve:

  • Selection: Both of the project to study and the measures you wish to apply;
  • Retrieval: Getting hold of the software (not always easy!) and storing it appropriately;
  • Extraction: Parsing the raw data, extracting the pieces you are interested in, and constructing them into useful information;
  • Analysis: Applying the measures and performing your relevant test(s).

Analysis steps

In later posts in this category I’ll discuss the tools and techniques of each stage, and (hopefully) build up a picture of the method. For now, I’ll show in outline how an analysis of the Linux kernel’s size might fit this approach (taking cues from Godfrey and Tu’s study where possible).

  • Selection: The Linux kernel is selected as a large exemplary open source project. Because the size is the attribute of interest, the number of lines of code is taken as a measure of size. To be scientific we should form some testable hypotheses predicting what we expect to find.
  • Retrieval: Each kernel version release is available on the Linux Kernel Archives as a tar file. Godfrey and Tu downloaded 96 of the releases.
  • Extraction: Now the lines of code (LOC) are counted in each release. Godfrey and Tu applied the Unix command “wc -l” to all *.c and *.h files and used an awk script to ignore non-executable lines.
  • Analysis: By this point, there should be 96 numbers stored, each the size of a release in LOC. To get a visual, we can feed them into a plotting program and produce a nice graph like the one above; a rough sketch of this counting-and-plotting pipeline appears after the list. We could even go further and apply all sorts of fancy mathematics or models. Suffice it to say, by the end of this stage we should have some results that allow us to confirm or refute our earlier hypotheses.
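
Here is that sketch: a crude approximation of the extraction and analysis steps in Python (it counts every line in *.c and *.h files, without the comment-stripping of Godfrey and Tu’s awk script, and the releases/<version>/ directory layout is my own assumption):

    import pathlib
    import matplotlib.pyplot as plt

    # Assumes each release tarball has been unpacked into releases/<version>/ .
    releases = sorted(pathlib.Path("releases").iterdir())

    sizes = []
    for release in releases:
        loc = sum(
            f.read_text(errors="ignore").count("\n")  # crude count, comments included
            for pattern in ("*.c", "*.h")
            for f in release.rglob(pattern)
        )
        sizes.append(loc)

    # One point per release; super-linear growth shows as an upward curve.
    plt.plot(range(len(sizes)), sizes, marker="o")
    plt.xlabel("Release number")
    plt.ylabel("Lines of code (*.c and *.h)")
    plt.show()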

Once all this is done, we can then put forward our conclusions. As in any scientific study, the experimental data we have obtained is the evidence that backs them up.

In The Beginning…

Why write a blog?

Well, why not. It seems like everyone else is.

I’ve been racking my brains to decide what I have to blog about, or rather what is interesting enough to share with people. My field is computing; specifically, research. I’ve spent a few years researching free/open source software now, and I think I’ve got into the stride of things enough to start writing about it.

In this blog, most of the time I plan my entries to fall into one of three categories:

  1. Posts about my research: I’ll share my various little findings that might be of interest to people who want to understand more about free/open source. I’ll try to make them as easy to understand as possible — if you want the real technical treatment, I’ll point you to the technical paper.
  2. About approaches to research: I also want to pass on the methods and tools you can use to carry out research on software. I hope this will be of interest to practitioners as well as researchers.
  3. Videos: Another little pet project of mine (called Computer Floss) is to produce a series of videos for a general audience that explains all the various facets of open source. I’ve already begun, and you can see them over at:

http://youtube.com/user/directrod

Don’t ask why my username there is directrod.