Techniques for Selection

Analysis steps

As we have seen before, this figure shows the stages of a typical approach to a post-hoc study of FLOSS, in which the researcher works rather like a digital archaeologist. The figure shows a series of stages, each comprising a number of steps and yielding some outcomes, and each outcome may or may not feed into the following stage. In this post I will discuss the selection stage. Remember that this is the method I have preferred so far, and one that a number of my peers have used in whole or in part; it is not the method.

Selection

The point of selection is to choose the metrics that will be used to measure the attributes you are interested in, and also to compose a list of projects to study. On selecting metrics I cannot be especially general: they are closely tied to the goals of your study, and you need to find an effective way of measuring your success at achieving those goals. My personal preference is the “Goal-Question-Metric” (GQM) approach. You can read about it elsewhere, but what attracts me to GQM is that it is a software-engineering-specific method that helps you arrive at the right measures by forming the questions that need answering in order to achieve the goals you have set. These questions can also “suggest” the hypotheses needed in your study. It is not perfect, but “Goal-Question-Metric” is a useful parallel to “Research Question-Hypotheses-Measures”. I do think it important to do metric selection first; the reason will become apparent in what I say next.
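To illustrate (and nothing more), here is how a GQM breakdown might be written down as plain data before any measurement begins; the goal, questions, and metrics are invented for the example, not drawn from any real study:

    # A hypothetical GQM breakdown, written as plain data so it can sit
    # alongside the analysis scripts. The goal, questions and metrics
    # here are invented purely for illustration.
    gqm = {
        "goal": "Understand how forum activity relates to release frequency",
        "questions": {
            "Q1": "How active is each project's forum?",
            "Q2": "How often does each project make a release?",
        },
        "metrics": {
            "Q1": ["forum posts per month", "distinct posters per month"],
            "Q2": ["releases per year", "median days between releases"],
        },
    }

    for q_id, question in gqm["questions"].items():
        print(q_id, question)
        for metric in gqm["metrics"][q_id]:
            print("   measure:", metric)

The value of writing it out like this is simply that every metric can be traced back to a question, and every question back to the goal, before any data collection starts.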

Additionally, you will probably want to set some parameters for the investigation, usually to ensure it remains valid. For example, if you are looking into some aspect of, say, forum activity, then it probably makes no sense to include projects for which no forum activity exists. (At the same time, you should report what proportion of your initial sample is disqualified.) This may shrink the pool of projects you can choose from by eliminating some candidates, but it should not affect the metrics you choose: the investigation should be guided by what you want to measure, not by how easy something is to measure. Sometimes this is made quite simple for you. For instance, FLOSSmole is a service that provides meta-data about individual FLOSS projects in nicely-formatted lists. If you wish to prune such a list, it is easy to write a small tool to do it quickly for you, leaving only the “valid” candidates; a sketch of such a script appears after the list of filters below. Ask nicely and you could borrow mine.

So-called “filters” I have found myself testing for in past work have included:

  • Programming language
  • Version control system used
  • Product size
  • Development status

These factors can impact the validity of the study (e.g. can different programming languages be compared fairly?) or reflect technical constraints (e.g. do I have tools that can analyse these languages?). Both need careful thought.
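To make the pruning concrete, here is a minimal sketch of the kind of script I mean, assuming a FLOSSmole-style project list saved as a CSV file; the file name, column names, and accepted values are all hypothetical placeholders:

    # A minimal sketch of pruning a FLOSSmole-style project list.
    # The file name and column names are hypothetical; adjust them to
    # whatever the data dump you are working from actually provides.
    import csv

    ACCEPTED_LANGUAGES = {"C", "C++", "Java"}
    ACCEPTED_STATUSES = {"Production/Stable", "Mature"}

    def is_valid(project):
        """Apply the study's filters to one row of project meta-data."""
        return (project["language"] in ACCEPTED_LANGUAGES
                and project["vcs"] == "Subversion"
                and project["status"] in ACCEPTED_STATUSES)

    with open("projects.csv", newline="") as f:
        projects = list(csv.DictReader(f))

    valid = [p for p in projects if is_valid(p)]

    # Always report how much of the initial sample was disqualified.
    print(f"{len(valid)} of {len(projects)} projects pass the filters "
          f"({len(projects) - len(valid)} disqualified)")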

Further considerations include your selection method, i.e. how you choose the projects to study. If you are examining a very small number of projects, be sure your choice has some careful thought behind it. Generalizing from the analysis of just one or two projects is tricky; a more focused comparative analysis, such as the work by Schach et al. comparing four different Unix-like operating system kernels, is probably more productive at that scale. If you seek to generalize about FLOSS as a phenomenon, several studies (including my own) have now done so by analysing large samples of projects, and in that case I think the consensus is that random selection from a filtered population of projects is the best approach.
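Continuing the pruning sketch above (the filtered list called “valid”), random selection from that population then takes only a couple of lines; the sample size here is arbitrary, and the “name” column is again a hypothetical placeholder:

    import random

    # "valid" is the filtered project list from the earlier sketch.
    SAMPLE_SIZE = 50  # arbitrary for the example
    sample = random.sample(valid, min(SAMPLE_SIZE, len(valid)))
    for project in sample:
        print(project["name"])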

And so the end of the selection process should be a list of projects you wish to analyse that feeds through to the next stage: retrieval.

A FLOSS Research Toolbox

It is remarkable how, when I look through my box of FLOSS research tools, so many of them are pre-existing tools written by others. The toolbox (or, more precisely, the directory called “tools”) also contains many self-authored programs and bits of glue code, usually put together in a scripting language, but the overall contents are nevertheless the result of my own strategy: when seeking to put a new tool in the box, the first thing I do is try to borrow my neighbour’s instead. Searching the Web or the research literature may reveal a program that is already capable of doing what you need.

Here is a handful of them that I have found useful so far (and which I will blog about in future):

SLOCCount

A splendid little utility for counting the physical source lines of code in a project, with additional capabilities for distinguishing between an abundance of programming languages, counting by directory, and even estimating the cost of producing the project!
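When driving SLOCCount from a script rather than reading its report by eye, the simplest approach I know is to run it per project and pick the total out of the plain-text report. A sketch follows; the wording of the summary line is from memory of sloccount’s output, so check it against the version you have installed:

    # Run sloccount over a project checkout and extract the headline
    # figure. The summary-line text is taken from sloccount's plain-text
    # report; verify it against the installed version.
    import subprocess

    def total_sloc(project_dir):
        report = subprocess.run(["sloccount", project_dir],
                                capture_output=True, text=True,
                                check=True).stdout
        for line in report.splitlines():
            if line.startswith("Total Physical Source Lines of Code"):
                return int(line.split("=")[1].strip().replace(",", ""))
        raise ValueError("summary line not found in sloccount output")

    print(total_sloc("checkouts/some-project"))  # hypothetical path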

Doxygen

A documentation system that works with a number of popular programming languages. You can construct documentation from markup embedded in the code (akin to Javadoc), or extract the structure of undocumented code. The various output formats, and even the intermediate files meant only for Doxygen’s own use, mean there is plenty of mining to be done by the keen researcher.
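As a rough illustration of using it to recover structure rather than documentation, the sketch below writes a minimal configuration, switches on XML output, runs Doxygen over a hypothetical src directory, and lists the classes it found; the XML element names are from memory and worth verifying against your Doxygen version:

    # Write a minimal Doxygen configuration (unspecified options fall
    # back to Doxygen's defaults), run it over a source tree, and list
    # the classes recovered from the XML index. The "src" path is a
    # hypothetical placeholder; the XML element names are from memory.
    import subprocess
    import xml.etree.ElementTree as ET

    config = """
    INPUT          = src
    RECURSIVE      = YES
    EXTRACT_ALL    = YES
    GENERATE_XML   = YES
    GENERATE_HTML  = NO
    GENERATE_LATEX = NO
    """
    with open("Doxyfile", "w") as f:
        f.write(config)

    subprocess.run(["doxygen", "Doxyfile"], check=True)

    # xml/index.xml lists every entity Doxygen found, with its kind.
    index = ET.parse("xml/index.xml").getroot()
    classes = [c.findtext("name") for c in index.findall("compound")
               if c.get("kind") == "class"]
    print(f"{len(classes)} classes found")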

StatCVS/StatSVN

Retrieves information from a CVS/Subversion repository and generates various tables and charts describing the project’s development. Output is formatted as HTML and XML for parsing at your pleasure.
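Driving StatSVN is a two-step affair: ask Subversion for a verbose XML log, then hand that log and the checkout to StatSVN. A sketch, with hypothetical paths and jar location:

    # Produce a StatSVN report for a checked-out working copy.
    # The checkout path and jar location are hypothetical placeholders.
    import os
    import subprocess

    checkout = "checkouts/some-project"
    log_path = os.path.abspath("svn.log")

    # Step 1: verbose XML log of the working copy's repository history.
    with open(log_path, "w") as log:
        subprocess.run(["svn", "log", "-v", "--xml"], cwd=checkout,
                       stdout=log, check=True)

    # Step 2: feed the log and the checkout to StatSVN.
    subprocess.run(["java", "-jar", "statsvn.jar", log_path, checkout],
                   check=True)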

FLOSSmole

Less a tool than a database produced by a web-crawling tool. FLOSSmole provides downloadable raw data about FLOSS projects in multiple formats.

What is noticeable about these (with the probable exception of FLOSSmole) is that I am using them in ways different from their mainstream usage. I will not presume to claim knowledge of the authors’ original intents, but did Doxygen’s author, Dimitri van Heesch, ever imagine that someone would use the “useless” intermediate files left over by Doxygen (which are normally deleted) to perform complex coupling analyses? No doubt he would not be displeased if he knew, but I am sure his mind is mostly occupied with how the tool is put to its advertised purposes. On the other hand, it is very difficult to imagine my self-authored tools being put to any use other than the ones for which I created them, which is why they exist in the first place: their purpose is so specific that nobody else has yet needed their capabilities.

My FLOSS research toolbox (and, I would venture, those of other FLOSS researchers) has been built up opportunistically over time. I think this is necessarily so. When the capability you want already exists in another program, you take it, in finest FLOSS tradition, thereby enjoying the many fruits of collaborative work: new features passed around, bug-fixes shared, duplication of effort avoided. When no tool suits my purpose (or can be adapted to do so), I must fill the gap with my own creation. There is no quietly revising the research goals to suit a “near enough” program I have found; tool availability must never dictate the direction of the research. I just have to grit my teeth and hack out some code.

Then I feel like a real programmer again.