Table of Contents
Science seeks to explain regularity. Its subject are recurrent phenomena. Its aim is the identification of rules that explain such phenomena. Irregular phenomena or "miracles" -- may they exist or not -- are outside the scope of science. To ensure regularity, the scientific community has developed the paradigm of reproducibility: if a researcher claims to have discovered a formerly unknown regularity, he is obliged to report his observations in a level of detail that would allow anyone to repeat his experiment, thereby enforcing or rejecting the original discovery.
Being modelers, we seek to derive and explain the emergence of recurrent phenomena from underlying rules. Some philosophers of science even go as far as to call models and computer simulations the third pillar of science, raising them to the same importance as the well-established domains to theory and experiment. Yet, most modelers I encounter tend to work in a way that thwarts easy reproduction of their results. This article introduces to some software tools and modelling standards that help me to ensure reproducibility in my daily work.
An example setup
In this article, I will analyze the well-known Lorenz attractor, which is the educational prototype of a chaotic system. The Lorenz attractor results from a three-dimensional system of coupled nonlinear ordinary differential equations. Rather than writing my own numerical integrator, I will employ the program ode which implements several general purpose integrators and is shipped with the plotutils debian package, but should be available for other operating systems as well. To run ode non-interactively, you provide it a short input file with the description of the differential equation, initial condition and output directives. For our system, the input-file reads
x' = -3*(x-y) y' = -x*z + r*x-y z' = x*y - z r = 26 x = 0 y = 1 z = 0 print t,x,y,z step 0,200
Save this program as lorenz.ode. To get the trajectory, we need to call ode like
$ ode < lorenz.odeThis will simply print the time evolution of the system state to stdout which is rather useless, since we usually want to process the output. So let us save it instead:
$ ode < lorenz.ode > lorenz.dat
We can now inspect this data in a graphical plotting program. My favorite is gnuplot. To plot the data, simply start gnuplot interactively with
$ gnuplot(Ctrl-D will terminate gnuplot later) and type:
gnuplot> splot "lorenz.dat" using 2:3:4 with linessplot is the gnuplot command for 3-dimensional plots and the using directive tells gnuplot to get the coordinates from the second, third, and forth column of the data file which we had specified above in lorenz.ode. There is a noteworthy shortcut, here: if you type
gnuplot> splot "< ode < lorenz.ode" using 2:3:4 with lines
gnuplot will execute the program ode lorenz.ode and plot its output immediately without the need for a static data file. Either way, your screen should show a window like this:
Anyway, it is nice to have the trajectory on the screen, but we usually want to have a more persistent representation, say as a postscript file. You can command gnuplot to print postscript to a file with the lines
gnuplot> set terminal postscript color solid gnuplot> set output "lorenz.ps" gnuplot> replot
So far so good. We have our publication ready graphics and we could include it in our draft. This was easy. But what if you later decide that you want to change the initial condition or a model parameter? You will need to remember and redo all the steps that you did in the first place. While this is easy when the steps have been only a few, and you need to redo them only next day, but reality isn't as grateful. This is where make enters the stage.
The basics of make
Make is a classic unix program designed for software development, but it can actually be used for arbitrary task automatization. make is controlled by an input file called Makefile. This control file holds rules that define how any file depends on and can be created by other files. Upon startup, make reads the control file and performs all necessary steps to build a given target.
Unfortunately, the syntax of Makefiles has been designed to be brief to the cost of being unintuitive. Basically a Makefile consists of variable definitions and rules. Here is how a rule for the generation of lorenz.dat looks like:
lorenz.dat : lorenz.ode
ode < lorenz.ode > lorenz.dat
The first line tells make that the file lorenz.dat depends on lorenz.ode
and needs to be rebuilt whenever the latter one changes (This is done by the file's time stamp:
whenever the prerequisite is newer than the target, the target will be redone).
The second line tells make how to build the file, and it is simply the command line we
had entered earlier. One particular flaw of make is that commands must be indented
by tabulators! It is forbidden to use spaces instead. This makes Makefiles particularly nasty
for editors that transform between tabs and spaces. This also implies that you cannot copy and
paste code samples form this site into your editor.
The actual grammar for rules is slightly more general than the example above:
target : prerequisite_1 prerequisite_2 ...
rule 1
rule 2
...
As you notice, there can be more than one prerequisite and there can be more than one rule.
When multiple prerequisites are present, make rebuilds the target when any of the
prerequisites has changed. The commands are executed one after the other, each of them as a
separate shell process. (This implies that you cannot define and use shell variables over multiple
lines)
Saving the above rule in a Makefile enables you to invoke the command
$ make lorenz.datand make will start its magic. If you invoke make without command line arguments it will try to build the first target it encounters in the Makefile. So, its good practice to put the main target up front.
And now comes why I like gnuplot: just as ode, gnuplot can be invoked non-interactively by passing in a script file. We simply need to write down the earlier interactive input into a file that we call lorenz.gpl:
set terminal postscript color solid splot "lorenz.dat" using 2:3:4 with linesNote that I removed the output file specification. This way, gnuplot will simply print the postscript data to stdout and we can capture it in the command line call or the Makefile command. Having written down the gnuplot script, we are ready to include a new rule into our Makefile:
lorenz.ps : lorenz.gpl lorenz.dat
gnuplot lorenz.gpl > lorenz.ps
Our example makefile now contains two rules: one for the generation of lorenz.dat, the other for the generation of lorenz.ps. At any moment, when we type 'make lorenz.ps' into the terminal, make will check if it needs to be redone. The prerequisite list of target lorenz.ps tells that its status depends on lorenz.dat. make recursively checks the target lorenz.dat and learns that it depends on lorenz.ode. Suppose that we have changed some parameters in lorenz.ode. This initiates make to first compute lorenz.dat, and second - since the renewed loenz.dat is now the newest file in town - to redraw lorenz.ps. Wherever you make a change in the system, everything that depends on it will be recursively rebuild.
make also povides automatic variables that can be used to write rules less repetitive. Among the many variables, the following three are particularly usefull: $@ denotes the target of a rule, $^ the list of prerequisites, and $< denotes the first prerequisite in a list. With these, our complete Makefile reads:
lorenz.ps : lorenz.gpl lorenz.dat
gnuplot $< > $@
lorenz.dat : lorenz.ode
ode < $< > $@
It is good practice to also define a rule that will delete all automatically generated content.
But be carefull here with files that took long time to compute: just because you can repeat
their generation does not mean that you actually want to do it. A solution is to have two cleanup
rules, one that preserves computational demanding files and one that does not:
cleanall : clean
rm -f lorenz.dat
clean :
rm -f lorenz.ps
Implicit rules
When you use this approach for a while, you will find that you will repeat your steps very often. For example, the task to use a gnuplot script to visualize some data, is rather repetitive in my daily work. Such patterns can be expressed in Makefiles by using implicit rules. Implicit rules use wild cards in their targets and pre-requisites, which makes them into templates for generalized tasks. Here is how our example from the last section looks when written as implicit rules:
%.ps : %.gpl %.dat
gnuplot $< > $@
%.dat : %.ode
ode < $< > $@
which litterally translates to "to make any postscript file from a gnuplot script and data file with the same suffix, do ..." and "to make any data file from an equally named ode script, do ..."
One use case for implicit rules is to create a general purpose Makefile with task definitions you repeat in many different projects. Make has an include mechanism that allows you to embed such general purpose Makefiles in any project related Makfile (see the make documentation for details about this).
Another, probably more relevant use in the context of computer assisted modeling is the generation of parameters runs. Suppose your simulation program expects some parameters as command line arguments. Let us assume that lorenz.ode has been changed to read its parameter r from the command line (here is how this can be done). Then the following rule can be used to generate trajectories for given values of r
lorenz_r_%.dat : lorenz.ode
ode < lorenz.ode $* >$@
New in this rule is the automatic variable $*. Its associated value is the stem of the implicit rule, i.e. the text that '%' matches against when comparing the rule to the specified target. For example, the call 'make lorenz_r_25.dat' will cause make to run 'ode < lorenz.ode 25 >lorenz_r_25.dat'.
Suppose you want to show the trajectories of several attractors, you could add the rules
figures : lorenz_r_20.ps lorenz_r_24.ps lorenz_r_28.ps lorenz_r_30.ps
lorenz_r_%.ps : lorenz.gpl lorenz_r_%.dat
gnuplot $< >$@
to the Makefile, et voilá, 'make figures' will do what you need.
Variables
In the last section, I said that make can be used to automate parameter runs. To facilitate that, the introduction of variables becomes handy. We have already encountered automatic variables ($<, $^, $@, and $*) which get evaluated automatically. Ordinary variables can be defined and used similar to shell variables. To define a variable, use the '=' sign:
r_values = 20 21 22 23 24 25 26 27 28 29 30
Or, if you have many parameters, the usage of the shell command seq might be more convinient to you:
r_values = $(shell seq 100)
Which generates a sequence from 0 to 99. The value of r_values will simply be the string on the right hand side of the equal sign. To use r_values, write $r_values.
We want to use the variable r_values to automatically construct a list of targets for the implicit rule of the last section (which generates trajectory files for one specific value of r). Here is how to do it: Make offers a powerful mechanism of variable substitutions. The following construct takes the list r_values, surrounds every item with the prefix lorenz_r_ and suffix .dat, and assigns the outcome to the variable lorenz_r_data which is the list of target names we wanted:
lorenz_r_data = ${r_values:%=lorenz_r_%.dat}
The variable lorenz_r_data now evaluates to lorenz_r_20.dat lorenz_r_21.dat lorenz_r_22.dat a.s.o. We are ready to use it in a prerequisite list:
trajectories : $lorenz_r_dataIf you like to send your computer busy now, 'make trajectories' will create you 10 or 100 files depending on which of the two definitions of r_values you used. And of course, its a one-liner to change the number of runs you want to perform. Another advantage is that make would not repeat the creation of e.g. the file lorenz_r_20.dat since it has been created already when computing the figures in the last section.