Jasper Slingsby
Especially for an operational forecast system…
Reproducible research makes use of modern software tools to share data, code, etc., so that others can reproduce the same result as the original study, making all analyses open and transparent.
I’m going to try to keep this brief…
For more on the need for and benefits of working reproducibly and more detailed coverage of things like the data life cycle, see https://www.ecologi.st/data-management/
For more info on available tools etc (especially w.r.t. forecasting) see https://projects.ecoforecast.org/taskviews/reproducible-forecasting-workflows.html
“Five selfish reasons to work reproducibly” (Markowetz 2015)
‘Data Pipeline’ from xkcd.com/2054, used under a CC-BY-NC 2.5 license.
Working reproducibly requires careful planning and documentation of each step in your scientific workflow from planning your data collection to sharing your results.
“A Beginner’s Guide to Conducting Reproducible Research” (Alston and Rick 2021):
Complexity - There’s a learning curve in getting to know the tools
Technological change - Hardware and software change over time, making it difficult to rerun old analyses
Human error - Simple mistakes or poor documentation can easily make a study irreproducible
Intellectual property rights - Self-interest can lead to hesitation to share data and code
Most of these are being addressed by the growing culture of reproducible research, with more and more tools available to help researchers work reproducibly.
Entail overlapping/intertwined components, namely:
Specific forecasting requirements:
“The fun bit”, but again, there are many things to bear in mind and keep track of so that your analysis is repeatable. This is largely covered by the sections on Coding and code management and Computing environment and software below
Artwork @allison_horst
Project files and folders can get unwieldy fast and really bog you down!
The main considerations are:
Most projects have similar requirements
Here’s how I usually manage my folders:
“Point-and-click” software may seem easier, but you’ll regret it in the long run, e.g. when you have to rerun your analysis.
R, Python, etc are open source and allow you to do almost any analysis in one workflow, even calling other software.
Coding is communication. Messy code is bad communication. Bad communication hampers collaboration and makes it easier to make mistakes…
Streamline, collaborate, reuse, contribute, and fail safely…
It’seasytowritemessyindecipherablecode!!! - Write code for people, not computers!!!
Check out the Tidyverse style guide for R-specific guidance, but here are some basics:
#Header indicating purpose, author, date, version etc
#Define settings and load required libraries
#Read in data
#Wrangle/reformat/clean/summarize data as required
#Run analyses (often multiple steps)
#Wrangle/reformat/summarize analysis outputs for visualization
#Visualize outputs as figures or tables
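As a minimal sketch of this layout, the script below follows the same steps using only base R. The built-in `cars` dataset, the `min_speed` setting, the variable names and the linear model are all assumptions chosen purely for illustration.

```r
# Purpose: illustrate a tidy script layout (hypothetical example)
# Author, date, version: fill in for your own project

# Define settings and load required libraries
min_speed <- 5  # hypothetical setting: drop records slower than this

# Read in data (a built-in dataset stands in for reading a file)
raw_data <- datasets::cars

# Wrangle/reformat/clean/summarize data as required
clean_data <- subset(raw_data, speed >= min_speed)

# Run analyses
fit <- lm(dist ~ speed, data = clean_data)

# Wrangle/reformat/summarize analysis outputs for visualization
predictions <- data.frame(
  speed = clean_data$speed,
  fitted_dist = fitted(fit)
)

# Visualize outputs as figures or tables
plot(dist ~ speed, data = clean_data,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
lines(predictions$speed, predictions$fitted_dist)
```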
Version control tools can be challenging, but also hugely simplify your workflow!
The advantages of version control:
Code is shared in repositories (“repos”) or gists (code snippets)
You can work offline by cloning the repo to your local PC, then “push to” or “pull from” the online repo to keep versions in sync
Changes are tracked through commits
Changes are committed with a commit message, creating a recoverable version that can be compared or reverted
Others can build on your code by forking repos and working on their own branch
Changes are proposed via pull requests
Repo owners can accept and integrate changes seamlessly by reviewing and merging the forked branch back to the main branch
Commits and pull requests provide a written record of changes and track the user, date, time, etc, all of which are useful for tracking mistakes and blaming when things go wrong
You can assign, log and track issues and feature requests
Interestingly, all that is tracked are the commits, whereby versions are named (the nodes in the image), so all that the online Git repo records is the figure below. The black is the OWNER’s main branch and the blue is the COLLABORATOR’s fork.
Artwork by @allison_horst CC-BY-4.0
Sharing your code and data is not enough to maintain reproducibility…
Software and hardware change between users, with upgrades, versions or user community preferences!
You can document the hardware and versions of software used so that others can recreate that computing environment if needed.
In R, this can be done with the sessionInfo() function, giving the details below:
R version 4.4.3 (2025-02-28)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Africa/Johannesburg
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.5.2
loaded via a namespace (and not attached):
[1] vctrs_0.6.5 cli_3.6.5 knitr_1.50 rlang_1.1.6
[5] xfun_0.51 generics_0.1.4 jsonlite_2.0.0 labeling_0.4.3
[9] glue_1.8.0 htmltools_0.5.8.1 scales_1.4.0 rmarkdown_2.29
[13] grid_4.4.3 evaluate_1.0.3 tibble_3.3.0 fastmap_1.2.0
[17] yaml_2.3.10 lifecycle_1.0.4 compiler_4.4.3 dplyr_1.1.4
[21] RColorBrewer_1.1-3 pkgconfig_2.0.3 rstudioapi_0.17.1 farver_2.1.2
[25] digest_0.6.37 R6_2.6.1 tidyselect_1.2.1 dichromat_2.0-0.1
[29] pillar_1.10.2 magrittr_2.0.3 withr_3.0.2 tools_4.4.3
[33] gtable_0.3.6
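One way to keep such a record alongside your outputs is to write it to a text file at the end of each run. The sketch below uses only base R; the file name `session_info.txt` and the use of a temporary directory are assumptions for illustration.

```r
# Capture the printed session information as a character vector
session_lines <- capture.output(sessionInfo())

# Write it to a text file (hypothetical location; in practice you would
# probably use your project's output folder instead of tempdir())
out_file <- file.path(tempdir(), "session_info.txt")
writeLines(session_lines, out_file)

# Read back the first line, which typically reports the R version
readLines(out_file, n = 1)
```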
If your entire workflow is within R, you can use the renv package to manage your R environment.
renv is essentially a package manager.
It creates a snapshot of your R environment, including all packages and their versions, so that anyone can recreate the same environment by running renv::restore()
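A typical renv session might look like the sketch below. It assumes the renv package is installed and that the commands are run from the project's root directory. Note that these calls modify the project by creating a lockfile (`renv.lock`) and a project-specific package library.

```r
# Set up renv for the current project:
# creates a project-specific library and an initial renv.lock
renv::init()

# After installing or updating packages, record their versions in renv.lock
renv::snapshot()

# On another machine (or later on), reinstall the versions recorded in renv.lock
renv::restore()
```

Committing `renv.lock` to version control alongside your code is what lets collaborators run `renv::restore()` against the same package versions.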
Disadvantages are that it doesn’t manage for:
Use containers like those provided by software such as Docker or Singularity.
Containers provide “images” of contained, lightweight computing environments that you can package with your software/workflow to set up virtual machines with all the necessary software and settings etc.
You set your container up to have everything you need to run your workflow (and nothing extra), so anyone can download (or clone) your container, code and data and run your analyses perfectly every time.
Containers are usually based on Linux, because other operating systems are not free.
The Rocker project provides a set of Docker images for R and RStudio, which are widely used in the R community.
This is covered by data management, but suffice to say there’s no point working reproducibly if you’re not going to share all the components necessary to complete your workflow…
Another key component here is that ideally all your data, code, publication etc are shared Open Access
The key to iterating your workflow, especially for forecasting.
Many options!
targets
The project aims to develop a near-real-time satellite change detection system for the Fynbos Biome using an ecological forecasting approach (www.emma.eco).
The workflow is designed to be run on a weekly basis, with new data ingested and processed automatically.
There are several steps, each of which is run automatically:
Outputs a Quarto website, automatically built from a GitHub repository.
Processing and analysis done in R. Intermediate and final outputs stored as GitHub releases or in GitHub Large File Storage.
R workflow managed by the targets package
GitHub Actions used to automate and run the workflow
Docker container sets up the computing environment
All code, data, metadata, etc are shared on GitHub
targets Workflow
targets workflow from https://wlandau.github.io/targets-tutorial/#8
targets is an R package that allows you to define a workflow as a series of steps, each of which can be run automatically.
The package identifies which steps are out of date and runs them and their dependencies, but ignores unaffected steps, saving computation.
In EMMA, the workflow is defined as a series of R scripts, which are run automatically by GitHub Actions on a weekly basis, triggered by a GitHub runner. targets keeps track of and controls which steps have been run and which need to be rerun depending on new data inputs, etc.
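As a rough sketch of what such a pipeline definition can look like, the `_targets.R` file below declares three steps with `tar_target()`. The target names and the use of the built-in `cars` dataset are assumptions for illustration and are not taken from the EMMA workflow.

```r
# _targets.R: pipeline definition read by the targets package
library(targets)

# Each tar_target() names a step and gives the R code that builds it.
# Dependencies between steps are detected from the target names used
# inside each command (e.g. `model` depends on `raw_data`).
list(
  tar_target(raw_data, datasets::cars),
  tar_target(model, lm(dist ~ speed, data = raw_data)),
  tar_target(r_squared, summary(model)$r.squared)
)
```

Running `targets::tar_make()` from the project directory builds any steps that are out of date and skips the rest, and `targets::tar_read(r_squared)` retrieves a stored result.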
testthat and RUnit