A preprint for our Go-based workflow library, SciPipe, is out, with the title "SciPipe - A workflow library for agile development of complex and dynamic bioinformatics pipelines", co-authored by me and colleagues at pharmb.io: Martin Dahlö, Jonathan Alvarsson and Ola Spjuth. Access it here.
It has been more than three years since the first commit to the SciPipe Git repository in March 2015, and development has proceeded at varying intensity during these years, often alongside other duties at pharmb.io and NBIS, and often at a slower pace than I might have wished. On the other hand, this might also have helped design ideas mature well before we implemented them.
There is definitely no lack of workflow tools in bioinformatics, as lists like this attest, and a core set of tools has crystallized as popular among a wide range of users. So why would we still spend all this time on developing yet another workflow library?
Well, let me start by saying that I think development of new tools and libraries can serve more than one purpose. Despite the many tools available, users still run into plenty of limitations with them. Furthermore, I think new tools and approaches don't necessarily need to be seen as competitors to other tools, or as fragmenting the existing tool landscape, but rather as experimentation. Successful approaches could later be adopted by other tools too, which might not be as free to experiment when they already have a large user base. A kind of research, that is. The many frustrations some users experience with workflows suggest that this is a much-needed field of research as well, to make computational science more robust, reproducible and also understandable.
With that said, I think the easiest way to explain the development of SciPipe is to briefly tell the story of our use of, and research on, workflow tools over the last couple of years. Here it comes.
We have spent the last couple of years using workflow tools to build predictive machine learning models in early drug discovery. Based on extensive research of existing tools, we first settled on the Luigi library for this task. We quickly learned, though, that machine learning workflows have specific requirements not common in general bioinformatics pipelines, such as a very high number of tasks due to nested parameter sweeps and cross-validation folds, and the need to parametrize the final training with hyperparameters optimized during the workflow run, creating the need for so-called dynamic scheduling.
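To give a feeling for how quickly nested sweeps and cross-validation folds multiply the task count, here is a minimal sketch in plain Go. The `Task` type, the `expandSweep` function and the `Cost`/`Gamma` parameter names are hypothetical, chosen only for illustration; they are not taken from Luigi, SciLuigi or SciPipe.

```go
package main

import "fmt"

// Task is one unit of work in a hypothetical model-building workflow:
// train and evaluate a model for one parameter combination on one fold.
type Task struct {
	Cost  float64 // hypothetical hyperparameter
	Gamma float64 // hypothetical hyperparameter
	Fold  int     // cross-validation fold index
}

// expandSweep returns one task per (cost, gamma, fold) combination.
func expandSweep(costs, gammas []float64, folds int) []Task {
	tasks := make([]Task, 0, len(costs)*len(gammas)*folds)
	for _, c := range costs {
		for _, g := range gammas {
			for f := 0; f < folds; f++ {
				tasks = append(tasks, Task{Cost: c, Gamma: g, Fold: f})
			}
		}
	}
	return tasks
}

func main() {
	costs := []float64{0.01, 0.1, 1, 10, 100}
	gammas := []float64{0.001, 0.01, 0.1}
	tasks := expandSweep(costs, gammas, 10)
	fmt.Println(len(tasks)) // 5 costs * 3 gammas * 10 folds = 150 tasks
}
```

Even this small grid yields 150 tasks for a single dataset, and that is before the final training step, whose parameters are only known once all of these tasks have finished.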
Luigi did not fully live up to our requirements, so we tried to fix the situation by developing a helper library, SciLuigi (previously published in J Cheminf). After a year of using Luigi+SciLuigi, though, we started to experience limitations with this solution too. It still lacked dynamic scheduling, and we experienced performance and robustness problems with Luigi’s interpreted Python core, which slowed down development a lot.
To solve these remaining problems, we took our experiences from SciLuigi and applied them to a flow-based programming (FBP) inspired scheduler in Go (if you can call it a scheduler, as scheduling happens implicitly as part of the dataflow/FBP design!), resulting in the SciPipe library.
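The idea of implicit scheduling can be sketched with nothing but goroutines and channels. The example below is not SciPipe's API; it is a minimal illustration in plain Go, and all names (`Result`, `evaluate`, `sweep`, `selectBest`, `runWorkflow`) as well as the toy scoring function are assumptions made for this sketch. Each process runs as soon as data arrives on its input channel, and the final step receives a parameter that is only determined during the run, which is roughly what dynamic scheduling is about.

```go
package main

import (
	"fmt"
	"sync"
)

// Result pairs a candidate parameter with the score it achieved.
type Result struct {
	Param float64
	Score float64
}

// evaluate is a stand-in for training and scoring a model.
// The toy score peaks at param == 1.
func evaluate(p float64) float64 {
	d := p - 1.0
	return -d * d
}

// sweep is a process: it scores every parameter it receives.
func sweep(in <-chan float64, out chan<- Result, wg *sync.WaitGroup) {
	defer wg.Done()
	for p := range in {
		out <- Result{Param: p, Score: evaluate(p)}
	}
}

// selectBest consumes all results and emits the best parameter
// once its input channel has been closed.
func selectBest(in <-chan Result, out chan<- float64) {
	defer close(out)
	var best Result
	found := false
	for r := range in {
		if !found || r.Score > best.Score {
			best, found = r, true
		}
	}
	if found {
		out <- best.Param
	}
}

// runWorkflow wires the processes together and returns the parameter
// that the final training step would use.
func runWorkflow(params []float64, workers int) float64 {
	paramCh := make(chan float64)
	resultCh := make(chan Result)
	bestCh := make(chan float64)

	// Source process: feed the parameter grid into the network.
	go func() {
		defer close(paramCh)
		for _, p := range params {
			paramCh <- p
		}
	}()

	// Several sweep processes share the same input and output channels.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go sweep(paramCh, resultCh, &wg)
	}
	go func() {
		wg.Wait()
		close(resultCh)
	}()

	go selectBest(resultCh, bestCh)

	// The "final training" step blocks here until the optimized
	// parameter arrives; no explicit scheduler decides when it runs.
	return <-bestCh
}

func main() {
	best := runWorkflow([]float64{0.1, 0.5, 1, 2, 10}, 3)
	fmt.Println("final model trained with param", best)
}
```

Nothing in this sketch decides the execution order up front: the channel connections are the only coordination mechanism, and the whole workflow, including the step that depends on a run-time result, stays in one definition.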
We have since used SciPipe for the last year to build machine learning models in drug discovery, and are so far extremely pleased with our experiences. The performance problems are gone, and the dynamic scheduling lets us keep the workflow definition integrated, so we can produce coherent audit logs and workflow graphs, among other things.
We thus think SciPipe can fill an important role for complex workflows in machine learning in particular, but we have also made sure that it works well for more traditional genomics and transcriptomics applications, to avoid the need to use different workflow tools for each problem. More info on this is available in the preprint.