In our work on automating machine learning computations in cheminformatics with scientific workflow tools, we have come to realize something: dynamic scheduling in scientific workflow tools is very important and sometimes badly needed.
What I mean is that it should be possible to schedule new tasks during the execution of a workflow, not just in its scheduling phase.
What is striking is that far from all workflow tools allow this. Many tools strictly separate a workflow run into two stages:
- Scheduling tasks to be run
- Executing those tasks
The case where we have really needed this was when running machine learning algorithms on data sets of varying sizes. To obtain optimal models, we first optimize the cost parameter of our training step by running a parameter sweep over a set of cost values. The performance of training with each cost value is then evaluated, and an optimal cost is chosen. And now comes the interesting part: we want to schedule a predefined workflow with this newly selected cost value. This is not easily done in Luigi, even with our SciLuigi extension, since Luigi separates scheduling and execution, and since parameters to workflows must be initialized at scheduling time. Thus we cannot use a value computed within a workflow run to start the next task inside the same workflow.
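To make the limitation concrete, here is a minimal sketch in plain Python (not Luigi's real API; all names are illustrative) of the two-phase model: the full set of tasks and their parameters is fixed when the graph is built, before anything runs, so a value computed during execution cannot be used to schedule a new task.

```python
# Hypothetical sketch of a scheduler that separates scheduling from
# execution: the complete task list must be known before execution starts.

def build_graph(cost_values):
    # Scheduling phase: every task and parameter is fixed up front.
    tasks = [("train", c) for c in cost_values]
    tasks.append(("evaluate", None))
    return tasks

def execute(tasks):
    # Execution phase: we can only run what was already scheduled.
    results = {}
    for name, cost in tasks:
        if name == "train":
            results[cost] = (cost - 0.3) ** 2  # stand-in for a training run
        elif name == "evaluate":
            best = min(results, key=results.get)
            # Here we would like to schedule a follow-up task with `best`,
            # but the graph is already fixed -- we can only return the value.
            return best

best_cost = execute(build_graph([0.1, 0.3, 1.0, 3.0]))
print(best_cost)  # -> 0.3
```

The sweep runs and picks the best cost, but there is no way to feed that value back into the (already frozen) graph within the same run.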
Of course we found a work-around for this: we created a task that takes the chosen cost value and executes a shell command to start a separate Python process running that other part of the workflow. It works, but it leads to various problems. For one, things are not closely integrated: we get extra overhead, and the separate workflow instance creates its own logging, audit files etc. So this is something I would like to see in the next workflow system we use: the ability to schedule new tasks continuously during the execution of a workflow.
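A hedged sketch of that work-around, with purely illustrative names: once the best cost is known, a task shells out to launch the downstream workflow in a fresh Python interpreter.

```python
# Sketch of the work-around: launching the rest of the workflow as a
# separate Python process once the chosen cost value is known.
# The launched command here is a stand-in for the real downstream workflow.

import subprocess
import sys

def launch_downstream(best_cost):
    # A new interpreter means separate logs, audit files and extra
    # overhead -- exactly the drawbacks described above.
    cmd = [sys.executable, "-c",
           "import sys; print('training final model with cost', sys.argv[1])",
           str(best_cost)]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

out = launch_downstream(0.3)
print(out.strip())  # -> training final model with cost 0.3
```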
Interestingly, this is a feature that comes for free in tools that adhere to the dataflow paradigm. In most dataflow tools you have independently running processes that receive messages with input data, and that continuously schedule new tasks as messages arrive, until the system sends them a message telling them to shut down. In other words, "dynamic scheduling" is really how dataflow systems work, which I find interesting. I think the dataflow system Nextflow works like this. And so does my little experiment with a pure Go workflow library, which I started hacking on out of frustration with some other tools a long time ago, although it still lacks most other popular features ;)
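To illustrate the dataflow style, here is a minimal sketch (again with made-up names, using plain Python threads and queues rather than any real dataflow tool): a worker is a long-lived loop that keeps accepting tasks from a queue until it receives a shutdown message, so "scheduling" a new task is just sending another message, at any point during execution.

```python
# Minimal dataflow-style sketch: a long-running worker process consumes
# task messages from a queue, so new tasks can be scheduled at runtime
# simply by sending more messages -- no separate scheduling phase.

import queue
import threading

SHUTDOWN = object()  # sentinel message telling the worker to stop

def worker(inbox, results):
    while True:
        msg = inbox.get()
        if msg is SHUTDOWN:
            inbox.task_done()
            break
        results.append((msg, (msg - 0.3) ** 2))  # stand-in for training
        inbox.task_done()

inbox, results = queue.Queue(), []
t = threading.Thread(target=worker, args=(inbox, results))
t.start()

for c in [0.1, 0.3, 1.0]:  # the initial parameter sweep
    inbox.put(c)
inbox.join()               # wait until the sweep tasks are done

best = min(results, key=lambda r: r[1])[0]
inbox.put(best)            # dynamic scheduling: a brand-new task at runtime
inbox.join()

inbox.put(SHUTDOWN)
t.join()
print(results[-1])  # -> (0.3, 0.0)
```

The key point is that the worker never needed to know the follow-up task in advance; the value computed from the sweep simply becomes the next message.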
I just had not realized how important this feature could be for very common use cases.
EDIT 2020-10-19: Fixed some typos and improved flow of some sentences.