(This post is migrated from this post on Samuel's old tech blog)
Explicit knowledge is too expensive
There are lots of questions that a computer cannot answer from data alone. Probably the majority of what we humans perceive as knowledge is inferred from a combination of data (simple fact statements about reality) and rules that tell how facts can be combined. Applying the rules makes implicit knowledge (knowledge that is not persisted as a fact anywhere, but has to be inferred from other facts and rules) explicit.
One can easily imagine, though, that storing every single piece of knowledge that could be stated as an explicit fact would require more storage than can probably ever be made available in this universe.
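To make the distinction between stored facts and inferred knowledge concrete, here is a minimal sketch in Python. The facts, the predicate names (`expressed_in`, `part_of`) and the two rules are all made up for illustration; a real system would use a proper rule engine or reasoner rather than this naive loop.

```python
# Hypothetical facts, stored explicitly as (subject, predicate, object) triples.
facts = {
    ("geneX", "expressed_in", "nucleus"),
    ("nucleus", "part_of", "cell"),
    ("cell", "part_of", "embryo"),
}


def infer(facts):
    """Naive forward chaining: apply the rules until nothing new appears.

    Rules (both are assumptions made for this sketch):
      1. part_of is transitive.
      2. If S is expressed_in A, and A is part_of B, then S is expressed_in B.
    """
    known = set(facts)
    while True:
        new = set()
        for (s, p, o) in known:
            if p not in ("part_of", "expressed_in"):
                continue
            for (s2, p2, o2) in known:
                # Chain a fact with any part_of fact that starts where it ends.
                if p2 == "part_of" and s2 == o:
                    new.add((s, p, o2))
        if new <= known:
            return known
        known |= new


# Implicit knowledge: everything the rules add beyond the stored facts.
derived = infer(facts) - facts
for triple in sorted(derived):
    print(triple)
```

Only three facts are stored, yet the rules yield further statements (for instance that `geneX` is expressed in the embryo) that were never written down anywhere.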
Simulations can make knowledge explicit, from first principles
It is not too hard to come up with processes that are so complex, and involve so much variability [1], that it is unrealistic to try to capture every imaginable state of the system or process in explicit facts. Instead we must seek the "first principles" that define the process, and through simulations make explicit any knowledge we are looking for, at the time we need it (one can of course cache often accessed knowledge).
So: simulations can (temporarily) make explicit the knowledge we are looking for, but which does not exist in explicit form.
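As a rough sketch of "compute on demand, cache what is asked for often", consider the toy example below. The model (constant synthesis, first-order degradation), the parameter values and the function name `expression_level` are all assumptions chosen for illustration, not taken from any real biological model.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)  # keep often-requested answers around
def expression_level(days, k_syn=2.0, k_deg=0.5, x0=0.0, dt=0.001):
    """Simulate dx/dt = k_syn - k_deg * x with forward Euler steps.

    The equation and all parameter values are hypothetical; they stand in
    for whatever "first principles" describe the real process.
    """
    x = x0
    steps = int(round(days / dt))
    for _ in range(steps):
        x += dt * (k_syn - k_deg * x)
    return x


level = expression_level(3.0)  # not stored anywhere: computed by simulation
again = expression_level(3.0)  # same question again: served from the cache
print(level, expression_level.cache_info())
```

No table of expression levels exists in advance. The value for day 3 becomes explicit only when someone asks for it, and stays explicit only as long as the cache keeps it.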
Simulations should be automated
There is a lot of simulation software for biological systems out there: deterministic, stochastic and agent-based tools, to mention a few categories. The plethora of different systems to choose from does not make life easier for the bench scientist. To improve that situation, and to provide some automation, a standardized way to deal with simulations is needed.
What is done already?
There are some efforts to increase interoperability, most notably the SBML standard, a standard file format for molecular biology models (pathways, gene regulatory networks, and the like). But how do we make things more "automatable", which is one of the main goals of the semantic web?
Still more to do
I wonder whether that is enough, though. If we want to automate knowledge discovery using these kinds of systems, we also need to capture the outcome of simulations in a computer-readable way, not only the underlying model [2].
So, all in all, I think there is some work to be done in wrapping simulation software in a "semantic shell" that understands the language of incoming data (it might even need to be able to produce such data, from semantic queries), can analyze the simulation results, and can provide the questioner with an answer in a semantic way.
Thus, by wrapping simulations in semantics, one might be able to automate the answering of questions with no answers! (at least, that's the idea :) )
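To give a feeling for what such a shell could look like, here is a very small sketch. Everything in it is hypothetical: the class names `Query`, `Answer` and `SemanticShell`, the field names, and the placeholder simulator. A real shell would use ontology terms instead of plain strings and would call actual simulation software.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Query:
    """A question in structured form (field names are made up for this sketch)."""
    quantity: str      # what is asked for, e.g. "expression_level"
    entity: str        # what it is asked about, e.g. "geneX"
    time_days: float   # the time point of interest


@dataclass(frozen=True)
class Answer:
    """A structured answer that records the question and how it was derived."""
    query: Query
    value: float
    derived_by: str    # name of the simulator that produced the value


class SemanticShell:
    """Routes a structured query to a simulator that is able to answer it."""

    def __init__(self) -> None:
        self._simulators: Dict[Tuple[str, str], Callable[[float], float]] = {}

    def register(self, quantity: str, entity: str,
                 simulator: Callable[[float], float]) -> None:
        self._simulators[(quantity, entity)] = simulator

    def ask(self, query: Query) -> Answer:
        key = (query.quantity, query.entity)
        if key not in self._simulators:
            raise LookupError(f"no simulation registered for {key}")
        simulator = self._simulators[key]
        return Answer(query, simulator(query.time_days), simulator.__name__)


def toy_gene_model(days: float) -> float:
    """Placeholder simulator (linear growth), purely for illustration."""
    return 1.5 * days


shell = SemanticShell()
shell.register("expression_level", "geneX", toy_gene_model)
answer = shell.ask(Query("expression_level", "geneX", 2.0))
print(answer)
```

The point of the design is that the questioner never touches the simulator directly: a structured question goes in, and a structured answer comes out that also says which simulation it came from.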
Update (March 2010): Building blocks already there
When I wrote the post above, I had not imagined that the building blocks for this were already in place, which is what I realized when looking at the picture at the bottom of the page on the BioModels website.
The BioModels initiative itself (which collects systems biology models that can be reused by others) is highly interesting, but what I found extra interesting was that they apparently have both data formats and ontologies for the three main parts that probably need to be addressed when wrapping simulations in semantics:
- Model description (Format: SBML, SBGN | Ontology: SBO)
- Simulation description (Format: SED-ML | Ontology: KiSAO)
- Simulation results description (Format: SBRML | Ontology: TEDDY)
So, it seems that the idea (wrapping simulations in semantics to create an automated system for answering questions whose answers are hidden in simulation results) might not be so unrealistic after all.
- 1. Think of the embryonic development process, for example, and then add dimensions like species, environmental factors, mutations, etc. Say, for example, that we are looking for the expression level of gene X, in compartment Y, in species Z, after A days of growth, with a temperature that goes from +17 °C to +4 °C in a gradient during that A-day period. You can easily see that there are quite a few possible combinations of factors affecting the state of the system in every time step.
- 2. Most probably, principles from the [http://sadiframework.org/ SADI framework] will come in useful here.