In-Depth

The Science of Software

Microsoft and the scientific community work together to transform the world of computing.

Imagine a computer language that's used to create biological constructs instead of spreadsheets. How would you like to write code that modifies cells, invents smart drugs or turns the way plants process energy into new power sources for humans?

These are just a handful of the goals of the Microsoft Research European Science Program, a 2-year-old group that teams Microsoft researchers with some of the world's top scientists. Run by Stephen Emmott, the group hopes to blend computer science and traditional science, a combination that could finally crack some of the world's most vexing problems.

By understanding biology, we can fight disease. Through climate modeling, we can attack global warming. And by merging computing and science, we can advance both, Emmott believes.


Emmott's group will soon be working with some 100 scientists on projects to build new software tools to tackle these problems. In the process, the world of software development, even corporate development, could well be turned on its ear.

What can corporate developers learn from the scientific community? Scientists are confronting the same issues as enterprises and ISVs. New computing paradigms are already arising based on major leaps in hardware. For instance, the rise of multi-core processors as standard equipment in PCs, laptops and servers calls for a fundamental shift in programming and software architecture, one that assumes that processing can be distributed and is done in parallel.

In fact, these hardware shifts may well call for a new version of Moore's Law, where processing power isn't measured by the number of transistors per circuit but adapts to the notion of parallel processing.

Computer Model of a Chemical Wave

The aim of Microsoft's research is to help create fundamentally new and different computing paradigms. Underlying the changes are certain hardware advances: faster chips, multi-core processors, larger and faster storage and the like. There are even efforts to build radical new styles of hardware, such as quantum computers that store ones and zeroes in ions or atoms.

But it's software that will do the heavy lifting.

For scientists, a key aspect of the challenge is ensuring that data remains close to the applications and users in a distributed computing environment. Microsoft believes that robust, peer-to-peer networks and replicated databases for shared access will enable researchers to effectively share programs and data.

Advanced data management techniques perfected in the scientific community may be adopted by corporate developers and database professionals. On the software side, Service-Oriented Architectures (SOA) will be critical to next-generation distributed environments, allowing distributed programs and components to recognize each other, interact and share information.

A Scientific Software Approach
The science community requires a high degree of development discipline. Recreating software each time isn't just time-consuming and expensive, it's risky. Unproven code can undermine observations and threaten to discredit years of work.

Shareable components and libraries and robust standards for interoperability play a critical role. Third-party libraries become increasingly critical to building applications that can be extended and used for years to come. At the same time, scientists will offer their own custom libraries, often in open source form, back to the community.

Software components, according to Microsoft, include not just source and executable code, but the "expertise" of their builders. In a document titled "Towards 2020 Science Report," created by Microsoft Research and its counterparts from the science community, Wolfgang Emmerich of University College London, Simon Cox of the University of Southampton, and Clemens Szyperski and Don Syme of Microsoft make this point.

"There is a strong need to form collaborative communities that share architecture, service definitions, services, components frameworks, and components to enable the systematic development and maintenance of software assets," the authors write in the paper.

Scientific development not only has to deal with massive and growing quantities of data and collaboration, it must also handle new and increasingly complex approaches to problem solving and algorithms. These algorithms are far from static; they change dynamically based upon findings and new data sets.

Just as data needs to move closer to the scientist, so too does the programming. New and simpler programming paradigms must be built so scientists can more easily write software.

The "Towards 2020 Science Report" highlights the use of a proven Microsoft software model to help improve scientific productivity. Just as Microsoft Office acts as a front-end for ERP and other core business apps, by 2015 the company could be providing a similar front-end interface to scientific applications.

Microsoft hopes to give scientists, the paper says, "a front-end that is something akin to an 'Office for Science': an extensible, integrated suite of user-facing applications that remain integrated with the development environment that help address the many human-computer interaction issues of the sciences."

The Codification of Knowledge
Scientific -- and, indeed, business -- computing is not driven by rows and tables of arbitrary numbers. Data has meaning, and for scientific understanding, context is everything. In drug discovery, for instance, researchers need to make sense of the reams of data produced by experiments and the automation that high-performance computers can bring to bear. Reining in all this data falls under a discipline called "computational knowledge management."

For science, this means mixing the results of experiments with data gathered from the general population. And the data is not just numbers, but also images, video, text and other unstructured content. Scientists need intuitive ways of querying, organizing and analyzing these data types, ranging from natural language processing to intuitive conceptual analysis.

The same techniques can be applied to corporate databases. Instead of customers being represented by social security numbers and other bits of data, more direct and less abstract representations can take their place, such as photos, voice files and even video.

Emerging computing technology also enables the codification of knowledge. For Microsoft Research, codification means turning knowledge into a discrete program or data that can be manipulated by computers. That ability promises to unlock hidden meaning within large data sets to drive active understanding. A great example is the Human Genome Project, which has helped researchers understand the very building blocks of life.

Codification has another purpose, one that developers in general could well apply. As knowledge is organized and prepared for manipulation through software, it can be presented in all new ways. For example, scientists can create research journals based on databases rather than textual interpretations of the data. By linking these next-generation journals to underlying data, the entries become dynamic, changing as the data changes.
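The idea of a journal entry that changes along with its data can be sketched in a few lines of Python. This is only a minimal illustration, not anything described in the report: the class name `LiveJournalEntry`, the entry title and the sample numbers are all hypothetical.

```python
from statistics import mean


class LiveJournalEntry:
    """A journal entry that is computed from its data set on every read.

    Hypothetical illustration: the entry keeps a reference to the
    underlying data rather than a frozen textual summary of it.
    """

    def __init__(self, title, dataset):
        self.title = title
        self.dataset = dataset  # shared reference, deliberately not a copy

    def render(self):
        # The summary is rebuilt each time, so it always reflects
        # the current contents of the data set.
        return f"{self.title}: n={len(self.dataset)}, mean={mean(self.dataset):.2f}"


measurements = [1.0, 2.0, 3.0]           # made-up sample data
entry = LiveJournalEntry("Growth rate", measurements)
before = entry.render()

measurements.append(6.0)                  # a new observation arrives
after = entry.render()                    # the entry changes with the data
```

Because the entry holds a reference to the data instead of a snapshot, no one has to rewrite the text when a new observation arrives.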

Managing data is one thing, but enabling true understanding is quite another.

"The biggest problems are not simple number-crunching problems," Emmott explains. "They require huge advances in algorithms, sophisticated algorithms, in developing entirely novel approaches to what you might want to call the codification of scientific knowledge. That requires the instantiation of these powerful algorithms rather than just raw computing power. The instantiation in entirely new forms of software tools."

Artificial vs. Living
As computers and science advance and converge, one has to ask how to tell the difference between the artificial and the truly living. In fact, smart computers will be able, more and more, to create life, or at least create new biological parts. These parts could be formed into new organisms, experts argue.

Through a deeper understanding and codification of the human genome, the idea of building cell factories that create relatively complex cells becomes feasible. And because these cells can self-replicate, new medicines and chemicals can be produced in useful volumes.

These aren't simple cells -- they can be custom cells that do specific things. And because these cells can interact, communicate and adapt to one another, they can form the basis of true biological systems. This is the notion of synthetic biology.

Pretty cool, but scientists can also blend these newly formed cells with ultra-tiny computers so that the cells can report back and make decisions based on artificial programming.

These ultra-tiny computers are based on proteins, DNA or other so-called macromolecules. While molecular computers may not be able to play "Call of Duty," they can at least understand a basic input and, based on that input, produce output or make decisions.

Molecular computers could understand the state of the cells in which they reside and decide upon treatments, such as releasing drugs or stopping the flow of drugs. This forms the foundation of the smart drug concept.

"In the shorter term, smart drugs are targeted drugs based around knowing something about a person's genome and the immune system. In the longer term, smart drugs would be biological computers that operate on a nano scale, literally based upon exactly the same principles as a silicon computer but made up of biological molecules rather than silicon," Emmott says. "They could be injected into a cell and act as a doctor in a cell -- being able to detect the early signs of disease, which is typically gene overexpression or underexpression, and being able to work out the most effective therapy for that particular symptom or early signs of disease. That's decades away."

While small, these tiny computers include many of the elements that define today's most sophisticated computing systems.

"This molecular computer-as-smart drug utilizes several key computer science concepts: information, computation, input/output relation, propositional logic, stochastic automata, modularity, interfaces, parallel computation, specification versus implementation, interchangeability of program and data, interpreters, reliability, and building reliable systems from unreliable components," write Manuel C. Peitsch, David B. Searls, Ehud Shapiro and Neil Ferguson in the "Towards 2020 Science Report."
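To make the input/output idea concrete, the decision logic of such a molecular computer can be sketched as a tiny two-state automaton in Python. This is a deterministic simplification for illustration only, not the design described in the report: the gene labels, the thresholds and the names `Marker`, `classify` and `decide` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Marker:
    gene: str          # hypothetical gene label
    low: float         # below this level counts as underexpression
    high: float        # above this level counts as overexpression
    disease_sign: str  # "over" or "under": the reading that signals disease


def classify(level, marker):
    """Turn a raw expression level into 'over', 'under' or 'normal'."""
    if level > marker.high:
        return "over"
    if level < marker.low:
        return "under"
    return "normal"


def decide(levels, markers):
    """Two-state automaton: start in 'yes', drop to 'no' on any mismatch.

    The drug is released only if every marker shows its disease sign.
    """
    state = "yes"
    for marker in markers:
        if state == "yes" and classify(levels[marker.gene], marker) != marker.disease_sign:
            state = "no"  # absorbing state: one healthy reading vetoes release
    return "release drug" if state == "yes" else "withhold drug"


# Made-up diagnostic rule: GENE_A overexpressed AND GENE_B underexpressed.
markers = [
    Marker("GENE_A", low=0.5, high=2.0, disease_sign="over"),
    Marker("GENE_B", low=0.5, high=2.0, disease_sign="under"),
]
sick_cell = decide({"GENE_A": 3.0, "GENE_B": 0.2}, markers)
healthy_cell = decide({"GENE_A": 1.0, "GENE_B": 0.2}, markers)
```

A real molecular automaton would be probabilistic rather than exact, which is why the authors list stochastic automata among the relevant concepts.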

Systems Biology
Perhaps the biggest area of inquiry is systems biology, which studies how various biological components work together.

"The Human Genome Project was this triumph at the end of the 20th century of computing and biology and mapped the entire blueprint for life of a human organism out for the first time," Emmott says. "What it did was demonstrate -- to everyone's surprise -- that mapping the human genome didn't actually tell us very much about system behavior and system organization. By system I mean of the whole organism -- its organization, its structure and its behavior."

Systems biology, by explaining the relationship between, say, proteins and genes, can get us to the next step. "Systems biology's aim is to understand, from a system level, how an organism functions and how it's organized and how it processes information -- rather than just looking at a parts list," Emmott explains.

So where is this all leading, and what does it mean for the world of software? Nothing less than the building of programming languages for biology.

The world of science is outgrowing older languages such as Fortran, Microsoft researchers believe. New languages will comprise a mix of compiled and interpreted code, and will support both parallel and asynchronous programming models. Scientific applications will also be built with a blend of languages.

The idea is to create programming languages "representing the structure and function of biological systems via formal languages, for description, simulation, and analysis and [eventually] compilation," writes Luca Cardelli, head of programming principles & tools for Microsoft Research, in the "Towards 2020 Science Report."

Emmott says a computational model built from bio-informatics can help scientists build a programming language that is rich and executable. "It is literally like a programming language that is modular, that is highly parallel, as biological systems are and as concurrent as computer languages are increasingly going to need to be. And which is highly compositional, again which is what computer languages are or increasingly need to be," Emmott explains. "So you might end up with a computer language that can specify the operation of a particular bio-chemistry and then build modules that fit together like libraries in a computer system -- and which can make the various aspects of bio-chemical pathways and the bio-chemistry operate together to produce some greater higher level systems behavior," he says. Such a language could, for instance, produce proteins.

Scientists could push it even further and "build a language that can describe much more complex system-level behavior -- for example, the immune system and how the immune system works by using concepts from computer science, from software engineering and from programming theory. Through process algebra and formal verification and testing, you can potentially create a very powerful language ... for specifying the operation and processes that go on in a complex biological system," Emmott continues.

"That's a good example of combining new conceptual tools from computer science, in this case process algebra, the concepts such as abstraction, concurrency and compositionality, and instantiating those and implementing those into new technological tools. The formal name for this is a stochastic pi machine, for building a language and a system for modeling highly complex biological systems."
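For a rough feel of what "stochastic" means in this setting, here is a minimal Python sketch of a common technique for simulating chemical reactions one random event at a time, Gillespie's direct method. It is not the stochastic pi machine itself and involves no process algebra: the single reaction A + B -> C, the rate constant and the function name `simulate_binding` are assumptions chosen for illustration.

```python
import random


def simulate_binding(a, b, rate, t_end, seed=0):
    """Gillespie direct method for one hypothetical reaction: A + B -> C.

    a, b   -- starting molecule counts
    rate   -- assumed stochastic rate constant
    t_end  -- stop once simulated time would pass this value
    Returns a list of (time, a, b, c) snapshots, one per reaction event.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    t, c = 0.0, 0
    history = [(t, a, b, c)]
    while True:
        propensity = rate * a * b      # chance per unit time of one binding
        if propensity == 0:
            break                      # one reactant is used up
        t += rng.expovariate(propensity)  # random wait until the next event
        if t > t_end:
            break
        a, b, c = a - 1, b - 1, c + 1  # one A and one B become one C
        history.append((t, a, b, c))
    return history


# Run until the reaction cannot fire any more.
trace = simulate_binding(a=10, b=5, rate=0.1, t_end=float("inf"))
final_time, final_a, final_b, final_c = trace[-1]
```

Each run with a different seed gives a slightly different timeline, which is the point: at the scale of a single cell, molecule counts are small enough that chance matters.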

Brave New World
As computing pushes science to new heights, the science of computing itself advances. "Advances in science, especially biology and chemistry, could create the building blocks of a fundamental revolution in computing," the Microsoft report argues. Computers made of biomolecules are one possibility.

Emmott agrees: "In the next 10 to 20 years, the tools we're starting to research and develop, especially for biology, could uncover remarkable new insights into the processes and information processing of biological systems that reveal how to create entirely new building blocks for computing. [These could] be based upon how biology and biological systems, which through three and a half billion years of their own R&D, have developed elegant solutions to information processing that are unimaginable today by today's computer scientists and engineers."
