How statistics married algorithmics and in the process changed the scientific method
[Notes for my talk at “Information Beyond Shannon” held in Venice, December 29th 2008]
In the summer of 1609, almost exactly 400 years ago, Galileo Galilei was here in Venice, trying to sell his telescope to the Doge in return for tenure. He had not really invented it: the device was the creation of the Dutch spectacle-maker Hans Lipperhey, of which Galileo had only heard a description. He greatly improved it, and offered it to the Venetian fleet as an aid to navigation and to the early detection of approaching ships.
During 1609 Galileo perfected his lens-grinding skills, experimenting with methods and designs. He created various models and analysed the optical principles behind them. As a result of this investigation, the quality of the instrument was greatly enhanced, and new designs became possible.
He could have started a business making telescopes, or magnifying glasses, or spectacles. He could have been satisfied with the wage he received from the Republic of Venice.
But Galileo was a scientist, not just a tool maker. Although he did design, create, and test some of the best tools of his time, he was not just concerned with the engineering aspects of his work, and the commercial opportunities. As a true scientist, he was interested in understanding the world around him, something that would get him into trouble more than once.
In the summer of 1609, at age 45, he turned the telescope to the sky and began his investigation of the Moon. He discovered mountains and valleys by observing their changing shadows. Most importantly, he discovered that the Moon – contrary to Aristotle’s opinion – was not a perfect sphere. Something was wrong with the established model of the Universe.
Later he discovered with the telescope that Jupiter was orbited by four moons, which showed that in at least one case things did not revolve around the Earth. Then, with the same tool, he discovered that Venus has phases, just like our Moon.
In fact, he realised, Aristotle was wrong: the Earth, Venus and Jupiter orbited the Sun, and the Moon orbited the Earth, just as the four moons of Jupiter orbited their planet. And the Moon – at least – was not a perfect sphere, but had mountains; he could measure their height by measuring their shadows, and predict which of them would come out of the dark first every month. What he had been taught was wrong.
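The geometry behind that height measurement is simple enough to reproduce. A peak whose summit is still sunlit at a distance d beyond the terminator (the day–night line) must rise high enough to catch the Sun’s grazing, tangent rays, so by Pythagoras (R + h)² = R² + d², i.e. h = √(R² + d²) − R ≈ d²/2R. A minimal sketch, where the distance of 150 km is an illustrative value of the kind Galileo could read off his drawings, not his actual measurement:

```python
import math

R_MOON_KM = 1737.4   # mean lunar radius (modern value)

def peak_height(d_km, radius_km=R_MOON_KM):
    """Height of a peak whose summit is sunlit d_km beyond the terminator.

    The grazing ray of sunlight is tangent to the lunar sphere at the
    terminator, so (R + h)^2 = R^2 + d^2 by Pythagoras.
    """
    return math.sqrt(radius_km ** 2 + d_km ** 2) - radius_km

# An illustrative distance of ~150 km gives a height of roughly 6.5 km --
# close to the "four miles" Galileo reported for lunar mountains.
h = peak_height(150.0)
```

The same tangent-ray argument, run in reverse, is what let Galileo predict which peaks would emerge from darkness first each month.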
His work was published remarkably fast, in March 1610, in a short booklet entitled “Sidereus Nuncius” (Starry Messenger). This work was important not because it had direct implications for how we did things on Earth – although it had those too – but because it was eventually responsible for a fundamental revolution in our thinking. Its implications were theological, and landed him in trouble with the Church, among other things. His observations forced him to question the received wisdom, and this is always an act of challenge, although one that is expected of scientists. These implications were also philosophical and methodological.
In fact, that was a very early example of modern systematic scientific investigation. A scientific instrument was used to make observations, mathematical relations were derived for the geometry on the Moon, and predictions were used as a way to validate the models.
For this and many other contributions, Galileo is associated with a major shift in scientific method, which is the very topic I would like to discuss today.
The systematic way we derive and represent unambiguous knowledge, so that it has predictive and explanatory power over the world, is a major achievement of our culture. Not all cultures focused on a systematic approach to knowledge acquisition and revision – consider, for example, the Romans. There are many ways of knowing the world; this one is a systematic, organised process to produce knowledge that is reliable, and to remove that which is not.
Over the centuries, we have started gathering knowledge in an organised process, involving a cycle of experiment design and hypothesis generation, representing the results – wherever possible – in unambiguous mathematical terms. This has been the accepted way in which we do science for the past few centuries, but is not the only possible way.
In fact, the scientific method has been in constant evolution for a long time. The same can be said of the practices we follow as a research community, with anonymous peer review and publication of results a crucial part of the ritual of science.
Observations lead to competing models, and this leads to experiments, and their outcomes are used to revise the current models, and this in turn suggests new experiments, and so on, in a loop. The discovery of mechanical laws can be seen in this light, with competing intuitions about masses, accelerations and frictions, leading to key experiments. In most cases, these loops are much more complex and interconnected, but the interactive nature of the modelling process is often very visible.
But things are changing fast. Now the process is going through a sort of “industrial revolution”.
Data are gathered automatically, by computers or even robots, effectively acting as massive measurement apparatuses, replacing what were for Galileo the thermometer or the clock. Increased accuracy and ubiquity of measurement devices result in vast repositories of experimental data, stored in dedicated disk farms.
We can look at the examples of Physics, Molecular Biology, Drug Design and Astronomy. They all exemplify the same trend in science.
The Large Hadron Collider at CERN is a machine designed to produce experimental data, potentially 15 petabytes per year. The engineering challenges in producing, storing and managing this amount of information have reached epic proportions. But it is the analysis of this data that is truly mind-boggling. And this experiment can be seen – in a way – as the direct descendant of the physical experiments initiated 400 years ago by Galileo: the systematic investigation of the basic laws of nature has led us to this point.
Similar challenges are encountered by today’s biology. The direct descendants of Mendel’s painstaking collection of genetic inheritance data are experiments aimed at the full sequencing of thousands of genomes at once. Terabytes of data are now produced by each of the new generation of sequencing machines, and the Sanger Centre in Cambridge is now working on the 1000 Genomes Project. Hundreds of species have now been fully sequenced, and we are well down the road of comparing multiple complete sequences within the same species.
In drug design, it is standard to test compounds to see if they bind to a given target by exhaustively screening entire libraries of chemicals, in what is called combinatorial chemistry. Hundreds of thousands of compounds can be generated and tested, either by using robotics or – increasingly – by computer simulations, in what is essentially a survey of entire regions of chemical space, hunting for compounds with a given set of properties.
Astronomy – another child of Galileo’s – is now done by automatic surveys of the night sky run by computers, and by the subsequent automatic analysis of the images and data gathered in this way. One such project, the Sloan Digital Sky Survey (SDSS), created a 5-wavelength catalogue covering 8,000 square degrees of the sky and containing about 200 million objects, each described by hundreds of features (the data released incrementally to the public).
The SDSS used a dedicated 2.5-metre telescope at Apache Point Observatory, New Mexico. Its 120-megapixel camera imaged 1.5 square degrees of sky at a time, about eight times the area of the full Moon. A pair of spectrographs fed by optical fibres measured spectra of (and hence distances to) more than 600 galaxies and quasars in a single observation.
The database generated over 8 years by this automated survey is several Terabytes large, presenting serious challenges to data management and mining.
In fact, this point is the key to this discussion, and is common to all the examples mentioned above. There is no way that people can analyse, unaided, the data produced at the LHC, at the Sanger Centre, or by sky surveys. Such experiments can only be conceived because we can rely on computers to do the analysis of the data for us.
And this is the point I want to make: our scientific method has changed. The revolution is not a matter of detail, or even quantity. It is a matter of quality. We have industrialised both the production and the analysis of experimental data. We have industrialised the generation of scientific knowledge.
The automatic analysis of patterns in data, the automatic generation of hypotheses, are fundamental parts of science. This is how computer science, statistics, maybe artificial intelligence, are finding their way to the core of all science, and to the core of how we know our world. This is how they are at the centre of a revolution that will have significant consequences.
A Newer Method.
The automatic analysis of data, in search for significant – if elusive – patterns, is now a key part of many scientific experiments, and this is an increasing trend.
Statistics and computer science, together with the convergence of dozens of smaller disciplines, create a conceptual and technical framework and body of knowledge that I call Pattern Analysis. It includes tools to extract significant information from networks, images, strings, text, bio-sequences, vectors, time series…
The information created and manipulated by machines today is not the same information we study in our introductory courses of Information Theory.
We may think that the process of scientific discovery will not be fully automated until machines are able to generate complete theories of a domain, with their formalism and equations.
This deserves two fundamental responses: 1) it is not out of reach for machines; 2) it is not necessary for machines to be doing science.
I will focus only on point 2. We all think that the output of a scientific investigation such as Einstein’s should be a set of equations, and their interpretation, that can be used to work out predictions or models, for specific outcomes and specific experiments. We focus a lot on analytic manipulations of these general equations, as an example of abstract knowledge manipulations.
But this is not necessary to science. The output of the scientific process does not need to be a set of differential equations – although this is what we have come to expect from physics. These equations are useful only when they are applied to particular systems and situations, and calculus is used to specialise them to those situations. Then simple mathematics is often used to make predictions about the behaviour of systems under various conditions.
Being able to make these predictions is the only reason why we have these equations, and we represent knowledge in the language of functions and equations because that was the best way to represent and compute in the past.
What if we had a computer that can make the same predictions without needing to start from a set of high level equations, but instead starting from a set of relations discovered in data?
Just as these equations derive their meaning from their use, one could argue that predictive patterns discovered in data could play a similar role.
Besides, it is quite possible for machines to summarise these patterns in compact theories, only to deduce them back when needed from the basic axioms. But would that be useful?
When in history did we start considering a phenomenon ‘explained’ once we had a few differential equations describing its dynamics? It surely must have started in mechanics, perhaps with Newton. But these equations are ultimately combined together, and with observations of initial conditions, in order to derive predictions. What if we could just derive predictions from initial conditions and knowledge that is represented in a different way?
Patterns extracted from data can reliably be used to make predictions – just ask Amazon.com or Google.com – without the need to formulate the knowledge contained in them in the form of differential equations or equivalent unified theoretical description.
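As a toy illustration of this idea, consider predicting the position of a falling body purely from recorded observations, with no equation of motion in sight. A minimal sketch: the dataset, the choice of k, and the inverse-distance weighting are all illustrative, not anyone’s actual system. The law 0.5·g·t² is used only to generate the observations; the predictor never sees it.

```python
# Toy observations of a falling body: (time in s, distance fallen in m).
data = [(t / 2, 0.5 * 9.81 * (t / 2) ** 2) for t in range(7)]

def predict(t_query, k=2):
    """Inverse-distance-weighted average of the k nearest observations.

    A purely pattern-based prediction: no model of the system, just
    the regularity present in the recorded data.
    """
    nearest = sorted(data, key=lambda p: abs(p[0] - t_query))[:k]
    weights = [1.0 / (abs(t - t_query) + 1e-9) for t, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)
```

Here `predict(1.25)` interpolates between the observations at t = 1.0 and t = 1.5 and lands close to what the equation would give: the predictive content lives in the pattern, not in a formula.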
We could even mention the role of large scale computer simulations, in this respect, as a way to both make predictions and to test hypotheses.
But what matters is that at the centre of this paradigm shift is our capability to gather, store, manage and analyse massive amounts of data automatically. And this is the permanent marriage between statistics and computer science – and many other sub-disciplines – that we are discussing today.
These tools, just like Galileo’s telescope, were perhaps not derived for doing science, but very often for doing business. But just like Galileo, we can turn them and use them to change the way we understand our world.
And the fact that we are using off-the-shelf hardware to produce data, and to manage and store it, and we are using commercial software to analyse it, can only signal that further accelerations are to be expected, as costs are driven down.
My little research group makes extensive use of pattern analysis technologies that were originated for practical or industrial applications. We enjoy developing them. But we do it for a scientific reason too. We are currently experiencing our share of new-science, by turning our attention to another type of sphere, much like Galileo did with the Moon. We are looking at the contents of the global media-sphere, to understand how something so common and under everybody’s eyes actually works. This just cannot be done without the right tools.
We are interested in observing (and modelling) how ideas flow and interact as they traverse the media system: every outlet can pick and choose whichever news it wants to carry; each user can choose whichever outlet they want to read; complex dynamics regulate the resulting process of information selection and diffusion; but simple patterns emerge, if we look in the right place.
But reading the contents of the mediasphere, for a machine, means being able to understand human language – at least to some extent. And this is a totally new ingredient that we can add to the mix today. Machines can actually read and understand text. We are translating every day from 22 languages, and we are reading 1,100 news outlets. We have found 450K named entities, for example, whose popularity follows a near-perfect power law, and interesting relations such as a three-fold extra interest in the Pope in Spanish-language media over English-language media.
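A power law of popularity means the frequency of the r-th most mentioned entity falls off roughly as r^(−α). A standard check is a least-squares fit on log–log axes; the sketch below runs one on synthetic counts (the counts and the exponent are made up for illustration, not the actual mediasphere figures):

```python
import math
import random

# Hypothetical rank/frequency counts for named entities: an ideal
# power law with exponent 1, perturbed by a little multiplicative noise.
random.seed(0)
ranks = range(1, 201)
freqs = [10000.0 / r * random.uniform(0.9, 1.1) for r in ranks]

# Least-squares fit of log f = log c - alpha * log r.
xs = [math.log(r) for r in ranks]
ys = [math.log(f) for f in freqs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
alpha = -slope   # recovered exponent, close to the true value of 1
```

On real counts one would also inspect the residuals and the tail, since many distributions look straight on log–log axes over a narrow range.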
We are detecting text re-use, with massive-scale implementations of suffix trees, and tracking memes as they spread through the 10K outlets forming the global media sphere. We are recreating social networks, and detecting biases in the choice of topics and words in various types of outlets. We even measure readability.
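Text re-use detection boils down to finding long passages shared by two documents; suffix trees do this in linear time at scale. The quadratic dynamic-programming stand-in below shows the same idea on a toy scale (the function name and the example sentences are illustrative, not part of our system):

```python
def longest_shared_passage(a, b):
    """Longest common substring of a and b, by dynamic programming.

    cur[j] holds the length of the longest common substring ending
    at a[i-1] and b[j-1]; a suffix tree computes the same answer in
    linear rather than quadratic time.
    """
    best, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best, best_end = cur[j], i
        prev = cur
    return a[best_end - best:best_end]
```

Run on two news sentences that quote the same source, it recovers the verbatim passage they share, which is exactly the signal used to track a meme across outlets.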
Social scientists have been interested in understanding the media system for decades, but their investigations could only be performed by hand, on a limited number of outlets, time spans and topics. True constant monitoring of all outlets and all topics in all languages is now within reach, and the automatic analysis tools are becoming available.
But there is even more in store when it comes to changing our scientific practice. The data revolution is also changing the way we publish our results. Peer review has been around for a few centuries, and has been a very important tool on the way to objective and reliable publication of results. But it is only one of the many ways in which this can be done. PageRank-style scores, for example, could be used to assess the significance of contributions.
In 2005, two years before he was lost at sea, Turing Award winner Jim Gray was touring the US giving his lecture on the future of science, which he saw as e-science. He gave his talk at UC Davis when I was there, and I was impressed by how he managed to connect what is happening in geography to what is happening in biology and every other science. “Any science X now has a computational-X version,” he said, and this affects the way we gather and analyse data, as well as the way we share data and results.
The examples of PubMed and GenBank could be followed by other sciences in the future, with a tight integration of results, data and methods, shared globally to create a single unified resource.
The scientific method is today evolving faster than ever. The automation, systematisation and industrialisation of information gathering and analysis, are accelerating the rate at which we expand our knowledge of the world. Machines now produce knowledge about our very own biology. The proportions of this transition should not be underestimated, and the science of patterns, information and knowledge is at the centre of this storm.
Galileo Galilei could have kept on making hi-tech tools and gadgets, and would certainly have found enough customers to make a comfortable living. But he was a scientist, and he used them to understand the world around him. In the process he used mathematical representations of the laws that he discovered, experiments to gather data, and an overall very modern methodology. He also got into trouble with the authorities, because he refused to keep his telescope aimed low enough, and refused to ignore what he saw with it.
A new generation of scientists, with a new generation of tools, can now do the same, and gather unprecedented types of data, and draw far reaching conclusions about our world. The automatic collection of data in genomics, chemistry, astronomy, physics and also the social sciences, will revolutionise the way we see our world, and will further an understanding of it as a single interconnected system.
But we also need to keep in mind another aspect of Information Beyond Shannon: the same combination of novel data gathering and data generation techniques is not just changing the scientific method, but also our very notion of privacy. Surveillance technology, combined with data analysis, forms a powerful and unsettling mix. The costs are so low now, and the technology so widespread, that we have to conclude that it is not just the way we do science that is changing. Some more fundamental notions, involving our individual rights, are also evolving fast, and law-makers have not been able to keep up.
As we develop newer and stronger and cleverer tools to squeeze information out of inexpensive databases formed by today’s technology, we should also feel an obligation to develop a strong conceptual framework to think about privacy and possible abuses of this technology.
- Kevin Kelly (evolution of scientific method)
- Cory Doctorow, on Nature
- Sanger Center