
Are We There Yet?
Nello Cristianini – University of Bristol
[NOTE: this article is currently submitted for publication, and is based on my Keynote Speeches of ICANN 2008 and ECML/PKDD 2009]

Abstract
Statistical approaches to Artificial Intelligence are behind most success stories of the field in the past decade. The idea of generating non-trivial behaviour by analysing vast amounts of data has enabled recommendation systems, search engines, spam filters, optical character recognition, machine translation and speech recognition, among other things. As we celebrate the spectacular achievements of this line of research, we need to assess its full potential and its limitations. What are the next steps to take towards machine intelligence?

Machine Intelligence, AD 1958
On November 23rd, 1958, a diverse group of scientists from around the world and from many disciplines gathered near London for a conference that lasted four days and involved about 200 people. The topic was: can machines think?

The Conference was called “On the Mechanisation of Thought Processes” and its proceedings encapsulate the zeitgeist of those days, and give us a chance to reflect on the achievements and directions of research in Machine Intelligence.

That group of engineers, biologists and mathematicians represented both the early ideas of Cybernetics and the newly emerging ideas of Artificial Intelligence. They were brought together by the common vision that mental processes can be created in machines. Their conviction was that natural intelligence could be understood in the light of the laws of science, a position spelled out in Alan Turing’s 1947 paper “On Intelligent Machinery” [11]. They also believed that it could be reproduced in artefacts.

Their common goals were clearly stated: understanding intelligent behaviour in natural systems and creating it in machines. The key challenges were identified and named, in the Preface of the proceedings: “This symposium was held to bring together scientists studying artificial thinking, character and pattern recognition, learning, mechanical language translation, biology, automatic programming, industrial planning and clerical mechanisation. It was felt that a common theme in all these fields was ‘the mechanisation of thought processes’ and that an interchange of ideas between these specialists would be very valuable”.

A further look at the two volumes of the Proceedings reveals a general organisation that is still found in modern meetings in this area. Sessions were devoted to: General Principles; Automatic Programming; Mechanical Language Translation; Speech Recognition; Learning in Machines; Implications for Biology; Implications for Industry.

The list of participants included both members of the Cybernetics movement (from the UK Ratio Club and the US Macy Conferences) and exponents of the newly growing AI movement. It included Frank Rosenblatt (inventor of the Perceptron); Arthur Samuel (inventor of the first learning algorithm); Marvin Minsky (one of the founding fathers of AI); Oliver Selfridge (inventor of the Pandemonium architecture, a paradigm for modern agent systems); John McCarthy (inventor of LISP, and of the name Artificial Intelligence); Donald MacKay (cyberneticist); Warren McCulloch (co-inventor of the neural network model still used today); Ross Ashby (creator of the Homeostat); Grey Walter (roboticist).

Patterns in Personal Data

The Privacy Delusion

When we play computer-chess online, we do not expect our intelligence to be measured and compared with our school records, and we certainly do not expect those records to be sold to recruitment agencies.

As we shop for holidays or extravagant shoes, we do not expect this information to be given to debt collection agencies or credit-scoring organisations. After all, our everyday world does not work like that. Does it?

When we buy a newspaper, no one knows which articles we end up reading. Certain information is just for ourselves, like the name of the girl we liked at school or our passion for Abba music. Whether we exchange emails to organise a surprise party, search for information about a skin condition that is worrying us, or just take a walk on the beach, we do not expect our activities to be monitored. Do we?

What we do expect today is that a certain part of our life will be kept private, if for no other reason than because there are so many of us that we cannot imagine anyone making the effort to gather information about all of those things for every person. We literally hide in the crowd, relying on numbers – if not the decency of others – to protect a little part of our personal sphere. These expectations may already be misleading today, or become so tomorrow, as the world is changing fast.

In order to be useful, information needs to be gathered, stored or transmitted, then processed and finally acted upon. Every step of this chain has undergone major transformations in the past decades, and can now be done automatically, cheaply and very efficiently by machines. Taken together, these technological advances have enabled a revolution in our society, but at the same time they can pose a threat to the privacy of our personal sphere. What information can be automatically gathered today (we discuss only the technical aspects here [1]), and how much about our personal lives can be inferred from it?

Let us take our love affair with electronic transactions. These include the use of debit cards to buy even a coffee, the automatic payment of our salaries into our accounts, the use of mobile telephones, the use of the internet to plan journeys and buy tickets, and so on.

As we conduct our life in this new and empowering digital age, we leave behind a permanent trail of personal data that is never deleted, and that is instead carefully analysed, and even traded, to model our behaviour and – in some cases – to try to influence it. Never before has so much data been collected about so many people.

This talk takes a brief look at the surveillance of transaction space, from the point of view of “what is technically possible”.

Whether these data are used by humans or machines, and whether this distinction matters, will be discussed below. Let us have a first peek at a region of transaction space.

 

The AOL Case Study

Although most users will never be aware of it, it is standard practice for search engines to gather and analyse a log file of all the queries that each user has issued. Together with the content of the query and the time stamp, information is collected that allows analysts to identify the machine from which the queries were made (cookies, IP addresses, in some cases user login information). This is done for a variety of technical reasons, but its potential was soon noticed.

Most end users will never be aware of the existence of log files, and indeed, it is not common even for researchers to have access to this information, as it is sensitive both for its commercial value and for its privacy implications.

But on August 4, 2006, AOL Research released a file on one of its websites containing the search logs of over 650,000 users over a 3-month period, intended for research purposes. It contained about 20 million search queries. All the transactions were anonymised, but each user was identified by a unique ID number, so that it was possible to connect queries performed by the same person, but not to identify the person.

A few days later, acknowledging that this was an error, AOL removed the file from public access, but the data are still available at various internet locations, for all of us to explore. Those responsible were later fired.

Besides triggering a series of news articles and also various lawsuits, this incident gives us a rare glimpse into the – often invisible – backroom of online businesses. It allows us to have direct experience of a rather theoretical concept: the trail we leave in transaction space every day and, more generally, the amount of data that are automatically generated about so many of our daily activities.
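To make this concrete, here is a minimal sketch (in Python) of how a per-user search history can be reconstructed from a released log of this kind. The tab-separated layout, the column names (AnonID, Query, QueryTime) and the file name are assumptions made for illustration, not a description of AOL’s internal systems.

```python
# Minimal sketch: grouping an AOL-style query log by anonymous user ID.
# Assumes a tab-separated file with header columns AnonID, Query, QueryTime
# (an assumption made for illustration purposes).
import csv
from collections import defaultdict

def load_histories(path):
    histories = defaultdict(list)              # AnonID -> list of (time, query)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            histories[row["AnonID"]].append((row["QueryTime"], row["Query"]))
    for queries in histories.values():         # chronological order, so that
        queries.sort()                         # sessions can be read off
    return histories

if __name__ == "__main__":
    histories = load_histories("query_log.tsv")    # hypothetical file name
    for time, query in histories.get("98280", []):
        print(time, query)
```

Once the queries of a single ID are laid out in chronological order, the sessions described below become visible simply by reading down the list.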

[+] User 98280. Let us follow one particular user, just to see what information can be obtained. User 98280 is probably really a couple (two users). There seems to be an abusive male – possibly addicted to cocaine – and a pregnant female, probably from Texas. The query log reveals a series of sessions – at different times – alternating in topic from ‘ovulation calculator’, to ‘pregnancy calendar’, to ‘first trimester of pregnancy’, to ‘effect of addictions on foetus’; with totally different queries for ‘girls gone wild’ and ‘fine black girls’; and – sadly – we also see queries like ‘dealing with spouse that has bipolar disorder’, ‘spouse is cruel to everyone in family including animals’, ‘coping with abusive spouse’, ‘prayers for relationship problems’, ‘healing prayers for people with bipolar disorder’ and ‘is bipolar disorder hereditary’.

Slowly the story of 3 months of a pregnancy and a couple with serious problems unfolds in front of our eyes; a sad story of anxieties and problems that were never meant for public display. These searches reveal the most intimate details and anxieties, and even the intentions, of an individual who is mistakenly under the impression of being in total privacy.

[+] User 4417749. The New York Times tracked down some users, by exploiting ‘self searches’ and other give-away information. One of these users agreed to be interviewed and named: user 4417749, or Thelma Arnold, a 62-year-old widow from Georgia. She had run a series of searches for the names of her family members and for her neighbourhood, so it was easy for reporters to track her down and interview her. Of course, others could be identified, but did not want to be named. The issue of naming users is really a false one, as most search engines have that information anyway, if you just log in once. Furthermore, other searches are connected to your name or your address, and those can allow you to be identified even when you are logged out. What is important here is how many intimate details we are prepared to reveal to a search engine…

[+] User 1227287. There are also problem users, like 1227287, who is searching for bomb making instructions, as well as related information.

[+] User 927. One particularly disturbing search history – at the other end of the spectrum – is that of user 927, who appears to be a very sick person, with an obsession with child abuse. A theatre play was even named after this user. How typical are 927’s queries? One could answer by performing a complete analysis of the 20 million entries of the file, to identify the most and least typical users. There are difficult questions about the legal responsibilities of any analyst becoming aware of such search patterns. What should an analyst do, if likely criminal activity is suspected?

[+] User 637988. The question is posed very well by AOL user 637988 with the query “what is the ethical responsibility of the therapist when a student mentions suicide”.

If a person were to share with their analyst or doctor their most disturbing fantasies, would the doctor be expected to call the police? In which cases? It does look like search engines are used as confessors, as counsellors, as doctors… The expectation of privacy is therefore a key issue to address. The data collected by a search engine amount to a catalogue of intentions, mixed with a catalogue of fantasies and a catalogue of curiosities, all bundled together.

In the example of the AOL query log, we only have anonymous search queries over a 3-month period, and yet we can discover so much about the intimate life of so many people. But search engines keep data for much longer than three months. They often also have the names of their users, as well as their addresses and banking details, if they provide e-shopping services.

Furthermore, much personal data is today easily available to various organisations. Some of it is private and is gathered as a routine part of business. Other data are publicly available, or can be purchased (or rented) like any other commodity.

Do you want to buy a list of 10,000 alcohol-drinking, pet-owning, frequent flyers from the UK? Names and home addresses? Many companies can help you. Consumer Response, for example, will charge you a basic rate of 1,700 pounds for a list, plus extras for each additional attribute of the consumers you want to purchase. The company boasts a database of 40 million individuals, in 22 million households in the UK.

What would happen if someone could aggregate and automatically analyse all of your phone, bank, web and email transactions? Of course, they are all collected separately. Would their integration be different from just the sum of these parts? This is where modern Pattern Analysis by means of intelligent software starts making a game-changing contribution. The combination of statistics, artificial intelligence and efficient algorithms can detect subtle trends, anomalies and patterns, and also make predictions about future behaviour.

 

Enter Google

In July 2008 we had another taster of things to come, when a US court ordered Google to give the log files of YouTube to Viacom, as part of a copyright infringement lawsuit. We realised that we do not own that data, while Google does, and can give it away if it sees fit. How about data kept by eBay, Tesco, and so on?

The log file, which was to be handed to Viacom, contains the login ID of users, their computer IP address (an online identifier) and video file details. Later a settlement between the parties was reached out of court, which (hopefully) avoided this handover of data. Until the next challenge…

Besides the clear demonstration that we neither own nor control our trail in transaction space, this incident compels us to think about another – uncomfortable – question. What portrait of its users could be built at Google?

The corporate mission of Google is “to organize the world’s information and make it universally accessible and useful”. Already it provides (for free) web search, email, book search, calendar, videos, online document storage, online photograph storage, and much more. For many of these services you have to give them your name and email address, and for the services you pay for, you also need to give your address and banking information.

Collecting all the information in the world under the same roof is a powerful concept, but it can cut in many ways. While we can be comfortable with having a unique ID number at the online chess club, and also another unique ID at the recruitment agency, we may be very uneasy about linking the two, enabling a flow of information between those two parts of our life that we would like to keep separate.

Do we have the right to control who has access to which aspect of our personal information?

Google has direct access to our news preferences, web searches, emails, calendar, favourite online videos, and so on. So it can be used as an easy example to understand something that has been going on for a long time – in a much less visible way: the practice of collecting and trading personal information by direct marketing and credit scoring agencies, among others. Of course, data collection is also done by law enforcement, for entirely different reasons, and with a focus on different data.

Could Google (or any analogous company) slowly turn into a Big Brother, keeping track of its users, and deciding what information these users will even become aware of?

Online logs are only a small slice of what is collected and analysed every single day. Of course, every single telephone call is logged, and mobile phone logs include location information too, as well as time, caller-receiver, duration, etc. Bank transactions are no different. Airplane tickets are the same.

As mentioned, exploiting information involves at least four steps: gathering, storing, processing and acting upon. Each of these steps has been automated in the past decade.

The combination of multiple sources of information, for later analysis by computers, is the third of the four new ingredients that are changing the privacy equation in the digital age.

Google has also acquired a company, DoubleClick, whose business is to track the behaviour of users over multiple partner websites. Connecting the behaviour of a user when shopping for holidays with the behaviour of the same user when reading the news, or searching for a house, can multiply the power of the inferences that can be drawn about them.

 

Inferences

The analysis of personal information in large datasets is a powerful way to make predictions, or detect obvious anomalies, and the aggregation and fusion of multiple data sources makes the approach much more effective.

The next step is that of inference: using computers and statistics to spot subtle trends, to draw conclusions based on the data, to make predictions. So I might not have any information about your ethnic background, but I might have your postcode, which I can use to make an educated guess. In call centres, the operator can have a clear assessment of your economic status just based on your postal data, and adjust their behaviour accordingly. Credit scoring agencies sell an index, representing roughly how risky you are as a debtor.

Even non-personal information can reveal a lot about us, and this is also something we do not easily reason about. People ‘like us’ behave a lot like us, and so accurate inferences are possible based on information about them. This is how our postcode provides accurate information about the location and census area of a small group of households, which can then easily be connected to demographic and economic census information; yet we reveal it happily to any shopkeeper who requests it.

If we check the preferences of people living near the Watershed cinema in Bristol, where this talk is taking place, by using upmystreet.com, we find that “many of the people who live in this sort of postcode [BS1 5TX] will be cosmopolitan sharers or students living in flats”. We also find that they are likely to have a high interest in current affairs, and that they tend to be young, single people renting small one or two bedroom flats. While news may be followed online, they are also readers of The Guardian, Observer, Independent and The Times.

If we change postcode, and we move just a short distance from here [BS2 9JN], we see a different picture. There they rent their small, one and two bedroom flats from the council and housing associations, and there is a high degree of overcrowding. Unemployment levels are high and a significant proportion of the population have been looking for work for some time. Employment tends to be in low skilled occupations and incomes are low.

So much of our life is written in our postcode. And yet we give it away to anyone who asks. When we interview prospective employees, we are very careful never to ask any personal question, in order to avoid discrimination, but we may well know their postcode.
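As a toy illustration of such area-level inference (not any specific commercial product), the following sketch joins invented customer records to an equally invented table of postcode-level profiles. Real systems key census and commercial data on small statistical areas, but the mechanics of the join are no more complicated than this.

```python
# Toy sketch: enriching customer records with area-level profiles via postcode.
# All names, postcodes and profile values below are invented for illustration.
census_by_postcode = {
    "BS1 5TX": {"profile": "cosmopolitan sharers / students", "income": "low-to-mid"},
    "BS2 9JN": {"profile": "social renting, high unemployment", "income": "low"},
}

customers = [
    {"name": "A. Smith", "postcode": "BS1 5TX"},
    {"name": "B. Jones", "postcode": "BS2 9JN"},
]

def enrich(customers, census):
    """Attach the area profile of each customer's postcode, if known."""
    for c in customers:
        c.update(census.get(c["postcode"], {"profile": "unknown"}))
    return customers

for c in enrich(customers, census_by_postcode):
    print(c["name"], "->", c["profile"])
```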

Inferences drawn based on geographic area are very powerful, but marketing people can do even more than that.

 

Political Profiling

In the 2008 US presidential campaign, both candidates made use of voter databases. These are systems based on direct-marketing technology, where information from the electoral rolls is merged with commercially available databases of consumers created for marketing purposes. Of course voters are also consumers, are also in the credit scoring system, and are also in many other databases that can essentially be bought or rented for a fee.

So both parties have created their own systems: “VoteBuilder” for the Democrats, and “VoterVault” for the Republicans. The Chief Technology Officer of Catalist, another voter-data firm working for progressive organisations, is Vijay Ravindran, a veteran of Amazon.com.

It works like this. A large sample of voters is interviewed about their opinions and concerns with respect to the upcoming election. They are then profiled using the available commercial data. A data-mining model is built to link consumer profiles to voter opinions. Finally, the model is applied to the entire population of voters for whom the commercial data are available, and their most likely opinions and concerns are predicted.
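A minimal sketch of this pipeline, written with scikit-learn and entirely invented data, might look as follows; the consumer attributes and concern labels are illustrative only, not a description of any real voter database.

```python
# Sketch of the micro-targeting pipeline: survey a sample, profile it with
# consumer data, fit a model, then score the full voter file.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# 1. Surveyed voters: consumer profile plus the concern they reported.
surveyed_profiles = [
    {"magazine": "The Economist", "store": "organic deli", "owns_home": "yes"},
    {"magazine": "none", "store": "Costco", "owns_home": "no"},
    {"magazine": "none", "store": "Costco", "owns_home": "yes"},
    {"magazine": "The Economist", "store": "organic deli", "owns_home": "yes"},
]
surveyed_concerns = ["taxes", "childcare", "healthcare", "taxes"]

# 2. Turn profiles into feature vectors and fit a simple classifier.
vec = DictVectorizer()
X = vec.fit_transform(surveyed_profiles)
model = LogisticRegression(max_iter=1000).fit(X, surveyed_concerns)

# 3. Score the full voter file (for which only consumer data exist).
voter_file = [
    {"magazine": "none", "store": "Costco", "owns_home": "no"},
    {"magazine": "The Economist", "store": "organic deli", "owns_home": "no"},
]
for voter, concern in zip(voter_file, model.predict(vec.transform(voter_file))):
    print(voter, "-> most likely concern:", concern)
```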

In this way, one can identify not only good targets for direct mailing or phone calls, but also the most appropriate message. So even two people living in the same neighbourhood and of a similar demographic type can receive entirely different messages from the same campaign: just the message that each is most likely to respond to. They can be asked to vote for the same candidate, but for two entirely different reasons.

A single mother working part time and shopping at Costco will have different political preferences from a high-flying executive subscribing to The Economist. More than that: she will respond to different messages. Demographics, and even location, may not allow one to distinguish between two such voters in such fine detail.

This is called micro-targeting, and apparently it makes a big difference in modern campaigns. It allows campaigns to tailor their messages down to the household level.

While using consumer data in business is hardly new, its application to political campaigning was an innovation of the 2004 Bush re-election campaign.

 

Computer Voyeur

The fourth aspect that is changing the way we should think about privacy concerns the usage of personal information. In other words: after gathering, storing and analysing the information, one could argue that the real damage to citizens depends on how that information is used. As soon as a human analyst, or a neighbour, becomes aware of our medical or economic or other information, we could argue, we suffer a damage to our reputation. We may also be targeted by marketing, or lose employment opportunities, and so on. But what if humans are never in the loop?

What if it is an algorithm that reads all our personal information, and makes the decisions? What if our personal information is used only by algorithms? Would we be embarrassed by an Artificial Intelligence agent ‘knowing’ the most intimate aspects of our life? Would you prefer a person or software to see your personal information, or would it be the same for you?

This is already happening, with Google Mail reading each of our emails, and selecting targeted ads to display next to each of them, based on keyword matches. On the one hand, we know that no person has read our email. On the other hand, we know that software is reading each line of it, in order to select the ads.
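A toy sketch of keyword-matched advertising conveys the mechanism; this is not Google’s actual system, and the ad inventory and keywords below are invented.

```python
# Toy keyword matching: pick the ad whose keyword set overlaps most
# with the words of the email. Inventory and keywords are invented.
ADS = {
    "Cheap flights to Rome": {"flight", "rome", "travel", "holiday"},
    "Dermatologist near you": {"skin", "rash", "dermatologist"},
    "Party supplies online": {"party", "birthday", "balloons", "surprise"},
}

def pick_ad(email_text):
    words = set(email_text.lower().split())
    best = max(ADS, key=lambda ad: len(ADS[ad] & words))
    return best if ADS[best] & words else None    # None if nothing matches

print(pick_ad("Let us organise a surprise party for Anna's birthday"))
```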

If we worry about the judgment of people we will never see, and we justify our need for privacy with the need to avoid this judgment, then we should prefer the machine. If we worry instead about the usage of personal information – not the judgment – then it should make no difference to us.

The fact that we are having a hard time thinking about this new development shows how fast things are changing in privacy protection, and how hard it will be for laws to follow.

Could we see customised electoral messages based on the opinion content of our emails? Technically this is very feasible, and it is already done for products other than candidates. Would this be legal? Probably. Would we find it acceptable?

 

Tomorrow’s World

Today’s technology allows us to collect and exploit a vast amount of diverse data about individuals and groups. Much of the data exploited is collected for commercial purposes, and some of it is even public information, yet the combination and analysis of all these data creates a very intimate and personal portrait of our life. While this might enable increased customisation of services, and one can always imagine emergencies that require access to all sorts of information, we need to keep in mind that we are venturing into a completely unexplored world – and there is no going back.

We are creating a new type of society, where the notion of privacy is very different from what we are used to and therefore expect. The conceptual framework we use to think about personal data, anonymity and analysis of behavioural patterns is changing because the situation on the ground has changed. Furthermore, current laws are based on concepts that no longer apply to the current situation.

It is interesting that, while we are becoming increasingly subjected to data surveillance, every effort is made to preserve the illusion of privacy and autonomy. We never see log files, most citizens do not know about credit scoring, and so on. It seems that we would be most disturbed if we lost this delusion of living a private and autonomous existence. Rather than creating a private and autonomous existence, we are creating the illusion of one.

As we sleepwalk irreversibly into this new world, we should develop concepts, laws and values to help us exploit all that information technology has to offer, without creating a nightmare for our children. It is our job as scientists to understand the implications of what we are doing, and it is our job to explain to the public and to lawmakers where our work can lead us.

 

References and Further Reading

 

– BONUS SECTION –

Law Enforcement

And how about law enforcement? They have access to the same technology and to more data than anyone else. So the general discussion of what is possible in terms of behavioural modelling applies to this domain too. But this is where an engineer should stop talking, and law-makers should start.

The House of Lords issued a report, “Surveillance: Citizens and the State” claiming that Britain leads the world in the use of CCTV, with an estimated 4 million cameras, and in building a national DNA database including more than 7% of the population.

The news recently reported that UK police have the power to remotely search the contents of computers. It is technically possible. It has also been reported that searches of computers are allowed when entering the US.

And in separate news, the media have reported that the private sector will be asked to manage and run a communications database that will keep track of everyone’s calls, emails, texts and internet use, under a key option contained in a consultation paper to be published in early 2009 by Jacqui Smith, the home secretary.

Law enforcement is obviously very interested in surveillance of transaction space, as well as of physical space, and in connecting the two.

In the US, in 2002-2003, there was a programme called ‘Total Information Awareness’, aimed at integrating vast amounts of surveillance data in order to detect threats. The programme was discontinued due to concerns about it resulting in a mass surveillance system, although many of its components still exist as parts of different programmes. Its components included the analysis of social networks created from telecommunication data, ‘human identification at a distance’, text and speech analysis technology, data mining and various bio-surveillance activities.

The analysis of transactions is not the entire story. Surveillance technology allows information to be gathered in the streets – from security or traffic cameras – with automatic reading of registration plates and, to some limited extent, recognition of faces. Other traits, such as gait, are the object of intense research.

Then there is the entire topic of DNA collection, which has been the subject of a recent ruling by the European Court of Human Rights. We will leave this topic aside today.

What to do about it is a matter of civil rights adaptation for the next decade. Are privacy, anonymity, confidentiality and autonomy basic civil or human rights? Can they be waived under various conditions? What kind of society are we creating for the information age, and is the collective entity that we are forming going to begin oppressing us? Systems can arise without anyone actually designing them.

“My own hunch is that Big Brother, if he comes to the United States, will turn out to be not a greedy power-seeker but a relentless bureaucrat obsessed with efficiency” (Vance Packard, 1966).

 

[1] The legalities of this are not my concern; they will be discussed by Andrew Charlesworth. I am going to discuss only what is technically possible. This – in my opinion – should form the basis for any conceptual framework for thinking about personal information.

 

 

 

Scientific Method and Patterns in Data

How statistics married algorithmics and in the process changed the scientific method

Nello Cristianini

[Notes for my talk at “Information Beyond Shannon” held in Venice, December 29th 2008]

 

Introduction

In the summer of 1609, almost exactly 400 years ago, Galileo Galilei was here in Venice, trying to sell his telescope to the Doge in return for tenure. He had not really invented it: the telescope was the creation of the Dutch spectacle-maker Hans Lipperhey, of whose device Galileo had heard a description. He greatly improved it, and offered it to the Venetian fleet as an aid to navigation and the early detection of ships.

During 1609 Galileo perfected his lens grinding skills, experimenting with methods and designs. He created various models and analysed the principles behind optics. As a result of this investigation, the quality of the tool was greatly enhanced, and new designs became possible.

He could have started a business making telescopes, or magnifying glasses, or spectacles. He could have been satisfied with the wage he received from the Republic of Venice.

But Galileo was a scientist, not just a tool maker. Although he did design, create, and test some of the best tools of his time, he was not just concerned with the engineering aspects of his work, and the commercial opportunities. As a true scientist, he was interested in understanding the world around him, something that would get him into trouble more than once.

In the summer of 1609, at age 45, he turned the telescope to the sky, and started his investigation of the Moon. He discovered mountains and valleys, by observing their changing shadows. Most importantly he discovered that the Moon – contrary to Aristotle’s opinion – was not a perfect sphere. Something was wrong with the established model of the Universe.

Later on he discovered with the telescope that Jupiter was orbited by four moons, which showed that, in at least one case, things did not revolve around the Earth. Then with the same tool he discovered that Venus has phases, just like our Moon.

In fact, he realised, Aristotle was wrong: the Earth, Venus and Jupiter orbited the Sun, and the Moon orbited the Earth, just as the four moons of Jupiter orbited their planet. And the Moon – at least – was not a perfect sphere, but had mountains, and he could measure their height by measuring their shadows, and predict which of them would come out of the dark first, every month. What he had been taught was wrong.

His work was published remarkably quickly, in March 1610, in a short booklet entitled “Sidereus Nuncius” (Starry Messenger). This work was important not because it had direct implications for how we did things on Earth – although it had those too – but because it was eventually responsible for a fundamental revolution in our thinking. Its implications were theological, and landed him in trouble with the Church, among other things. His observations forced him to question the received wisdom, and this is always an act of challenge, although one that is expected of scientists. These implications were also philosophical, and methodological.

In fact, that was a very early example of modern systematic scientific investigation. A scientific instrument was used to make observations, mathematical relations were derived for the geometry on the Moon, and predictions were used as a way to validate the models.

For this and many other contributions, Galileo is associated with a major shift in scientific method, which is the very topic I would like to discuss today.

 

Scientific Method.

The systematic way we use to derive and represent unambiguous knowledge, so that it has predictive and explanatory power over the world, is a major achievement of our culture. Not all cultures focused on a systematic approach to knowledge acquisition and revision; consider, for example, the Romans. There are many ways of knowing the world, and this is a systematic, organised process to produce knowledge that is reliable, and to remove that which is not.

Over the centuries, we have started gathering knowledge in an organised process, involving a cycle of experiment design and hypothesis generation, representing the results – wherever possible – in unambiguous mathematical terms. This has been the accepted way in which we do science for the past few centuries, but is not the only possible way.

In fact, the scientific method has been in constant evolution for a long time. The same can be said of the practices we follow as a research community, with anonymous peer review and publication of results a crucial part of the ritual of science.

Observations lead to competing models, which lead to experiments, whose outcomes are used to revise the current models, which in turn suggest new experiments, and so on, in a loop. The discovery of the laws of mechanics can be seen in this light, with competing intuitions about masses, accelerations and frictions leading to key experiments. In most cases, these loops are much more complex and interconnected, but the iterative nature of the modelling process is often very visible.

But things are changing fast. Now the process is going through a sort of “industrial revolution”.

 

Data are gathered automatically, by computers or even robots, effectively acting as massive measurement apparatuses, replacing what were for Galileo the thermometer or the clock. Increased accuracy and ubiquity of measurement devices result in vast repositories of experimental data, stored in dedicated disk farms.

We can look at the examples of Physics, Molecular Biology, Drug Design and Astronomy. They all exemplify the same trend in science.

The Large Hadron Collider at CERN is a machine designed to produce experimental data, potentially 15 petabytes per year. The engineering challenges in producing, storing and managing this amount of information have reached epic proportions. But it is the analysis of these data that is truly mind-boggling. And this experiment can be seen – in a way – as the direct descendant of the physical experiments initiated 400 years ago by Galileo: the systematic investigation of the basic laws of nature has led us to this point.

Similar challenges are encountered by today’s biology. The direct descendants of Mendel’s painstaking collection of genetic inheritance data are experiments aimed at the full sequencing of thousands of genomes at once. Terabytes of data are now produced by each of the new generation of sequencing machines, and the Sanger Centre in Cambridge is now working on the 1000 Genomes Project. Hundreds of species have been fully sequenced, and we are well down the road of comparing multiple complete sequences within the same species.

In drug design, it is standard to test whether compounds bind to a given target by exhaustively screening entire libraries of chemicals with robots, in what is called combinatorial chemistry and high-throughput screening. Hundreds of thousands of compounds can be generated and tested, either by using robotics or – increasingly – by computer simulation, in what is essentially a survey of entire regions of chemical space, hunting for compounds with a given set of properties.

Astronomy – another child of Galileo’s – is now done by automatic surveys of the night sky run by computers, and by subsequent automatic analysis of the images and data gathered in this way. One such project, the Sloan Digital Sky Survey (SDSS), created a 5-wavelength catalogue covering over 8,000 square degrees of the sky and containing about 200 million objects, each described by hundreds of features (the data were released incrementally to the public).

The SDSS used a dedicated 2.5-metre telescope at Apache Point Observatory, New Mexico. Its 120-megapixel camera imaged 1.5 square degrees of sky at a time, about eight times the area of the full Moon. A pair of spectrographs fed by optical fibres measured the spectra of (and hence distances to) more than 600 galaxies and quasars in a single observation.

The database generated over eight years by this automated survey is several terabytes in size, presenting serious challenges for data management and mining.

In fact, this point is the key to this discussion, and is common to all the examples mentioned above. There is no way that people can analyse the data produced at the LHC, at the Sanger Centre, or by sky surveys. Such experiments can only be conceived because we can rely on computers to do the analysis of the data for us.

And this is the point I want to make: our scientific method has changed. The revolution is not a matter of detail, or even quantity. It is a matter of quality. We have industrialised both the production and the analysis of experimental data. We have industrialised the generation of scientific knowledge.

The automatic analysis of patterns in data, and the automatic generation of hypotheses, are fundamental parts of science. This is how computer science, statistics, and perhaps artificial intelligence, are finding their way to the core of all science, and to the core of how we know our world. This is how they find themselves at the centre of a revolution that will have significant consequences.

 

A Newer Method.

The automatic analysis of data, in search of significant – if elusive – patterns, is now a key part of many scientific experiments, and this is an increasing trend.

Statistics and computer science, and the convergence of dozens of smaller disciplines, create a conceptual and technical framework and body of knowledge that I call Pattern Analysis. It includes tools to extract significant information from networks, images, strings, text, bio-sequences, vectors, time series…

The information created and manipulated by machines today is not the same information we study in our introductory courses of Information Theory.

We may think that the process of scientific discovery will not be fully automated until machines are able to generate complete theories of a domain, with their formalism and equations.

This deserves two fundamental responses: 1) this is not out of reach for machines; 2) it is not necessary for machines to be doing science.

I will focus only on point 2. We all think that the output of a scientific investigation such as Einstein’s should be a set of equations, and their interpretation, that can be used to work out predictions or models, for specific outcomes and specific experiments. We focus a lot on analytic manipulations of these general equations, as an example of abstract knowledge manipulations.

But this is not necessary to science. The output of the scientific process does not need to be a set of differential equations – although this is what we have come to expect from physics. These equations are useful only when they are applied to particular systems and situations, and calculus is used to specialise them to those situations. Then simple mathematics is often used to make predictions about the behaviour of systems under various conditions.

Being able to make these predictions is the only reason why we have these equations, and we represent knowledge in the language of functions and equations because that was the best way to represent and compute in the past.

What if we had a computer that can make the same predictions without needing to start from a set of high level equations, but instead starting from a set of relations discovered in data?

Just as these equations derive their meaning from their use, one could argue that predictive patterns discovered in data could play a similar role.

Besides, it is quite possible for machines to summarise these patterns in compact theories, only to deduce them back when needed from the basic axioms. But would that be useful?

When in history did we start considering a phenomenon ‘explained’ once we had a few differential equations describing its dynamics? It surely must have started in mechanics, perhaps with Newton. But these equations are ultimately combined together, and with observations of initial conditions, in order to derive predictions. What if we could just derive predictions from initial conditions and knowledge that is represented in a different way?

Patterns extracted from data can reliably be used to make predictions – just ask Amazon.com or Google.com – without the need to formulate the knowledge contained in them in the form of differential equations or equivalent unified theoretical description.
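A toy example of this point, using invented observations: predict how far an object falls in a given time purely by interpolating between previously recorded (time, distance) pairs, without ever writing down the law d = g t^2 / 2.

```python
# Toy illustration: predictions directly from observed data, with no
# explicit governing equation. The observations below are simulated.
import bisect

observations = [        # (drop time in s, distance fallen in m)
    (0.5, 1.2), (1.0, 4.9), (1.5, 11.0), (2.0, 19.6), (2.5, 30.7),
]

def predict_distance(t):
    """Linearly interpolate between the two nearest observations."""
    times = [obs[0] for obs in observations]
    i = bisect.bisect_left(times, t)
    lo = observations[max(i - 1, 0)]
    hi = observations[min(i, len(observations) - 1)]
    if hi[0] == lo[0]:
        return lo[1]
    return lo[1] + (hi[1] - lo[1]) * (t - lo[0]) / (hi[0] - lo[0])

print(predict_distance(1.75))   # close to what d = g*t**2/2 would give
```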

We could even mention the role of large scale computer simulations, in this respect, as a way to both make predictions and to test hypotheses.

But what matters is that at the centre of this paradigm shift is our capability to gather, store, manage and analyse massive amounts of data automatically. And this is the permanent marriage between statistics and computer science – and many other sub-disciplines – that we are discussing today.

These tools, just like Galileo’s telescope, were often not developed for doing science, but for doing business. But just like Galileo, we can turn them around and use them to change the way we understand our world.

And the fact that we are using off-the-shelf hardware to produce data, and to manage and store it, and we are using commercial software to analyse it, can only signal that further accelerations are to be expected, as costs are driven down.

 

Media Analysis.

My little research group makes extensive use of pattern analysis technologies that originated in practical or industrial applications. We enjoy developing them. But we do it for a scientific reason too. We are currently experiencing our share of new science, by turning our attention to another type of sphere, much as Galileo did with the Moon. We are looking at the contents of the global media-sphere, to understand how something so common and under everybody’s eyes actually works. This just cannot be done without the right tools.

We are interested in observing (and modelling) how ideas flow and interact, as they traverse the media system; every outlet can pick and choose whichever news it wants to carry; each user can choose whichever outlet they want to read; complex dynamics regulates the resulting process of information selection and diffusion; but simple patterns emerge, if we look in the right place.

But reading the contents of the mediasphere, for a machine, means being able to understand human language – at least to some extent. And this is a totally new ingredient that we can add to the mix today: machines can actually read and understand text. We are translating every day from 22 languages, and we are reading 1,100 news outlets. We have found 450K named entities, for example, exhibiting a perfect power law of popularity, and interesting relations such as a three-fold greater interest in the Pope in Spanish-language media than in English-language media.
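The power-law observation itself amounts to a very simple computation once the entity counts are available: rank the entities by number of mentions and fit a straight line to log(count) against log(rank). The counts below are invented stand-ins for the real ones.

```python
# Sketch: estimating a power-law exponent from (invented) entity mention
# counts via a least-squares fit on the log-log rank-frequency plot.
import math

counts = sorted([120000, 45000, 30000, 9000, 4000, 1500, 600, 200, 50, 12],
                reverse=True)
xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
ys = [math.log(c) for c in counts]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
print("estimated power-law exponent:", round(slope, 2))
```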

We are detecting text re-use, with massive-scale implementations of suffix trees, and tracking memes as they spread through the 10K outlets forming the global media-sphere. We are recreating social networks, and detecting biases in the choice of topics and words in various types of outlets. We even measure readability.
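As a much simplified stand-in for the suffix-tree machinery, the idea of text re-use detection can be conveyed by comparing the sets of word n-grams (‘shingles’) of two articles; the snippets below are invented.

```python
# Simplified text re-use check: a large overlap between word 8-gram sets
# signals copied or lightly edited text. The snippets are invented.
def shingles(text, n=8):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def reuse_score(a, b, n=8):
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))

wire = ("The prime minister announced a new economic stimulus package "
        "on Monday aimed at small businesses across the country")
outlet = ("In a speech the prime minister announced a new economic stimulus "
          "package on Monday aimed at small businesses, officials said")
print(round(reuse_score(wire, outlet), 2))
```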

Social scientists have been interested in understanding the media system for decades, but their investigations could only be performed by hand, on a limited number of outlets, time spans and topics. True constant monitoring of all outlets and all topics in all languages is now within reach, and automatic analysis tools are becoming available.

 

Publishing.

But there is even more in store, when it comes to changing our scientific practice. The data revolution is also changing the way we publish our results. Peer review has been around for a few centuries, and has been a very important tool for the objective and reliable publication of results. But it is only one of the many ways in which this can be done. PageRank-style scores, for example, could be used to assess the significance of contributions.
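For concreteness, here is a minimal sketch of a PageRank-style score computed over a small, invented citation graph; contributions cited by highly scored contributions score higher themselves. This is only an illustration of the idea, not a proposal for an actual review system.

```python
# Minimal PageRank-style scoring of an invented citation graph.
def pagerank(citations, damping=0.85, iterations=50):
    """citations: dict mapping each paper to the list of papers it cites."""
    papers = list(citations)
    score = {p: 1.0 / len(papers) for p in papers}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(papers) for p in papers}
        for p, cited in citations.items():
            targets = cited if cited else papers   # dangling papers spread evenly
            for q in targets:
                new[q] += damping * score[p] / len(targets)
        score = new
    return score

citations = {   # invented example
    "paper A": ["paper C"],
    "paper B": ["paper A", "paper C"],
    "paper C": [],
    "paper D": ["paper A", "paper C"],
}
for paper, s in sorted(pagerank(citations).items(), key=lambda kv: -kv[1]):
    print(paper, round(s, 3))
```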

In 2005, a couple of years before being lost at sea, Turing Award winner Jim Gray was touring the US giving his lecture on the future of science, which he saw as e-science. He gave his talk at UC Davis when I was there, and I was impressed by how he managed to connect what is happening in geography to what is happening in biology and every other science. “Any science X now has a computational-X version”, he said, and this affects the way we gather and analyse data, as well as the way in which we share data and results.

The examples of PubMed and GenBank could be followed by other sciences in the future, with a tight integration of results, data and methods, shared and globally forming a single unified resource.

 

Conclusions.

The scientific method is today evolving faster than ever. The automation, systematisation and industrialisation of information gathering and analysis, are accelerating the rate at which we expand our knowledge of the world. Machines now produce knowledge about our very own biology. The proportions of this transition should not be underestimated, and the science of patterns, information and knowledge is at the centre of this storm.

Galileo Galilei could have kept on making hi-tech tools and gadgets, and would certainly have found enough customers to make a comfortable living. But he was a scientist, and he used them to understand the world around him. In the process he used mathematical representations of the laws that he discovered, experiments to gather data, and overall very modern methodology. He also got into trouble with the authorities, because he refused to keep his telescope aimed low enough, and refused to ignore what he saw with it.

A new generation of scientists, with a new generation of tools, can now do the same, and gather unprecedented types of data, and draw far reaching conclusions about our world. The automatic collection of data in genomics, chemistry, astronomy, physics and also the social sciences, will revolutionise the way we see our world, and will further an understanding of it as a single interconnected system.

But we also need to keep in mind another aspect of Information Beyond Shannon: the same combination of novel data gathering and data generation techniques is not just changing the scientific method, but also our very notion of privacy. Surveillance technology, combined with data analysis, forms a powerful and unsettling mix. The costs are now so low, and the technology so widespread, that we have to conclude that it is not just the way we do science that is changing. Some more fundamental notions, involving our individual rights, are also evolving fast, and law-makers have not been able to keep up.

As we develop newer and stronger and cleverer tools to squeeze information out of inexpensive databases formed by today’s technology, we should also feel an obligation to develop a strong conceptual framework to think about privacy and possible abuses of this technology.

 
