Monthly Archives: May 2009

Patterns in Personal Data

The Privacy Delusion

When we play computer-chess online, we do not expect our intelligence to be measured and compared with our school records, and we certainly do not expect those records to be sold to recruitment agencies.

As we shop for holidays or extravagant shoes, we do not expect this information to be given to debt collection agencies or credit-scoring organisations. After all, our everyday world does not work like that. Does it?

When we buy a newspaper, no one knows what articles we end up reading. Certain information is just for ourselves, like the name of the girl we liked at school or our passion for Abba music. Whether we exchange emails to organise a surprise party, search for information about a skin condition that is worrying us, or just take a walk on the beach, we do not expect our activities to be monitored. Do we?

What we do expect today is that a certain part of our life will be kept private, if for no other reason than because there are so many of us that we cannot imagine anyone making the effort to gather information about all of those things for every person. We literally hide in the crowd, relying on numbers – if not the decency of the others – to protect a little part of our personal sphere. These expectations might be misleading today or tomorrow, as the world is changing fast.

In order to be useful, information needs to be gathered, stored or transmitted, then processed and finally acted upon. Every step of this chain has undergone major transformations in the past decades, and can now be done automatically, cheaply, and very efficiently, by machines. Taken together, these technological advances have enabled a revolution in our society, but at the same time they can pose a threat to the privacy of our personal sphere. What information can be automatically gathered today (we discuss technical aspects here [1]) and how much about our personal lives can be inferred from it?

Let us take our love affair with electronic transactions. These include the use of debit cards to buy even a coffee, the automatic payment of our salaries into our accounts, the use of mobile telephones, the use of the internet to plan journeys and buy tickets, and so on.

As we conduct our life in this new and empowering digital age, we leave behind a permanent trail of personal data that is never deleted, and that is instead carefully analysed, and even traded, to model our behaviour and – in some cases – to try to influence it. Never before has so much data been collected about so many people.

This talk takes a brief look at the surveillance of transaction space, from the point of view of “what is technically possible”.

Whether these data are used by humans or machines, and whether this distinction matters, will be discussed below. Let us have a first peek at a region of transaction space.


The AOL Case Study

Although most users will never be aware of it, it is standard practice for all search engines to gather and analyse a log file of all the queries that each user has made. Together with the content of the query and the time stamp, information is collected that allows analysts to identify the machine from which the queries were made (cookies, IP addresses, in some cases user login information). This is done for a variety of technical reasons, but its potential was soon noticed.

Most end users will never be aware of the existence of log files, and indeed, it is not common even for researchers to have access to this information, as it is sensitive both for its commercial value and for its privacy implications.

But on August 4, 2006, AOL Research labs released a file on one of its websites containing the search log for over 650,000 users over a 3-month period, intended for research purposes. It contained about 20 million search keywords. All the transactions were anonymous, but each user was identified by a unique ID number, so that it was possible to connect queries performed by the same person, but not to identify the person.
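To make the role of the unique ID concrete, here is a minimal Python sketch of how queries can be regrouped per user once they share an identifier. The record layout (an ID paired with a query string) is an assumption made for illustration, not the layout of the released file, and user 11111 with its queries is invented. The queries listed for 98280 are among those quoted further down.

```python
from collections import defaultdict

# Hypothetical log rows: (anonymous user ID, query text).
# The layout is an assumption; user 11111 and its queries are invented.
log_rows = [
    (98280, "ovulation calculator"),
    (11111, "cheap flights"),
    (98280, "pregnancy calendar"),
    (98280, "coping with abusive spouse"),
    (11111, "weather forecast"),
]


def group_by_user(rows):
    """Collect every query issued under the same anonymous ID, in log order."""
    histories = defaultdict(list)
    for user_id, query in rows:
        histories[user_id].append(query)
    return dict(histories)


histories = group_by_user(log_rows)
for query in histories[98280]:
    print(query)
```

No name is needed for this step: the ID alone is enough to turn scattered queries into one continuous history.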

A few days later, acknowledging this was an error, AOL removed the file from public access, but the data are still available at various internet locations, for all of us to explore. Those responsible were later fired.

Besides triggering a series of news articles and also various lawsuits, this incident gives us a rare glimpse into the – often invisible – backroom of online businesses. It allows us to have direct experience of a rather theoretical concept: the trail we leave in transaction space every day and, more generally, the amount of data that are automatically generated about so many of our daily activities.

[+] User 98280. Let us follow one particular user, just to see what information can be obtained. User 98280 is probably really a couple (2 users). There seems to be an abusive male – possibly addicted to cocaine – and a pregnant female, probably from Texas. The query log reveals a series of sessions – at different times – alternating in topic from ‘ovulation calculator’, to ‘pregnancy calendar’, to ‘first trimester of pregnancy’ to ‘effect of addictions on foetus’; with totally different queries for ‘girls gone wild’ and ‘fine black girls’; and – sadly – we also see queries like ‘dealing with spouse that has bipolar disorder’ and ‘spouse is cruel to everyone in family including animals’ and ‘coping with abusive spouse’ and ‘prayers for relationship problems’ or ‘healing prayers for people with bipolar disorder’ and ‘is bipolar disorder hereditary’.

Slowly the story of 3 months of a pregnancy and a couple with serious problems unfolds in front of our eyes; a sad story of anxieties and problems that were never meant for public display. These searches reveal the most intimate details and anxieties, and even the intentions, of an individual who is mistakenly under the impression of being in total privacy.

[+] User 4417749. The New York Times tracked down some users by exploiting ‘self searches’ and other give-away information. One of these users agreed to be interviewed and named: user 4417749, or Thelma Arnold, a 62-year-old widow from Georgia. She ran a series of searches for the names of her family members, and for her neighbourhood, so that it was easy for reporters to track her down and interview her. Of course, others could be identified, but did not want to be named. But the issue of naming users is really a false one, as most search engines have the information anyway if you log in just once. Furthermore, other searches are connected to your name, or your address, and those can allow you to be identified even if they are made while you are logged out. What is important here is how many intimate details we are prepared to reveal to a search engine…

[+] User 1227287. There are also problem users, like 1227287, who is searching for bomb-making instructions, as well as related information.

[+] User 927. One particularly disturbing search history – at the other end of the spectrum – is that of user 927, who appears to be a very sick person, with an obsession for child abuse. A theatre play was even named after this user. How typical are 927’s queries? One could answer by performing a complete analysis of the 20 million entries of the file, to identify the most and least typical users. There are difficult questions about the legal responsibilities of any analyst who becomes aware of such search patterns. What should an analyst do, if likely criminal activity is suspected?

[+] User 637988. The question is posed very well by AOL user 637988 with the query “what is the ethical responsibility of the therapist when a student mentions suicide”.

If a person were to share with their analyst or doctor their own most disturbing fantasies, would the doctor be expected to call the police? In which cases? It does look like search engines are used as confessors, as counsellors, as doctors… The expectation of privacy is therefore a key issue to address. The data collected by a search engine amounts to a catalogue of intentions, mixed with a catalogue of fantasies, and a catalogue of curiosities, all bundled together.

In the example of the AOL query log, we only have anonymous search queries over a 3-month period, and yet we can discover so much about the intimate life of so many people. But search engines keep data for a much longer period than three months. They often also have the names of their users, as well as their addresses and banking details, if they provide e-shopping services.

Furthermore, much personal data is today easily available to various organisations. Some is private and is gathered as a routine part of business. Other data is publicly available, or can be purchased (or rented) like any other commodity.

Do you want to buy a list of 10,000 alcohol-drinking, pet-owning, frequent flyers from the UK? Names and home addresses? Many companies can help you. Consumer Response, for example, will charge you a basic rate of 1,700 pounds for a list, plus extras for each extra attribute of the consumers you want to purchase. The company boasts a database of 40 million individuals, in 22 million households in the UK.

What would happen if someone could aggregate and automatically analyse all of your phone, bank, web and email transactions? Of course, they are all collected separately. Would their integration be different from just the sum of these parts? This is where modern Pattern Analysis by means of intelligent software starts making a game-changing contribution. The combination of statistics, artificial intelligence, and efficient algorithms can detect subtle trends, anomalies and patterns, and can also make predictions about future behaviour.


Enter Google

In July 2008 we had another taster of things to come, when a US court ordered Google to give the log files of YouTube to Viacom, as part of a copyright infringement lawsuit. We realised that we do not own that data, while Google does, and can give it away if they see fit. How about data kept by eBay, Tesco, and so on?

The log file, which was to be handed to Viacom, contains the log-in ID of users, their computer IP address (online identifier) and video file details. Later a settlement between the parties was reached out of court, which (hopefully) avoided this handover of data. Until the next challenge…

Besides the clear demonstration that we neither own nor control our trail in transaction space, this incident compels us to think about another – uncomfortable – question. What would be the portrait of users that can be built at Google?

The corporate mission of Google is “to organise all the information in the world”. Already they provide (for free) web search, email, book search, calendar, videos, online document storage, online photo storage, and much more. For many of these services you have to give them your name and email address, and for the services you pay for, you also need to give your address and banking information.

Collecting all the information in the world under the same roof is a powerful concept, but it can cut in many ways. While we can be comfortable with having a unique ID number at the online chess club, and also another unique ID at the recruitment agency, we may be very uneasy about linking the two, enabling a flow of information between those two parts of our life that we would like to keep separate.

Do we have the right to control who has access to which aspect of our personal information?

Google has direct access to our news preferences, web searches, emails, calendar, favourite online videos, and so on. So it can be used as an easy example to understand something that has been going on for a long time – in a much less visible way: the practice of collecting and trading personal information by direct marketing and credit scoring agencies, among others. Of course, data collection is also done by law enforcement, for entirely different reasons, and with a focus on different data.

Could Google (or any analogous company) slowly turn into a Big Brother, keeping track of its users, and deciding what information these users will even become aware of?

Online logs are only a small slice of what is collected and analysed every single day. Of course, every single telephone call is logged, and mobile phone logs include location information too, as well as time, caller-receiver, duration, etc. Bank transactions are no different. Airplane tickets are the same.

As mentioned, exploiting information involves at least four steps: gathering, storing, processing and acting upon. Each of these steps has been automated in the past decade.

The combination of multiple sources of information, for later analysis by computers, is the third of the four new ingredients that are changing the privacy equation in the digital age.

Google has also acquired a company, DoubleClick, whose business is to track the behaviour of users over multiple partner websites. Connecting the behaviour of a user when shopping for holidays, with the behaviour of the same user when reading the news, or searching for a house, can multiply the power of the inferences that can be drawn about them.
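The linking step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two site logs, the cookie-style identifiers and the function name `link_profiles` are all invented, and real tracking systems are far more elaborate. The point is simply that a shared identifier is all that is needed to merge separate trails into one profile.

```python
# Two hypothetical logs kept by different partner websites, keyed by the
# same tracking ID (for example a cookie). All IDs and entries are invented.
holiday_site = {"cookie-17": ["beach resort", "family room"]}
news_site = {"cookie-17": ["mortgage rates"], "cookie-42": ["football"]}


def link_profiles(sources):
    """Merge per-site activity into one profile per shared identifier."""
    profiles = {}
    for site_name, log in sources.items():
        for tracking_id, events in log.items():
            # Each profile maps a site name to what was seen on that site.
            profiles.setdefault(tracking_id, {})[site_name] = list(events)
    return profiles


profiles = link_profiles({"holidays": holiday_site, "news": news_site})
print(profiles["cookie-17"])
```

Each log on its own says little; the merged entry for one identifier already hints at a family planning a holiday while looking at mortgages.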



The analysis of personal information in large datasets is a powerful way to make predictions, or detect obvious anomalies, and the aggregation and fusion of multiple data sources makes the approach much more effective.

The next step is that of inference: using computers and statistics to spot subtle trends, to draw conclusions based on the data, to make predictions. So I might not have any information about your ethnic background, but I might have your postcode, which I can use to make an educated guess. In call centres, the operator can have a clear assessment of your economic status just based on your postal data, and adjust their behaviour accordingly. Credit scoring agencies sell an index, representing roughly how risky you are as a debtor.

Even non-personal information can reveal a lot about us, and this is also something we do not easily reason about. People ‘like us’ behave a lot like us, and so accurate inferences are possible based on information about them. This is how our postcode provides accurate information about the location and census area of a small group of households, which can then be easily connected to demographic and economic census information, yet we reveal it happily to any shopkeeper who requests it.

If we check the preferences of people living near the Watershed Cinema of Bristol, where this talk is taking place, by using, we find that “many of the people who live in this sort of postcode [BS1 5TX] will be cosmopolitan sharers or students living in flats”. We also find that they are likely to have a high interest in current affairs, and they tend to be young, single people renting small one or two bedroom flats. While news may be followed online, they are also readers of The Guardian, Observer, Independent and The Times.

If we change postcode, and we move just a short distance from here [BS2 9JN], we see a different picture. There they rent their small, one and two bedroom flats from the council and housing associations, and there is a high degree of overcrowding. Unemployment levels are high and a significant proportion of the population have been looking for work for some time. Employment tends to be in low skilled occupations and incomes are low.
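At its simplest, this kind of inference is a table lookup. The Python sketch below is a toy under stated assumptions: the dictionary layout and function names are invented, and the two descriptions are shortened from the profiles quoted above, whereas commercial classification systems rely on much richer census and consumer data.

```python
# Area descriptions shortened from the two examples quoted in this talk.
# The dictionary layout is an assumption made for illustration.
AREA_PROFILES = {
    "BS1 5TX": "cosmopolitan sharers or students living in flats",
    "BS2 9JN": "council and housing association flats, high unemployment, low incomes",
}


def normalise(postcode):
    """Upper-case a UK postcode and put one space before its last three characters."""
    compact = "".join(postcode.split()).upper()
    return compact[:-3] + " " + compact[-3:]


def guess_profile(postcode):
    """Return the stored description for a postcode, if we have one."""
    return AREA_PROFILES.get(normalise(postcode), "unknown area")


print(guess_profile("bs15tx"))
print(guess_profile("BS2 9JN"))
```

Nothing about the individual is needed: the guess is made entirely from what is known about the neighbours.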

So much of our life is written in our postcode. And yet we give it away to anyone who asks. When we interview prospective employees, we are very careful never to ask any personal question, in order to avoid discrimination, but we may know people’s postcode.

Inferences drawn based on geographic area are very powerful, but marketing people can do even more than that.


Political Profiling

In the 2008 US presidential campaign, both candidates made use of voter databases. These are systems based on direct-marketing technology, where information from the electoral lists is merged with commercially available databases of consumers created for marketing purposes. Of course voters are also consumers, and are also in the credit scoring system, and are also in many other databases that can essentially be bought or rented for a fee.

So both parties have created their own systems: “VoteBuilder” for the Democrats, and “VoterVault” for the Republicans. The Chief Technical Officer of Catalist is Vijay Ravindran, a veteran of

It works like this. A large set of voters is interviewed about their opinions and concerns, with respect to the incoming election. Then they are profiled by using the available commerce data. A data-mining model is created, to link customer profiles to voter opinions. Finally, the model is applied to the entire population of voters, for which the commerce data is available, and their most likely opinions and concerns are predicted.
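The four steps can be sketched in Python as follows. This is a deliberately crude stand-in, not the method used by any campaign: the consumer attributes, the stated concerns and the simple vote-counting model are all assumptions chosen to keep the example short.

```python
from collections import Counter, defaultdict

# Steps 1 and 2: a small surveyed sample. Each entry pairs an invented
# consumer profile with the concern that voter reported (also invented).
surveyed = [
    ({"shops_at": "discount store", "reads": "local paper"}, "cost of living"),
    ({"shops_at": "discount store", "reads": "none"}, "cost of living"),
    ({"shops_at": "delicatessen", "reads": "business weekly"}, "taxation"),
    ({"shops_at": "delicatessen", "reads": "local paper"}, "taxation"),
]


def train(sample):
    """Step 3: count how often each attribute value occurs with each concern."""
    votes = defaultdict(Counter)
    for profile, concern in sample:
        for attribute, value in profile.items():
            votes[(attribute, value)][concern] += 1
    return votes


def predict(votes, profile):
    """Step 4: add up the counts for an unsurveyed profile, return the top concern."""
    tally = Counter()
    for attribute, value in profile.items():
        tally.update(votes.get((attribute, value), Counter()))
    if not tally:
        return None  # nothing in the profile was seen in the survey
    return tally.most_common(1)[0][0]


model = train(surveyed)
unsurveyed_voter = {"shops_at": "discount store", "reads": "business weekly"}
prediction = predict(model, unsurveyed_voter)
print(prediction)
```

The voter in the last lines was never interviewed, yet the model still assigns them a most likely concern from their shopping and reading habits alone.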

In this way, one can identify not only good targets for direct mailing or phone calls, but also the most appropriate message. So even two people living in the same neighbourhood and of a similar demographic type can receive entirely different messages from the same campaign. Just the message that they are most likely to respond to. They can be asked to vote for the same candidate, but for two entirely different reasons.

A single mother working part time and shopping at Costco will have different political preferences than a high-flying executive subscribing to The Economist. More: she will respond to different messages. Demographics and even location may not allow us to distinguish between two voters in such fine detail.

This is called micro-targeting, and apparently it makes a big difference in modern campaigns. It allows campaigns to tailor their messages down to the household level.

While using consumer data in business is hardly new, its application to political campaigning was an innovation of the 2004 Bush re-election campaign.


Computer Voyeur

The fourth aspect that is changing the way we should think about privacy concerns the usage of personal information. In other words: after gathering, storing, and analysing the information, one could argue that the real damage for the citizens depends on the usage of that information. As soon as a human analyst, or a neighbour, becomes aware of our medical or economic or other information, we could argue, we suffer damage to our reputation. We may also be targeted by marketing, or lose opportunities of employment, and so on. What if humans are never in the loop?

What if it is an algorithm that reads all our personal information, and makes the decisions? What if our personal information is used only by algorithms? Would we be embarrassed by an Artificial Intelligence agent ‘knowing’ the most intimate aspects of our life? Would you prefer a person or software to see your personal information, or would it be the same for you?

This is already happening, with Google Mail reading each of our emails, and selecting targeted ads to display next to each of them, based on keyword matches. On the one hand, we know that no person has read our email. On the other hand, we know that software is reading each line of it, in order to select the ads.
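Keyword matching of this kind needs very little machinery. The following Python sketch is an assumption-laden toy, not a description of how Google Mail actually works: the ad inventory, the keyword sets and the function name `select_ads` are invented for illustration.

```python
import re

# Hypothetical ad inventory: each ad is triggered by a set of keywords.
ADS = {
    "Discount flights to Rome": {"flight", "holiday", "rome"},
    "Running shoes sale": {"marathon", "running", "shoes"},
}


def select_ads(email_text, ads=ADS):
    """Return the ads sharing at least one keyword with the email, best match first."""
    words = set(re.findall(r"[a-z]+", email_text.lower()))
    scored = [(len(words & keywords), ad) for ad, keywords in ads.items()]
    return [ad for score, ad in sorted(scored, reverse=True) if score > 0]


chosen = select_ads("Shall we book a flight to Rome for the holiday?")
print(chosen)
```

No person reads the message at any point, yet every word of it is inspected in order to make the selection.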

If we worry about the judgment of people we will never see and we justify our need for privacy with the need to avoid this judgment, then we should prefer this. If we worry instead about the usage of personal information – not the judgment – then it should make no difference for us.

The fact that we are having a hard time thinking about this new development shows how fast things are changing in privacy protection, and how hard it will be for laws to follow.

Could we see customised electoral messages based on the opinion content of our emails? Technically this is very feasible, and it is already done for products other than candidates. Would this be legal? Probably. Would we find it acceptable?


Tomorrow’s World

Today’s technology allows us to collect and exploit a vast amount of diverse data about individuals and groups. Much of the data exploited is collected for commercial purposes, some of it is even public information, yet the combination and analysis of all these data creates a very intimate and personal portrait of our life. While this might enable increased user customisation of services and one can always imagine emergencies that require access to all sorts of information, we need to keep in mind that we are venturing into a completely unexplored world – and there is no going back.

We are creating a new type of society, where the notion of privacy is very different from what we are used to and therefore expect. The conceptual framework we use to think about personal data, anonymity and analysis of behavioural patterns is changing because the situation on the ground has changed. Furthermore, current laws are based on concepts that no longer apply to the current situation.

It is interesting that, while we are becoming increasingly subjected to data surveillance, every effort is made to preserve the illusion of privacy and autonomy. We never see log files, most citizens do not know about credit scoring, and so on. It seems that we would be most disturbed if we lost this delusion of living a private and autonomous existence. Rather than creating such an existence, we are creating the illusion of one.

As we sleep-walk irreversibly into this new world, we should develop concepts, laws and values, to help us exploit all that information technology has to offer us, without creating a nightmare for our children. It is our job as scientists to understand the implications of what we are doing, and it is our job to explain to the public and to lawmakers where our work can lead us.





Law Enforcement

And how about law enforcement? They have access to the same technology and to more data than anyone else. So the general discussion of what is possible in terms of behavioural modelling applies to this domain too. But this is where an engineer should stop talking, and law-makers should start.

The House of Lords issued a report, “Surveillance: Citizens and the State” claiming that Britain leads the world in the use of CCTV, with an estimated 4 million cameras, and in building a national DNA database including more than 7% of the population.

The news recently reported that UK police have the power to search the contents of computers remotely. It is technically possible. It has also been reported that the search of computers is allowed when entering the US.

And in separate news, the media have reported that the private sector will be asked to manage and run a communications database that will keep track of everyone’s calls, emails, texts and internet use under a key option contained in a consultation paper to be published in early 2009 by Jacqui Smith, the home secretary.

Law enforcement is obviously very interested in surveillance of transaction space, as well as of physical space, and in connecting the two.

In the US, in 2002-2003 there was a programme called ‘Total Information Awareness’ aimed at integrating vast amounts of surveillance data in order to detect threats. The programme was discontinued due to concerns about it resulting in a mass surveillance system, although many of its components still exist as part of different programmes. Some of its components included the analysis of social networks created using telecommunication data, ‘human identification at a distance’, as well as text and speech analysis technology, data mining and various biological activities.

The analysis of transactions is not the entire story. Surveillance technology allows information to be gathered in the streets – from security or traffic cameras – with automatic reading of registration plates, and to some limited extent, recognition of faces. Other traits, such as gait, are the object of intense research.

Then there is the entire DNA collection topic, which has been the subject of a recent ruling by the European Union. We will leave this topic aside today.

What to do about it is a matter of civil rights adaptation for the next decade. Are privacy, anonymity, confidentiality, autonomy basic civil / human rights? Can they be waived under various conditions? What kind of society are we creating for the information age, and is the collective entity that we are forming going to begin oppressing us? Systems can arise without anyone actually designing them.

“My own hunch is that Big Brother, if he comes to the United States, will turn out to be not a greedy power-seeker but a relentless bureaucrat obsessed with efficiency” (Vance Packard, 1966)


[1] The legalities of this are not my concern; they will be discussed by Andrew Charlesworth. I am going to discuss only what is technically possible. This – in my opinion – should form the basis for any conceptual framework to think about personal information.