Note that an improved version of this article has been published in Natural Hazards Observer, Volume XXXVI, Number 4, pp. 7-9, March 2012.
Vasileios Lampos and Nello Cristianini
Intelligent Systems Laboratory
University of Bristol
Abstract. Real-time monitoring of environmental and social conditions is an important part of developing early warnings of natural hazards such as epidemics and floods. Rather than relying on dedicated infrastructure, such as sensor networks, it is possible to gather valuable information by monitoring public communications from people on the ground. A rich source of raw data is provided by social media, such as blogs, Twitter or Facebook. In this study we describe two experiments based on Twitter content in the UK, showing that it is possible to detect a flu epidemic, and to assess the levels of rainfall, by analysing text data. These measurements can in turn be used as inputs to more complex systems, for example for the prediction of floods or disease propagation.
The fast expansion of the social web that is currently under way means that large numbers of people can publish their thoughts at no cost. Current estimates put the number of Facebook users at 800 million and of active Twitter users at 100 million [1, 2]. The result is a massive stream of digital text that has attracted the attention of marketers, politicians and social scientists. By analysing the stream of communications in an unmediated way, without relying on questionnaires or interviews, many scientists are gaining direct access to people’s opinions and observations for the first time. Perhaps equally important, they have access, although indirectly, to situations on the ground that affect web users, such as extreme weather conditions, as long as these are mentioned in the messages being published.
The analysis of social media content is a statistical game, as there is no guarantee that a specific user will describe the weather state in her current location when we need it. But by gathering a large number of messages from a given location, and by monitoring the right keywords and expressions, it is possible to obtain indirect statistical evidence in favour of a given weather state. In this article we describe two experiments that we have conducted using Twitter content in the United Kingdom, showing that it can be used to infer the levels of rainfall or of influenza-like illness (ILI) in a given location, with significant accuracy. The enabling technology behind this study is Statistical Learning Theory, a branch of Artificial Intelligence concerned with the automatic detection of statistical patterns in data.
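The idea of turning many messages from one location into a single piece of statistical evidence can be illustrated with a minimal sketch. It is not the method used in the experiments: the marker terms, the equal weighting, the function name keyword_score and the toy messages are all assumptions made for illustration, and nothing here is learned from data.

```python
# Hypothetical marker terms; the article does not list the terms it monitors.
MARKERS = ("rain", "raining", "umbrella", "drizzle")


def keyword_score(messages, markers=MARKERS):
    """Average, over all messages, of the fraction of marker terms each one mentions."""
    if not messages:
        return 0.0
    total = 0.0
    for text in messages:
        # Crude whitespace tokenisation; punctuation stays attached to tokens.
        tokens = set(text.lower().split())
        hits = sum(1 for marker in markers if marker in tokens)
        total += hits / len(markers)
    return total / len(messages)


# Toy messages standing in for one day of tweets from one location.
day_messages = [
    "Raining again in Bristol, forgot my umbrella",
    "Lovely lunch today",
    "rain rain go away",
    "Stuck in traffic",
]
score = keyword_score(day_messages)
print(score)
```

A single message says little, but the score averaged over many messages rises on days when more users mention the marker terms. In a statistical learning setting, the choice of terms and their weights would be fitted against ground-truth measurements rather than fixed by hand as they are here.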
The use of Twitter data is particularly convenient because its users can only exchange very short messages that are often geo-located, and because this data is freely available via an API. Furthermore, the use of this data does not raise the serious privacy concerns that would be raised by the analysis of, say, email or SMS messages, as this is all data that the users have willingly made public.
We believe that the kind of signal that we can extract from this textual stream can be of interest in its own right, and can also serve as a valuable input to more complex modelling software aimed at the prediction of epidemics, floods and other hazards.