Trees and Tweets: Mining Billions to Understand Human Migration and Regional Linguistic Variation

The project is funded by a Digging into Data Challenge research grant (2014-2016) and led by Dr. Diansheng Guo (University of South Carolina, U.S.), Dr. Alice Kasakoff (University of South Carolina, U.S.), and Dr. Jack Grieve (Aston University, U.K.). The project team also includes Caglar Koylu, Yuan Huang, Chao Chen, and Andrea Nini.

The goal of the project is to collect, clean, and analyze big data from family trees and tweets to understand large-scale migration patterns and language variation in the US and UK. Our research curates and analyzes large, dynamic, and accurately geocoded data; bridges two distinct domains (language and migration) usually studied by separate disciplines; and produces both data and methodologies that will facilitate research on a broad range of topics in the humanities and social sciences.


With more and more historical data sources being digitized and shared, interest in genealogy has grown rapidly among both academic scholars and the general public. Members of the public have been using the available digital sources to trace their ancestors, build family trees, and share those trees on genealogy websites. One of the world’s largest genealogy companies has accumulated information about millions of individuals in user-contributed family trees. Locations and dates of birth, death, and residence are available for many individuals as well as for their descendants.


Similarly, linguists have begun to exploit online data to better understand language variation and change. One particularly large data source is Twitter. As of March 2012, there were 200 million active Twitter users worldwide and 340 million tweets published per day. The main advantages of tweets are their accurate timestamps and locations and their sheer volume. Unlike most written data, tweets also appear to be more similar to spontaneous speech. Moreover, by analyzing the locations of tweets over a period of time, we can infer daily human mobility patterns.