Project completed and this blog text edited by: Gottardello Samuele, Ketoglo E. Eyiram, Lopa Alessandro and Micheletti Fabio.
We developed software that analyzes the friendship network of a single user who has no geographical information, in order to estimate that user's position from a statistical analysis of the positions of his or her Facebook friends.
To do this we used the following user information:
username – user ID
location – the location specified by the user
locale – user language identifier
The work has been divided into 3 phases:
1. location data cleaning
2. development of the inference algorithm
3. anonymization of the results
For this work we rebuilt the Facebook user graph using the Python library igraph.
1. Data Cleansing
Facebook geographical data are often expressed in formats that cannot be compared with each other, so we developed a technique to solve this problem.
We decided to convert all locations into a single format that stores geographic coordinates (latitude and longitude) as well as a text string.
To do this we used the geocoding service provided by Google (Geocoder V3) through Geopy, a Python library.
This service provides the same geocoding functionality used by Google Maps, and returns all the data we needed in response to a text string describing the position.
Geocoder has a daily limit of 2500 API calls; to stay within it, we first collected all the unique Facebook locations in a hash map and then geocoded each of them only once.
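The deduplication step can be sketched roughly as follows. This is only an illustration, not the project's actual code: the names geocode_unique and geocode are hypothetical, the lowercase normalization is our assumption, and the geocode callable stands in for whatever geocoding client is used (for example a Geopy geocoder), so the sketch runs without network access.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)


def geocode_unique(
    raw_locations: Iterable[Optional[str]],
    geocode: Callable[[str], Optional[Coordinates]],
) -> Dict[str, Optional[Coordinates]]:
    """Geocode each distinct location string exactly once.

    `geocode` is a hypothetical callable wrapping the real geocoding
    service; it returns (lat, lon) or None when nothing is found.
    """
    cache: Dict[str, Optional[Coordinates]] = {}
    for raw in raw_locations:
        if not raw:
            continue  # this user did not specify a location
        # Assumption: trimming and lowercasing is enough to merge duplicates.
        key = raw.strip().lower()
        if key and key not in cache:
            cache[key] = geocode(key)  # one API call per unique string
    return cache
```

With this layout the number of API calls depends on the number of distinct location strings rather than on the number of users, which is what keeps the job inside a daily quota.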
2. Development of the Inference Algorithm
To determine a user’s location the algorithm:
– Selects a node without a location from the ego network of a user with a location;
– Finds the most common location in this node’s ego network and assigns it to the node as its new location.
For example, suppose node A has 10 neighbors: 4 with location USA, 3 with ITALY and 3 without a location.
The algorithm detects two possible locations for A, USA and ITALY, and then evaluates which is more frequent in A’s ego network.
Since USA occurs 4 times and ITALY only 3, USA is inferred as node A’s new location.
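The majority vote in this example can be written in a few lines. The sketch below is a minimal illustration under our own assumptions: infer_location is a hypothetical name, missing locations are represented as None, and ties go to the location seen first, since the post does not say how ties are handled.

```python
from collections import Counter
from typing import Iterable, Optional


def infer_location(neighbor_locations: Iterable[Optional[str]]) -> Optional[str]:
    """Return the most frequent known location among a node's neighbors."""
    # Neighbors without a location (None) do not take part in the vote.
    counts = Counter(loc for loc in neighbor_locations if loc is not None)
    if not counts:
        return None  # no neighbor has a location, so nothing can be inferred
    # Assumption: on a tie, the location encountered first wins.
    return counts.most_common(1)[0][0]


# Node A from the example: 4 x USA, 3 x ITALY, 3 neighbors without location.
neighbors_of_a = ["USA"] * 4 + ["ITALY"] * 3 + [None] * 3
print(infer_location(neighbors_of_a))  # USA
```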
We produced a graphical representation of the results using Pygmaps, a Python library that creates HTML files containing a map based on the Google Maps service.
We generated two maps showing the locations of the users in the graph before and after applying the inference function.
To keep points with the same coordinates from overlapping, we applied a function that adds a random offset to duplicated inferred coordinates.
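A random offset of this kind might look like the sketch below. It is an assumption-laden illustration rather than the function we actually used: the name jitter_duplicates, the max_offset value (in degrees) and the choice to leave the first occurrence of each coordinate untouched are all hypothetical.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


def jitter_duplicates(
    points: List[Point],
    max_offset: float = 0.01,  # assumed maximum shift, in degrees
    seed: Optional[int] = None,
) -> List[Point]:
    """Shift every repeated coordinate by a small random offset."""
    rng = random.Random(seed)
    seen = set()
    result = []
    for lat, lon in points:
        if (lat, lon) in seen:
            # Duplicate: move it slightly so the markers do not overlap.
            lat += rng.uniform(-max_offset, max_offset)
            lon += rng.uniform(-max_offset, max_offset)
        else:
            seen.add((lat, lon))  # first occurrence keeps its exact position
        result.append((lat, lon))
    return result
```

Passing a seed makes the generated map reproducible between runs; leaving it as None gives a different layout each time.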
3. Data Anonymization
To publish our results, we subjected the data to an anonymization process.
In particular, we anonymized the Facebook user IDs and the more detailed parts of each location, such as streets and cities.
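One simple way to do both things is sketched below. It does not describe our exact procedure: make_pseudonymizer and coarsen_location are hypothetical helpers, the sequential numbering is one possible pseudonym scheme, and the location rule assumes strings of the form "street, city, country" where the last component is the coarsest.

```python
from typing import Callable, Dict, Optional


def make_pseudonymizer() -> Callable[[str], int]:
    """Return a function mapping each real user ID to a stable anonymous number."""
    mapping: Dict[str, int] = {}

    def pseudonym(user_id: str) -> int:
        if user_id not in mapping:
            mapping[user_id] = len(mapping) + 1  # next unused number
        return mapping[user_id]

    return pseudonym


def coarsen_location(location: Optional[str]) -> Optional[str]:
    """Keep only the last comma-separated part of a location string.

    Assumption: the last part is the coarsest one (for example the country).
    """
    if not location:
        return None
    parts = [p.strip() for p in location.split(",") if p.strip()]
    return parts[-1] if parts else None
```

The mapping from real IDs to pseudonyms lives only in memory here, so it is lost once the script ends and the published file cannot be trivially reversed.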
Results were recorded in a CSV file with 4 columns:
USER_ID | LOCATION | EXISTS | LOCALE
USER_ID is the user’s anonymous ID;
LOCATION is the user’s anonymized location;
EXISTS indicates what information is contained within the location;
LOCALE (not always present) is the corresponding user’s locale.
Of about 650,000 users, only 15,776 have a location; starting from these, we inferred the locations of 10,761 users.
Note that a location can also be inferred for the remaining nodes by starting from the nodes whose location has already been inferred.
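Such a repeated inference could be sketched as below. We did not run this extension, so treat it as a hypothetical outline: propagate_locations is an invented name, the graph is a plain adjacency dictionary rather than an igraph object, and the max_rounds cap and first-seen tie-breaking are our assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional


def propagate_locations(
    graph: Dict[str, List[str]],
    locations: Dict[str, Optional[str]],
    max_rounds: int = 10,  # assumed safety cap on the number of passes
) -> Dict[str, Optional[str]]:
    """Repeat the majority vote until no new location can be inferred."""
    current = dict(locations)  # do not modify the caller's data
    for _ in range(max_rounds):
        updates: Dict[str, str] = {}
        for node, neighbors in graph.items():
            if current.get(node) is not None:
                continue  # already located, either originally or by inference
            counts = Counter(
                current[n] for n in neighbors if current.get(n) is not None
            )
            if counts:
                updates[node] = counts.most_common(1)[0][0]
        if not updates:
            break  # nothing changed in this pass, so stop
        # Apply all updates at once so a pass only uses the previous pass's results.
        current.update(updates)
    return current
```

Each pass uses locations inferred in earlier passes, so errors can compound the further a node is from a user with a real location.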