Project completed and this blog text edited by: Mauro Bonizzi, Simone Terzi and Andrea Gulfo.

The goals of our work were:

  • Creating interest descriptions: an interest description is the interest name specified by the user, augmented with semantically related words mined from the Wikipedia ontology.
  • Computing Interest Feature Vectors (IFV): we then derive the IFV of each user, which quantifies the user's interest in each topic.

We wrote a Python program that performs the following steps:

  • Retrieve the hobbies from the MySQL database. To do this we used the following query:
    “SELECT DISTINCT dataRich from rawdata where length(dataRich) > 2”,
    where the WHERE clause keeps only the significant records, that is, those whose dataRich value has a length greater than 2.
  • For each tuple returned, extract the hobby name and its category.
  • Send an HTTP request to Wikipedia to obtain the abstract of each hobby extracted in the previous step.
  • The request is as follows:
    https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json
    &redirects&exlimit=max&explaintext&exintro&titles=name_of_hobbies

    In detail, the response is returned in JSON format (format=json).
    The redirects parameter automatically resolves redirects from the given titles.
    The exlimit parameter, set to max, lets a single HTTP request return extracts for several titles (hobbies).
    The titles parameter contains a list of up to 20 hobbies separated by pipes (|), which performs better than sending one request per hobby.
    We then saved the description retrieved for every hobby to a file.

  • To further improve the performance of the program, we stored the hobbies extracted from the database and from Wikipedia in a hash set. This lets us check quickly whether a hobby has already been seen, so that duplicates are skipped.
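
The retrieval steps above can be sketched in Python roughly as follows. This is only an illustration, not the original program: the function names (unique_hobbies, batches, build_url, parse_extracts) and the exact-match deduplication are our assumptions, and the sketch only builds the request URL and parses a response body, leaving the actual HTTP call and the file writing out.

```python
import json
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"
BATCH_SIZE = 20  # the post mentions up to 20 titles per request


def unique_hobbies(names):
    """Drop empty and duplicate names, keeping first-seen order (set lookups)."""
    seen = set()
    result = []
    for name in names:
        name = name.strip()
        if name and name not in seen:
            seen.add(name)
            result.append(name)
    return result


def batches(items, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_url(titles):
    """Build one request URL for a batch of titles, joined with a pipe."""
    params = {
        "action": "query",
        "prop": "extracts",
        "format": "json",
        "redirects": "",    # flag parameter, sent with an empty value
        "exlimit": "max",
        "explaintext": "",  # flag parameter
        "exintro": "",      # flag parameter
        "titles": "|".join(titles),
    }
    return API_URL + "?" + urlencode(params)


def parse_extracts(response_text):
    """Map each page title in a JSON response body to its extract text."""
    pages = json.loads(response_text).get("query", {}).get("pages", {})
    return {page["title"]: page.get("extract", "") for page in pages.values()}
```

Each URL produced by build_url could then be fetched with any HTTP client, and the dictionary returned by parse_extracts written to a file, one description per hobby.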

Mallet is a toolkit for text classification and information extraction.
It returns three different types of output:

  1. Docs in topics
  2. Topics words
  3. Topics in docs

We used the “Topics in docs” file to build the interest vector.
Each row of this file contains:

  1. a doc id, which identifies the line number
  2. top topics, which gives the id of the topic
  3. contribution to doc, which gives the probability of occurrence of the topic in the document

To create the interest vector, we extracted from the Mallet output the probability of occurrence of each topic.
This probability is the “popularity” of the topic under consideration.
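
As a minimal sketch of this step, the Python function below turns rows of a “Topics in docs” file into one vector per document. The row layout is an assumption on our part: whitespace-separated fields, a doc id, a document name without spaces, then alternating topic id and probability pairs, with comment lines starting with “#”. The name parse_doc_topics is hypothetical, and the layout may differ between Mallet versions.

```python
def parse_doc_topics(lines, num_topics):
    """Return {doc_id: [probability of topic 0, ..., topic num_topics - 1]}.

    Assumed row layout: doc_id name topic prob topic prob ...
    Topics missing from a row keep a probability of 0.0.
    """
    vectors = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        fields = line.split()
        doc_id = int(fields[0])
        pairs = fields[2:]  # skip the doc id and the document name
        vector = [0.0] * num_topics
        for topic, probability in zip(pairs[0::2], pairs[1::2]):
            vector[int(topic)] = float(probability)
        vectors[doc_id] = vector
    return vectors


# Hypothetical sample rows, for illustration only.
sample_rows = [
    "#doc name topic proportion ...",
    "0 chess 2 0.7 0 0.2 1 0.1",
    "1 rugby 1 0.9 2 0.1",
]
vectors = parse_doc_topics(sample_rows, num_topics=3)
```

Indexing the vector by topic id, rather than by rank, keeps the vectors of different documents directly comparable.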

For every hobby, the vector expresses which topics are the most common.
For privacy reasons, we masked the ID of each user with a fake ID.

We also created a file containing the friendships between pairs of users, built from the information in the database.
Here too, we used fake IDs to preserve privacy.
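
One simple way to mask the IDs could look like the Python sketch below. It is an assumption, not the scheme we actually used: each real ID gets a sequential fake ID the first time it is seen, and the same mapping is reused so that a user keeps the same fake ID in every file. The names make_pseudonymizer and mask_friendships are hypothetical.

```python
def make_pseudonymizer():
    """Return a function mapping each real ID to a stable sequential fake ID."""
    mapping = {}

    def fake_id(real_id):
        if real_id not in mapping:
            mapping[real_id] = len(mapping) + 1  # fake IDs start at 1
        return mapping[real_id]

    return fake_id


def mask_friendships(pairs, fake_id):
    """Replace both real IDs of every friendship pair with their fake IDs."""
    return [(fake_id(a), fake_id(b)) for a, b in pairs]


# Hypothetical friendship pairs, for illustration only.
fake_id = make_pseudonymizer()
masked = mask_friendships(
    [("alice", "bob"), ("bob", "carol"), ("alice", "carol")], fake_id
)
```

Passing the same fake_id function to the code that writes the interest vectors would keep the two files aligned on the same fake IDs.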
