Semantifying Delicious

Introduction

The tags2con dataset was created manually by a group of human annotators who linked del.icio.us tags to their real meaning.

The tags2con dataset was built from a subset of a Delicious dump: a set of 1681 user-bookmark pairs was selected, and all the tags used in these pairs were manually cleaned and disambiguated to WordNet synsets.

The dataset includes annotations from users who have fewer than 1000 tags and have used at least ten different tags in five different website domains. This upper bound was chosen because Delicious is also subject to spamming: as the original authors of the crawled data assumed, users with more than one thousand tags could be spammers or could have machine-generated tags.

Furthermore, only user-bookmark pairs that have at least three tags (to provide diversity in the gold standard) and no more than ten tags (to allow timely manual evaluation) were selected. Only URLs that have been used by at least ten users were considered, in order to provide enough overlap between users. After retrieving all the user-bookmark pairs that comply with these constraints, we randomly selected 1681 pairs with the following method: 500 pairs were selected purely at random, and 1181 pairs were selected randomly, in equal distribution, from the pairs that overlapped with one of the following DMOZ topics: Top/Home/Cooking, Top/Recreation/Travel or Top/Reference/Education. Table 1 summarizes the characteristics of the resulting subset of the dataset.
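The filtering constraints described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure the authors actually ran: the function name `select_candidate_pairs` and the `(user, url, tags)` input layout are hypothetical, both tag bounds on users are assumed to count distinct tags, and "five different website domains" is read as "at least five". The random sampling step and the DMOZ topic matching are not covered.

```python
from collections import defaultdict
from urllib.parse import urlparse


def select_candidate_pairs(bookmarks):
    """Keep the (user, url, tags) triples that satisfy the constraints above.

    `bookmarks` is an iterable of (user, url, tags) tuples, where `tags`
    is the list of tag strings the user attached to that URL.
    (Hypothetical input layout, chosen for this sketch only.)
    """
    bookmarks = list(bookmarks)
    tags_per_user = defaultdict(set)     # distinct tags used by each user
    domains_per_user = defaultdict(set)  # website domains bookmarked by each user
    users_per_url = defaultdict(set)     # users who bookmarked each URL

    for user, url, tags in bookmarks:
        tags_per_user[user].update(tags)
        domains_per_user[user].add(urlparse(url).netloc)
        users_per_url[url].add(user)

    def user_is_eligible(user):
        # Assumption: both tag bounds count distinct tags, and the domain
        # constraint means "at least five" domains.
        return (10 <= len(tags_per_user[user]) < 1000
                and len(domains_per_user[user]) >= 5)

    return [
        (user, url, tags)
        for user, url, tags in bookmarks
        if 3 <= len(tags) <= 10                # per-pair tag bounds
        and len(users_per_url[url]) >= 10      # URL shared by enough users
        and user_is_eligible(user)
    ]
```

The per-user and per-URL counts are computed in a first pass so that each pair can then be checked independently of the others.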

Table 1. Statistics about the Dataset
Item                               Count
<r, u> pairs                       1681
total number of tags               7323
average number of tags per pair    4.35
unique tags                        1569
unique URLs                        739
unique users                       1397
website domains                    603

Each tag has been split into lemmatized tokens, and each token has then been linked to its meaning in the WordNet 3.0 ontology.

Browsing the Dataset

All resource identifiers defined by this dataset are dereferenceable.

The dataset is described using VOID, either in N3 format or in RDF format.

Some examples of the dataset can be found at:

We also provide some entry points to the set of instances in the dataset:

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

Dataset Vocabulary

Figure 1: RDF Mapping for the Semantified Delicious dataset.

While Newman's tagging ontology and the SCOT extension can represent the tripartite graph model of folksonomies, they do not discriminate between a tag and a concept.

In Figure 1, we propose an extension to Newman's ontology in which a tags:Tag can be split into a number of tags2con:Token instances, each of which then links to its actual meaning in a knowledge organisation system (in this case a SKOS:Concept) that can be used in reasoning. For compatibility with existing RDF models, which widely use Newman's tags:Tag class, this proposal also uses that class. However, we believe that this blurs the distinction between the linguistic layer of the folksonomy and its conceptual layer, which can lower the accuracy of reasoning services based on this data. We would therefore recommend dropping this compatibility in the future.
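To make the tag, token and concept layers concrete, here is a minimal in-memory sketch of the model in Python. Only the three class roles (tags:Tag, tags2con:Token, SKOS:Concept) come from the text above; the field names, the helper `concepts_of`, the example tag and the example URIs are all hypothetical placeholders and do not reflect the actual property names or identifiers used in the dataset.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Concept:
    """Stands in for a SKOS:Concept, e.g. a WordNet 3.0 synset."""
    uri: str


@dataclass
class Token:
    """Stands in for a tags2con:Token: one lemmatized piece of a tag."""
    lemma: str
    meaning: Optional[Concept] = None  # None when no sense was assigned


@dataclass
class Tag:
    """Stands in for a tags:Tag: the raw string typed by the user."""
    label: str
    tokens: List[Token] = field(default_factory=list)


def concepts_of(tag: Tag) -> List[str]:
    """Return the concept URIs reachable from a tag through its tokens."""
    return [t.meaning.uri for t in tag.tokens if t.meaning is not None]


# Hypothetical example: one tag split into two tokens, each with a
# placeholder concept URI (not a real identifier from the dataset).
example = Tag(
    label="javaisland",
    tokens=[
        Token("java", Concept("http://example.org/concept/java-island")),
        Token("island", Concept("http://example.org/concept/island")),
    ],
)
```

Keeping the concept on the token rather than on the tag mirrors the separation between the linguistic layer and the conceptual layer argued for above.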

In addition to the tags2con extension (rdf, n3, Ontology Browser), the main ontologies that we use to distribute the dataset are:

Agreement

In order to guarantee the correctness of the assignment of tag splits and tag token senses, two different validators validated all tags of each user-bookmark pair.

The "agreement without chance correction" between users in the token disambiguation task is 0.81.
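The page does not spell out how this figure was computed. A common reading of "agreement without chance correction" is the raw proportion of items on which both annotators made the same choice, and the sketch below assumes that reading. The function name `observed_agreement` and the sense labels in the usage example are hypothetical.

```python
def observed_agreement(labels_a, labels_b):
    """Proportion of items on which two annotators chose the same label.

    Assumes both sequences are aligned item by item; no correction for
    chance agreement is applied.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("at least one item is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical sense labels for five tokens from two validators.
validator_1 = ["java#1", "island#1", "recipe#1", "travel#2", "school#1"]
validator_2 = ["java#1", "island#1", "recipe#1", "travel#1", "school#1"]
agreement = observed_agreement(validator_1, validator_2)  # 4 of 5 items match
```

Unlike chance-corrected measures such as Cohen's kappa, this value does not account for agreement that would occur if both annotators picked senses at random.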

Some Statistics

In the current distribution, we only include the annotations where two validators have agreed on the sense and split of the tags.

The current dataset contains:

Authors

This dataset was conceived, designed, annotated and distributed by:

  1. Pierre Andrews
  2. Juan Pane
  3. Ilya Zaihrayeu

For any request relating to this dataset, please contact andrews@disi.unitn.it.

Publications

You can cite:

Linked Open Data (LOD) Cloud

tags2con in the LOD Cloud

The dataset that we are distributing here follows the Linked Data principles and also tries to fulfill the requirements of the LOD Cloud:

In addition, the tags2con dataset has been registered on CKAN.

Tools for Annotation

Upcoming: The tools that were used to create this dataset will soon be available as open source.

With the support of:

http://www.dit.unitn.it/~knowdive/
insemtives.eu
FP7-231181