NameDS

Introduction

NameDS is a dataset for testing and evaluating name matching and name search functions.

Name matching is the problem of deciding whether two strings identify the same named entity, regardless of name variations and variants.

Entity name search uses a name as the keyword for finding an entity. A name search function takes a user-defined query, which should represent a name or one of its variations or variants, and returns a list of entities matching the query, ordered by some similarity measure.
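The search behaviour described above can be sketched as follows. This is only an illustration, not the method used by the dataset authors: the `similarity` measure (Python's `difflib.SequenceMatcher`), the `threshold` value, the `name_search` function and the entity ids `e1` and `e2` are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (assumed measure, not the dataset's)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def name_search(query, entities, threshold=0.6):
    """Return (entity_id, score) pairs whose best-matching name reaches the
    threshold, ordered by decreasing similarity to the query."""
    scored = []
    for entity_id, names in entities.items():
        best = max(similarity(query, name) for name in names)
        if best >= threshold:
            scored.append((entity_id, best))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Hypothetical entities, each with a name and its variations.
entities = {
    "e1": ["John Smith", "Jon Smith", "Smith John"],
    "e2": ["Petra Ulmanen"],
}
results = name_search("Jhon Smith", entities)
```

Any other string similarity (edit distance, token-based measures) could be plugged into `similarity` without changing the rest of the sketch.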

Another approach to name search is to use a partial name as the query. This is called autocompletion search: the input is the initial characters of a name, and the output is the list of entities whose names match this prefix.
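A minimal sketch of autocompletion search, assuming that a prefix matches a name when it is a case-insensitive prefix of any of the name's whitespace-separated tokens. The function name `autocomplete` and this matching rule are assumptions for illustration only.

```python
def autocomplete(prefix, names):
    """Return the names having at least one token that starts with the prefix."""
    p = prefix.lower()
    return [
        name
        for name in names
        if any(token.startswith(p) for token in name.lower().split())
    ]


names = ["Petra Ulmanen", "John Smith", "University of Trento"]
matches = autocomplete("ulm", names)
```

Matching on every token, rather than only on the start of the full name, lets a query such as "ulm" retrieve a name whose family name is not the first token.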

Name Variation and Variants

The NameDS dataset contains name variants and name variations of entities, obtained by combining different datasets. All the names have been manually split into tokens and annotated with the corresponding part of name.

A variation is a change in the orthography and/or format of a name, caused by linguistic and/or cultural factors.

E.g. Name: John Smith, Variations: Jon Smith, Smith John

A name variant is an alternative name that highlights a different aspect of the same entity:

E.g. Name: Farrokh Bulsara, Variant: Freddie Mercury (pseudonym)

For a complete account of variants and variations please refer to:

Name Tokenization

Each name in the dataset comes with its "part-of-name" list: each token is annotated with its role in the name. Adjacent tokens belonging to the same part of name are joined together.

The parts of name available differ for each entity type, as Table 2 summarizes.

Table 1. Tokenization Examples
Name	Tokens
John Smith	John [given-name], Smith [family-name]
Garda Lake	Garda [proper-name], Lake [feature-class]
University of Trento	University of [class], Trento [proper-name]

Table 2. Part of Name per Entity Type
Entity Type	Parts of Name
Person	Given Name, Family Name, Middle Name, Qualifier
Location	Proper Name, Feature Class, Qualifier
Organization	Proper Name, Class, Scope, Company Type
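The rule that adjacent tokens with the same part of name are joined together can be sketched as below. The `(token, tag)` pair representation and the function name `merge_parts` are assumptions for illustration; the dataset itself stores this information in its ontology files.

```python
from itertools import groupby


def merge_parts(tagged_tokens):
    """Join runs of adjacent tokens that share the same part-of-name tag."""
    return [
        (" ".join(token for token, _ in group), tag)
        for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1])
    ]


# "University of Trento", tagged token by token as in Table 1.
tagged = [("University", "class"), ("of", "class"), ("Trento", "proper-name")]
merged = merge_parts(tagged)
```

Only adjacent tokens are merged, so two tokens with the same tag separated by a differently tagged token stay separate.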

Source Datasets

In this section we present the structure and the content of our dataset. The requirements for our dataset were:

The dataset must contain complex names (at least three tokens), plus their variants/variations
The dataset should account for different nationalities, i.e., different languages
The dataset should contain entities of various types, especially Person, Location, Facility and Organization

Therefore, the following datasets were used as sources for our own dataset:

JRC-Names, 89625 Person Entities + 3821 Organization Entities;
World Gazetteer, 44262 Location Entities; (© by Stefan Helders www.world-gazetteer.com)
Open Data Piemonte, Medie-Grandi Strutture di Vendita, 8051 Organization Entities.

We randomly chose around 600 entity names per entity type from these three sources, and randomly generated name variations by inserting misspellings and format variations into the original names, using the following heuristics:

Insertion of misspellings, by adding, deleting, editing or switching characters
Deletion of some tokens
Change of token positions
Replacement of tokens by their initial letter
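The four heuristics above can be sketched as follows. This is not the authors' generation script: the function names, the restriction to lowercase ASCII letters and the assumption that a name has at least two characters are all illustrative choices.

```python
import random
import string


def misspell(name: str, rng: random.Random) -> str:
    """Apply one random character edit: add, delete, edit or switch.
    Assumes the name has at least two characters."""
    chars = list(name)
    i = rng.randrange(len(chars))
    op = rng.choice(["add", "delete", "edit", "switch"])
    if op == "add":
        chars.insert(i, rng.choice(string.ascii_lowercase))
    elif op == "delete":
        del chars[i]
    elif op == "edit":
        chars[i] = rng.choice(string.ascii_lowercase)
    else:  # switch the character with a neighbouring one
        j = i + 1 if i + 1 < len(chars) else i - 1
        chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


def delete_token(name: str, index: int) -> str:
    """Drop the token at the given position."""
    tokens = name.split()
    return " ".join(tokens[:index] + tokens[index + 1:])


def move_token(name: str, src: int, dst: int) -> str:
    """Move the token at position src to position dst."""
    tokens = name.split()
    tokens.insert(dst, tokens.pop(src))
    return " ".join(tokens)


def abbreviate_token(name: str, index: int) -> str:
    """Replace the token at the given position by its initial letter."""
    tokens = name.split()
    tokens[index] = tokens[index][0]
    return " ".join(tokens)


rng = random.Random(0)  # fixed seed so that runs are repeatable
variation = misspell("Petra Ulmanen", rng)
```

For instance, `move_token("John Smith", 0, 1)` yields the format variation "Smith John" given as an example earlier in this document.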

We manually split the names into tokens and tagged each token with the part of name it represents.

The dataset is divided into 3 sub-datasets, one for each entity type; Table 3 summarizes their characteristics.

Table 3. Statistics about the Dataset
	Person	Location	Organization	TOTAL
Entities	500	635	566	1701
Names	2019	2044	1895	5958
Name Variants	673	681	566	1920
Name Variations	1346	1363	1129	3838
Average Tokens (per name)	1.85	1.07	1.73	1.49

Ontology

We designed an ontology for representing the relations between entities, names and their tokens. We reused existing vocabulary from other ontologies as much as possible, but we had to add some new terms. In particular, we defined our entity types (Person, Location and Organization) and the token types which were not yet defined (middle name, proper name, class, qualifier, scope).

We also need to represent the relation between a name and its variants and variations. A variant name is linked using swandr:alternativeTo, while a name variation is connected to its original name with owl:sameAs.

Here is the list of the main ontologies that we use to design the dataset:

Dataset for Name Match and Search Evaluation

The dataset also provides data for evaluating three functions: name matching, name search, and autocompletion name search. The input for each of these functions is in a separate ontology file.

The input for the match function is a list of match pairs with their expected results, e.g. 'New York', 'Neww Yorrk', true. In the ontology, a positive match (expected result is true) is represented with the owl:sameAs relation, while a negative match (expected result is false) is represented with owl:differentFrom.
Example:

	nameDS:A103374
	      foaf:name "Petra Ulmanen" .

	nameDS:C103374
	     foaf:name "Petra Uqrlmanen" .
	
	nameDS:C125338
	     foaf:name "Onilo S" .

	nameDS:A103374
	      owl:differentFrom nameDS:C125338 ;
	      owl:sameAs nameDS:C103374 .
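As an illustration of how such match pairs might be used, the sketch below scores a candidate match function against the two pairs from the example above. The matcher (a `difflib.SequenceMatcher` ratio with a 0.8 threshold) and the function names are assumptions; the pairs are written as plain Python tuples rather than parsed from the Turtle files.

```python
from difflib import SequenceMatcher


def names_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Candidate match function under test (assumed, for illustration only)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def accuracy(pairs, match_fn) -> float:
    """Fraction of pairs for which match_fn returns the expected result."""
    correct = sum(1 for a, b, expected in pairs if match_fn(a, b) == expected)
    return correct / len(pairs)


# (name, other name, expected result): owl:sameAs -> True, owl:differentFrom -> False
pairs = [
    ("Petra Ulmanen", "Petra Uqrlmanen", True),
    ("Petra Ulmanen", "Onilo S", False),
]
score = accuracy(pairs, names_match)
```

Because the dataset provides both positive and negative pairs, precision and recall could be computed in the same way by counting true and false positives separately.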

For the name search function, we provide for each name a subset of the queries that should be able to retrieve it. Similarly to name match pairs, the relation between the name and its queries is represented with owl:sameAs. Example:

	nameDS:A103374
	      owl:sameAs "Petr aUlmanekn" , "Petra Uqrlmanen" .
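One possible way to use these queries is to check whether the expected name appears among the top k results of a search function. The sketch below assumes a simple `difflib`-based ranking and a recall-at-k measure; neither is prescribed by the dataset, and the small name index is built by hand from the examples in this section.

```python
from difflib import SequenceMatcher


def rank_names(query, names):
    """Return the name ids ordered by decreasing similarity to the query."""
    def score(name_id):
        return SequenceMatcher(None, query.lower(), names[name_id].lower()).ratio()
    return sorted(names, key=score, reverse=True)


def recall_at_k(queries, names, k=1):
    """Fraction of queries whose expected name id is among the top k results."""
    hits = sum(1 for query, expected in queries if expected in rank_names(query, names)[:k])
    return hits / len(queries)


# Name index and queries taken from the examples in this section.
names = {"A103374": "Petra Ulmanen", "C125338": "Onilo S"}
queries = [("Petr aUlmanekn", "A103374"), ("Petra Uqrlmanen", "A103374")]
recall = recall_at_k(queries, names, k=1)
```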

The name autocompletion function (also called prefix search) differs from the previous name search by the expected type of input: name search assumes that the input is a "complete" query, while autocompletion expects only the first part of a name (the user is still writing the query). Therefore, we listed, for each name, the possible prefixes (autocompletion queries) that should retrieve it. The relation between a name and its prefix is owl:sameAs. Example:

	nameDS:A103374
	      owl:sameAs "ulm" , "petra" , "ulma" , "ulman" , "petr" , "ul" , 
		  "pe" , "ulmane" , "ulmanen" , "pet" .
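The prefixes listed in this example are consistent with taking, for each lowercased token of the name, every prefix of at least two characters. That rule is inferred from the example and is not stated by the dataset authors; the function name `prefix_queries` and the `min_len` parameter are assumptions.

```python
def prefix_queries(name: str, min_len: int = 2) -> set:
    """Return every prefix of length >= min_len of each lowercased token."""
    prefixes = set()
    for token in name.lower().split():
        for end in range(min_len, len(token) + 1):
            prefixes.add(token[:end])
    return prefixes


generated = prefix_queries("Petra Ulmanen")
```

For "Petra Ulmanen" this produces the same ten prefixes as the Turtle example above.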

Table 4 summarizes the characteristics of the resulting 3 datasets.

Table 4. Statistics about the Evaluation Data
	Person	Location	Organization	TOTAL
Match Pairs (positive)	2000	2000	1621	5621
Match Pairs (negative)	2000	2000	2000	6000
Search Queries	1342	1331	1104	3777
Autocompletion Queries	4122	4650	6653	15425

Browsing the Dataset

The dataset is described using the Turtle format. Downloadable files:

ontology.rdf: ontology definition for the dataset
Person dataset person.zip
Location dataset location.zip
Organization dataset organization.zip

Each zip archive contains 4 files:

[type].ttl: file containing all the names and their variations for that entity type (all names are provided with their tokenization),
[type]_match.ttl: file containing match pair list,
[type]_search.ttl: file containing name search queries,
[type]_prefix.ttl: file containing name autocompletion search queries.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

Authors

This dataset was conceived, designed, annotated and distributed by:

For any request relating to this dataset, please contact pane@disi.unitn.it.