NameDS

Introduction

NameDS is a dataset for testing and evaluating name matching and name search functions.

The problem of name matching is to understand whether two strings identify the same named entity regardless of name variations and variants.

The problem of entity name search consists of using a name as the keyword for finding an entity. A name search function takes a user-defined query, which should represent a name or one of its variations or variants, and returns a list of entities matching the query, ordered by some similarity measure.
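
As an illustration of the input and output just described, here is a minimal sketch of a ranked name search. It is not the function evaluated with this dataset: the similarity measure (`difflib.SequenceMatcher`), the function names and the tiny `ENTITIES` index are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search_names(query, entities, limit=5):
    """Return up to `limit` (entity_id, name, score) tuples, best score first."""
    scored = [(eid, name, similarity(query, name)) for eid, name in entities.items()]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:limit]


# Hypothetical mini-index: entity identifier -> name.
ENTITIES = {
    "e1": "John Smith",
    "e2": "Garda Lake",
    "e3": "University of Trento",
}

# A misspelled query should still rank the intended entity first.
results = search_names("Jon Smith", ENTITIES, limit=2)
for entity_id, name, score in results:
    print(entity_id, name, round(score, 3))
```

Any other similarity measure can be swapped in by replacing `similarity`; only the ordering of the returned list depends on it.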

Another approach to name search is to look up an entity using a partial name as the query. This is called autocompletion search: the input is the initial characters of a name, and the output is the list of entities whose name matches this prefix.
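
A prefix lookup of this kind can be sketched as follows. The example assumes that a prefix may match the beginning of any token of the name, not only the first one, and that matching is case-insensitive; both choices, as well as the function name and the `NAMES` list, are assumptions for illustration.

```python
def autocomplete(prefix, names):
    """Return the names in which at least one token starts with `prefix`."""
    needle = prefix.strip().lower()
    if not needle:
        # An empty query gives no information, so nothing is returned.
        return []
    return [
        name
        for name in names
        if any(token.startswith(needle) for token in name.lower().split())
    ]


# Hypothetical list of indexed names.
NAMES = ["John Smith", "Garda Lake", "University of Trento"]

print(autocomplete("smi", NAMES))
print(autocomplete("tr", NAMES))
```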

 

Name Variation and Variants

The NameDS dataset contains name variants and name variations of entities obtained by combining different datasets. All the names have been manually split into tokens and annotated with the corresponding part of name.

A variation is a change in the orthography and/or format of a name caused by linguistic and/or cultural factors.

E.g. Name: John Smith, Variations: Jon Smith, Smith John

A name variant is an alternative name that highlights a different aspect of the same entity:

E.g. Name: Farrokh Bulsara, Variant: Freddie Mercury (pseudonym)

For a complete account of variants and variations please refer to:

Name Tokenization

Each name in the dataset is provided with its "part-of-name" list: each token is enriched with semantic knowledge about its role in the name. Adjacent tokens belonging to the same part-of-name are joined together.

The parts of name available differ for each entity type, as Table 2 summarizes.

Table 1. Tokenization Examples
Name                  Tokens
John Smith            John [given-name], Smith [family-name]
Garda Lake            Garda [proper-name], Lake [feature-class]
University of Trento  University of [class], Trento [proper-name]

Table 2. Parts of Name per Entity Type
Entity Type   Parts of Name
Person        Given Name, Family Name, Middle Name, Qualifier
Location      Proper Name, Feature Class, Qualifier
Organization  Proper Name, Class, Scope, Company Type
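
To make the tokenization concrete, the sketch below models a tokenized name as a list of (text, part-of-name) pairs and checks the labels against Table 2. It is only an in-memory illustration, not the dataset's actual serialization; the class and function names are hypothetical, and the hyphenated label spellings (e.g. "middle-name", "company-type") are assumed by analogy with the labels shown in Table 1.

```python
from dataclasses import dataclass

# Parts of name allowed for each entity type, copied from Table 2.
# The hyphenated spelling of the labels is an assumption.
PARTS_OF_NAME = {
    "Person": {"given-name", "family-name", "middle-name", "qualifier"},
    "Location": {"proper-name", "feature-class", "qualifier"},
    "Organization": {"proper-name", "class", "scope", "company-type"},
}


@dataclass(frozen=True)
class Token:
    text: str  # one or more adjacent words sharing the same part of name
    part: str  # label such as "given-name"


def validate(entity_type, tokens):
    """Return True if every token uses a part of name allowed for the entity type."""
    allowed = PARTS_OF_NAME[entity_type]
    return all(token.part in allowed for token in tokens)


# "University of" is a single token because both words belong to the same part.
trento = [Token("University of", "class"), Token("Trento", "proper-name")]
smith = [Token("John", "given-name"), Token("Smith", "family-name")]

print(validate("Organization", trento))
print(validate("Person", smith))
```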

Source Datasets

In this section we present the structure and the content of our dataset. The requirements for our dataset were:

  1. The dataset must contain complex names (at least three tokens), plus their variants/variations
  2. The dataset should account for different nationalities, i.e., different languages
  3. The dataset should contain entities of various types, especially Person, Location, Facility and Organization

Therefore, the following datasets were used as sources for our own dataset:

  1. JRC-Names, 89,625 Person Entities + 3,821 Organization Entities;
  2. World Gazetteer, 44,262 Location Entities; (© by Stefan Helders www.world-gazetteer.com)
  3. Open Data Piemonte, Medie-Grandi Strutture di Vendita, 8,051 Organization Entities.

We randomly chose around 600 entity names per entity type from these three sources and randomly generated the name variations by inserting misspellings and format variations into the original names, using the following heuristics:

We manually split the names into tokens and tagged each token with the part of name it represents.
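
The exact heuristics used to generate the variations are not reproduced in this section, so the following is only a sketch of what a misspelling and a format variation might look like. The two operations shown (inserting one random letter, reversing the token order) and all function names are assumptions, chosen to resemble examples such as "Petra Uqrlmanen" and "Smith John".

```python
import random
import string


def insert_typo(name, rng):
    """Insert one random lowercase letter at a random position inside the name."""
    position = rng.randrange(1, len(name))
    return name[:position] + rng.choice(string.ascii_lowercase) + name[position:]


def swap_tokens(name):
    """Format variation: reverse the token order, e.g. 'John Smith' -> 'Smith John'."""
    return " ".join(reversed(name.split()))


rng = random.Random(42)  # fixed seed so the sketch is reproducible
typo = insert_typo("Petra Ulmanen", rng)
swapped = swap_tokens("John Smith")

print(typo)
print(swapped)
```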

The dataset is divided into 3 sub-datasets, one for each entity type; their characteristics are summarized in Table 3.

Table 3. Statistics about the Dataset
                           Person  Location  Organization  TOTAL
Entities                   500     635       566           1701
Names                      2019    2044      1895          5958
Name Variants              673     681       566           1920
Name Variations            1346    1363      1129          3838
Average Tokens (per name)  1.85    1.07      1.73          1.49

Ontology

We designed an ontology for representing the relations between entities, names and their tokens. We reused existing vocabulary from other ontologies as much as possible, but we had to add some new terms. In particular, we defined our entity types (Person, Location and Organization) and the token types that were not yet defined (middle name, proper name, class, qualifier, scope).

We also need to represent the relation between a name and its variants and variations. A name variant is connected to its name using swandr:alternativeTo, while a name variation is connected to its original name with owl:sameAs.

Here is the list of the main ontologies that we use to design the dataset:

Dataset for Name Match and Search Evaluation

The dataset also provides data for evaluating three functions: name match, name search, and autocompletion name search. The input for each of these functions is in a separate ontology file.

The input for the match function is a list of match pairs and their expected results, e.g. 'New York', 'Neww Yorrk', true. In the ontology, a positive match (expected result is true) is represented with the owl:sameAs relation, while a negative match (expected result is false) uses owl:differentFrom.
Example:

	nameDS:A103374
	      foaf:name "Petra Ulmanen" .

	nameDS:C103374
	     foaf:name "Petra Uqrlmanen" .
	
	nameDS:C125338
	     foaf:name "Onilo S" .

	nameDS:A103374
	      owl:differentFrom nameDS:C125338 ;
	      owl:sameAs nameDS:C103374 .
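
A match function could be scored against such pairs as sketched below. The pairs are the ones quoted in this section, written as plain tuples rather than parsed from the Turtle files; the matcher (a `difflib.SequenceMatcher` ratio with a 0.8 threshold) and the function names are assumptions, not part of the dataset.

```python
from difflib import SequenceMatcher


def is_match(a, b, threshold=0.8):
    """Assumed matcher: two names match if their similarity ratio reaches the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def accuracy(pairs, matcher):
    """Fraction of (name_a, name_b, expected) pairs the matcher classifies correctly."""
    correct = sum(1 for a, b, expected in pairs if matcher(a, b) == expected)
    return correct / len(pairs)


PAIRS = [
    ("Petra Ulmanen", "Petra Uqrlmanen", True),  # owl:sameAs
    ("Petra Ulmanen", "Onilo S", False),         # owl:differentFrom
    ("New York", "Neww Yorrk", True),            # positive pair quoted in the text
]

score = accuracy(PAIRS, is_match)
print(score)
```

Keeping positive and negative pairs in the same list makes it easy to report precision and recall separately if needed.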
	  

For the name search function, we provide for each name a subset of the queries that should be able to retrieve it. Similarly to name match pairs, the relation between the name and its queries is represented with owl:sameAs. Example:

	nameDS:A103374
	      owl:sameAs "Petr aUlmanekn" , "Petra Uqrlmanen" .
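
One way to use these queries is to check how often the expected name appears among the top results of a search function. The sketch below computes this as recall at k; the ranking function, the `INDEX` and `QUERIES` dictionaries and the metric itself are assumptions for illustration, with the identifiers and strings taken from the examples in this section.

```python
from difflib import SequenceMatcher


def rank(query, index):
    """Return identifiers ordered by decreasing similarity between query and name."""
    def score(identifier):
        return SequenceMatcher(None, query.lower(), index[identifier].lower()).ratio()
    return sorted(index, key=score, reverse=True)


def recall_at_k(expected_queries, index, k=1):
    """Fraction of queries whose expected identifier appears in the top-k results."""
    hits = total = 0
    for identifier, queries in expected_queries.items():
        for query in queries:
            total += 1
            hits += identifier in rank(query, index)[:k]
    return hits / total if total else 0.0


# Identifier -> name, and identifier -> queries that should retrieve it.
INDEX = {"A103374": "Petra Ulmanen", "C125338": "Onilo S"}
QUERIES = {"A103374": ["Petr aUlmanekn", "Petra Uqrlmanen"]}

recall = recall_at_k(QUERIES, INDEX, k=1)
print(recall)
```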
	

The name autocompletion function (also called prefix search) differs from the previous name search in the expected type of input: name search assumes that the input is a "complete" query, while autocompletion expects only the first part of a name (the user is still writing the query). Therefore, we listed, for each name, the possible prefixes (autocompletion queries) that should retrieve it. The relation between a name and its prefix is owl:sameAs. Example:

	nameDS:A103374
	      owl:sameAs "ulm" , "petra" , "ulma" , "ulman" , "petr" , "ul" , 
		  "pe" , "ulmane" , "ulmanen" , "pet" .
	

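The prefixes in the example above are consistent with a simple rule: lowercase the name, split it into tokens, and take every prefix of each token that has at least two characters. This rule is inferred from that single example and is not a documented part of the dataset; the function name and the `min_length` parameter are hypothetical.

```python
def prefixes(name, min_length=2):
    """Return all token prefixes of `name` with at least `min_length` characters."""
    result = set()
    for token in name.lower().split():
        for end in range(min_length, len(token) + 1):
            result.add(token[:end])
    return result


queries = prefixes("Petra Ulmanen")
print(sorted(queries))
```

For "Petra Ulmanen" this produces the same ten strings listed in the example for nameDS:A103374.
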
Table 4 summarizes the characteristics of the resulting 3 datasets.

Table 4. Statistics about the Dataset
                        Person  Location  Organization  TOTAL
Match Pairs (positive)  2000    2000      1621          5621
Match Pairs (negative)  2000    2000      2000          6000
Search Queries          1342    1331      1104          3777
Autocompletion Queries  4122    4650      6653          ?

Browsing the Dataset

The dataset is described using the Turtle format. Downloadable files:

Each zip archive contains 3 files:

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.

Authors

This dataset was conceived, designed, annotated and distributed by:

  1. Enrico Bignotti
  2. Stella Margonar
  3. Juan Pane
  4. Enzo Maltese

For any request relating to this dataset, please contact pane@disi.unitn.it.

With the support of:

http://www.dit.unitn.it/~knowdive/