Introduction
The purpose of this dataset is to support testing and evaluating name matching and name search functions.
The problem of name matching is to understand whether two strings identify the same named entity regardless of name variations and variants.
The problem of entity name search consists in using a name as the keyword for searching for an entity. The input of a name search function is a user-defined query, which should represent a name or one of its variations or variants; the function must return a list of entities matching the query, ordered by some similarity measure.
Another approach to name search is to search for an entity using a partial name as the query. In this case we talk about autocompletion search: the input should be the initial characters of a name, and the output is the list of entities whose name matches this prefix.
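The three functions described above can be sketched as follows. This is only a minimal illustration, not the method used to build or evaluate the dataset: the function names, the thresholds, and the choice of Python's difflib ratio as the similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (assumed measure: difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Name matching: do the two strings identify the same entity?"""
    return similarity(a, b) >= threshold


def search(query: str, names: list[str], threshold: float = 0.6) -> list[str]:
    """Name search: names similar enough to the query, best match first."""
    scored = [(similarity(query, name), name) for name in names]
    scored.sort(key=lambda pair: -pair[0])
    return [name for score, name in scored if score >= threshold]


def autocomplete(prefix: str, names: list[str]) -> list[str]:
    """Autocompletion: names in which some token starts with the prefix."""
    p = prefix.lower()
    return [n for n in names if any(t.startswith(p) for t in n.lower().split())]


names = ["Petra Ulmanen", "Onilo S", "Garda Lake"]
print(match("New York", "Neww Yorrk"))
print(search("Petra Uqrlmanen", names))
print(autocomplete("ulm", names))
```

A real system would likely replace the similarity function with something aware of name structure (token order, initials), which is exactly what the dataset is meant to test.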
Name Variation and Variants
The NameDS dataset contains name variants and name variations of entities obtained by combining different datasets. All the names have been manually split into tokens and annotated with the corresponding part of name.
A variation is a change in the name orthography and/or format caused by linguistic and/or cultural factors.
A name variant is an alternative name, highlighting different aspects of the same entity:
For a complete account of variants and variations please refer to:
Name Tokenization
Each name in the dataset is provided with its "part-of-name" list: each token is enriched with semantic knowledge about its role in the name. Adjacent tokens belonging to the same part-of-name are joined together.
The available parts of name differ for each entity type, as Table 2 summarizes.
Name | Tokens |
---|---|
John Smith | John [given-name], Smith [family-name] |
Garda Lake | Garda [proper-name], Lake [feature-class] |
University of Trento | University of [class], Trento [proper-name] |
Entity Type | Parts of Name |
---|---|
Person | Given Name, Family Name, Middle Name, Qualifier |
Location | Proper Name, Feature Class, Qualifier |
Organization | Proper Name, Class, Scope, Company Type |
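The tokenization scheme in the two tables above could be represented in code as sketched below. The tag spellings for given-name, family-name, proper-name, feature-class and class follow the examples in the first table; the remaining spellings (middle-name, qualifier, scope, company-type), the Token class and the tokenize function are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass

# Parts of name allowed for each entity type (from the table above).
PARTS_OF_NAME = {
    "Person": {"given-name", "family-name", "middle-name", "qualifier"},
    "Location": {"proper-name", "feature-class", "qualifier"},
    "Organization": {"proper-name", "class", "scope", "company-type"},
}


@dataclass(frozen=True)
class Token:
    text: str
    part: str


def tokenize(entity_type, tagged_words):
    """Build the part-of-name list, joining adjacent words with the same part."""
    allowed = PARTS_OF_NAME[entity_type]
    tokens = []
    for word, part in tagged_words:
        if part not in allowed:
            raise ValueError(f"{part!r} is not a part of name for {entity_type}")
        if tokens and tokens[-1].part == part:
            # Adjacent words with the same part are joined into one token.
            tokens[-1] = Token(tokens[-1].text + " " + word, part)
        else:
            tokens.append(Token(word, part))
    return tokens


university = tokenize(
    "Organization",
    [("University", "class"), ("of", "class"), ("Trento", "proper-name")],
)
print(university)
```

Validating the tag against the entity type catches annotation slips such as a given-name tag on a Location.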
Source Datasets
In this section we present the structure and the content of our dataset. The requirements for our dataset were:
- The dataset must contain complex names (at least three tokens), plus their variants/variations
- The dataset should account for different nationalities, i.e., different languages
- The dataset should contain entities of various types, especially Person, Location, Facility and Organization
Therefore, the following datasets are used as sources for our own dataset:
- JRC-Names, 89,625 Person Entities + 3,821 Organization Entities;
- World Gazetteer, 44,262 Location Entities (© by Stefan Helders, www.world-gazetteer.com);
- Open Data Piemonte, Medie-Grandi Strutture di Vendita, 8,051 Organization Entities.
We randomly chose around 600 entity names per entity type from these three sources, and we randomly generated the name variations by inserting misspellings and format variations into the original names, using the following heuristics:
- Misspellings were added to the names by adding, deleting, editing or switching characters
- Some tokens were deleted
- The position of tokens was changed
- Tokens were replaced by their initial letter.
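The four heuristics listed above can be sketched as small functions. This is not the generator actually used for the dataset, whose details are not given here: all function names, the restriction to lowercase ASCII letters for inserted characters, and the rule of applying one heuristic per variation are assumptions.

```python
import random
import string


def misspell(name, rng):
    """One random character edit: add, delete, edit (substitute) or switch."""
    if not name:
        return name
    i = rng.randrange(len(name))
    letter = rng.choice(string.ascii_lowercase)
    op = rng.choice(["add", "delete", "edit", "switch"])
    if op == "add":
        return name[:i] + letter + name[i:]
    if op == "delete":
        return name[:i] + name[i + 1:]
    if op == "edit":
        return name[:i] + letter + name[i + 1:]
    # switch: swap with the right neighbour (left neighbour at the end)
    j = i + 1 if i + 1 < len(name) else i - 1
    chars = list(name)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


def drop_token(tokens, rng):
    """Delete one token, keeping at least one."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]


def swap_tokens(tokens, rng):
    """Change token positions by swapping two of them."""
    result = list(tokens)
    if len(result) >= 2:
        i, j = rng.sample(range(len(result)), 2)
        result[i], result[j] = result[j], result[i]
    return result


def abbreviate_token(tokens, rng):
    """Replace one token by its initial letter."""
    result = list(tokens)
    if result:
        i = rng.randrange(len(result))
        result[i] = result[i][0]
    return result


def make_variation(name, seed=None):
    """Apply one randomly chosen heuristic to the name."""
    rng = random.Random(seed)
    transform = rng.choice([misspell, drop_token, swap_tokens, abbreviate_token])
    if transform is misspell:
        return misspell(name, rng)
    return " ".join(transform(name.split(), rng))


print(make_variation("Petra Maria Ulmanen", seed=1))
```

Passing a seed makes a generated variation reproducible. Some edits can occasionally leave the name unchanged, for instance when a character is substituted by itself.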
We manually split the names into tokens, and tagged each token with the part of name it represents.
The dataset is divided into 3 sub-datasets, one for each entity type; Table 3 summarizes their characteristics.
| | Person | Location | Organization | TOTAL |
---|---|---|---|---|
Entities | 500 | 635 | 566 | 1701 |
Names | 2019 | 2044 | 1895 | 5958 |
Name Variants | 673 | 681 | 566 | 1920 |
Name Variations | 1346 | 1363 | 1129 | 3838 |
Average Tokens (per name) | 1.85 | 1.07 | 1.73 | 1.49 |
Ontology
We designed an ontology for representing the relations between entities, names and their tokens. We reused existing vocabulary from other ontologies as much as possible, but we had to add some new terms. In particular we defined our entity types (Person, Location and Organization), and the token types which were not yet defined (middle name, proper name, class, qualifier, scope).
We also need to represent the relation between a name and its variants and variations. The variant relation is represented using swandr:alternativeTo, while a name variation is connected to its original name with owl:sameAs.
Here is the list of the main ontologies that we use to design the dataset:
Dataset for Name Match and Search Evaluation
The dataset also provides data for evaluating three functions: name matching, name search, and autocompletion name search. The input for each of these functions is in a separate ontology file.
The input for the match function is a list of match pairs and their expected results, e.g. 'New York', 'Neww Yorrk', true. In the ontology a positive match (expected result is true) is represented with the owl:sameAs relation, while for a negative match (expected result is false) we use owl:differentFrom.
Example:
nameDS:A103374 foaf:name "Petra Ulmanen" .
nameDS:C103374 foaf:name "Petra Uqrlmanen" .
nameDS:C125338 foaf:name "Onilo S" .
nameDS:A103374 owl:differentFrom nameDS:C125338 ;
    owl:sameAs nameDS:C103374 .
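A match function could be scored against such pairs as sketched below. The two pairs are taken from the example above; the baseline matcher (difflib ratio with a 0.8 threshold) and the function names are assumptions for illustration, not part of the dataset.

```python
from difflib import SequenceMatcher

# (name, other name, expected result) built from the example above:
# owl:sameAs gives True, owl:differentFrom gives False.
PAIRS = [
    ("Petra Ulmanen", "Petra Uqrlmanen", True),
    ("Petra Ulmanen", "Onilo S", False),
]


def baseline_match(a, b, threshold=0.8):
    """Assumed baseline: case-insensitive difflib similarity above a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def evaluate(match_fn, pairs):
    """Precision, recall and accuracy of match_fn over labelled pairs."""
    tp = fp = fn = tn = 0
    for a, b, expected in pairs:
        predicted = match_fn(a, b)
        if predicted and expected:
            tp += 1
        elif predicted and not expected:
            fp += 1
        elif expected:
            fn += 1
        else:
            tn += 1
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }


print(evaluate(baseline_match, PAIRS))
```

Since positive and negative pairs are both provided, a matcher that always answers true is penalized on precision rather than rewarded on recall alone.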
For the name search function, we provide for each name a subset of the queries that should be able to retrieve it. Similarly to name match pairs, the relation between the name and its queries is represented with owl:sameAs. Example:
nameDS:A103374 owl:sameAs "Petr aUlmanekn" , "Petra Uqrlmanen" .
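One possible way to score a search function with these queries is the mean reciprocal rank of the expected name. In the sketch below the queries come from the example above, while the identifiers X1 and X2, the ranking function and the choice of metric are assumptions for illustration.

```python
from difflib import SequenceMatcher

# Toy index: identifier -> name. X1 and X2 are hypothetical identifiers.
INDEX = {
    "A103374": "Petra Ulmanen",
    "X1": "Garda Lake",
    "X2": "University of Trento",
}

# Expected identifier -> queries that should retrieve it (from the example).
QUERIES = {"A103374": ["Petr aUlmanekn", "Petra Uqrlmanen"]}


def rank(query, index):
    """Identifiers ordered by decreasing similarity to the query."""
    def score(identifier):
        name = index[identifier]
        return SequenceMatcher(None, query.lower(), name.lower()).ratio()
    return sorted(index, key=score, reverse=True)


def mean_reciprocal_rank(rank_fn, queries, index):
    """Average of 1 / position of the expected identifier over all queries."""
    reciprocal_ranks = []
    for expected, query_list in queries.items():
        for query in query_list:
            ranking = rank_fn(query, index)
            position = ranking.index(expected) + 1
            reciprocal_ranks.append(1.0 / position)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


print(mean_reciprocal_rank(rank, QUERIES, INDEX))
```

A value of 1.0 means every query returned its expected entity in first position.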
The name autocompletion function (also called prefix search) differs from the previous name search by the expected type of input: name search assumes that the input is a "complete" query, while autocompletion expects only the first part of a name (the user is still writing the query). Therefore, we listed, for each name, the possible prefixes (autocompletion queries) that should retrieve it. The relation between a name and its prefix is owl:sameAs. Example:
nameDS:A103374 owl:sameAs "ulm" , "petra" , "ulma" , "ulman" , "petr" , "ul" , "pe" , "ulmane" , "ulmanen" , "pet" .
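Judging from this example, the prefixes appear to be the lowercased prefixes of each token, starting from two characters. The sketch below reproduces the example under that assumption; the rule is inferred from the single example above and is not a documented specification, and the function name is hypothetical.

```python
def prefixes(name, min_len=2):
    """Lowercased prefixes (at least min_len characters) of every token."""
    result = set()
    for token in name.lower().split():
        for end in range(min_len, len(token) + 1):
            result.add(token[:end])
    return result


print(sorted(prefixes("Petra Ulmanen")))
```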
Table 4 summarizes the characteristics of the resulting 3 datasets.
| | Person | Location | Organization | TOTAL |
---|---|---|---|---|
Match Pairs (positive) | 2000 | 2000 | 1621 | 5621 |
Match Pairs (negative) | 2000 | 2000 | 2000 | 6000 |
Search Queries | 1342 | 1331 | 1104 | 3777 |
Autocompletion Queries | 4122 | 4650 | 6653 | 15425 |
Browsing the Dataset
The dataset is serialized in Turtle format. Downloadable files:
- ontology.rdf: ontology definition for the dataset
- Person dataset person.zip
- Location dataset location.zip
- Organization dataset organization.zip
Each zip archive contains 4 files:
- [type].ttl: file containing all the names and their variations for that entity type (all names are provided with their tokenization),
- [type]_match.ttl: file containing match pair list,
- [type]_search.ttl: file containing name search queries,
- [type]_prefix.ttl: file containing name autocompletion search queries.
License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
Authors
This dataset was conceived, designed, annotated and distributed by:
For any request relating to this dataset, please contact pane@disi.unitn.it.