Research Article

A deep neural network model for Chinese toponym matching with geographic pre-training model

Article: 2353111 | Received 06 Dec 2023, Accepted 04 May 2024, Published online: 13 May 2024

ABSTRACT

Multiple tasks within the fields of geographical information retrieval and the geographical information sciences necessitate toponym matching, which involves the challenge of aligning toponyms that share a common referent. String similarity approaches struggle when confronted with the complexities associated with unofficial and/or historical variants of identical toponyms. Moreover, current state-of-the-art supervised machine learning approaches rely on labeled samples and do not adequately address the character replacements arising from transliteration or from historical shifts in linguistic and cultural norms. To address these issues, this paper proposes a novel matching approach that leverages a deep neural network model empowered by a geographic language representation model known as GeoBERT, i.e. geographic Bidirectional Encoder Representations from Transformers (BERT). The model harnesses the pre-training capabilities of the GeoBERT framework, extends a generalized Enhanced Sequential Inference Model architecture, and integrates multiple features to enhance the accuracy and robustness of toponym matching. We present a comprehensive evaluation of the proposed method’s performance using three extensive datasets. The findings clearly illustrate that our approach outperforms the individual similarity metrics used in previous studies.

1. Introduction

Addresses serve as a pivotal foundational asset for the intelligent development of urban areas. Thorough exploration and application of address data give urban planners a more precise understanding of a city’s developmental trajectory and requirements, empowering them to optimize the spatial arrangement of the city and effectively distribute public services, thereby enhancing overall urban efficiency (Alsudais, Alotaibi, and Alomary Citation2022; Hu et al. Citation2022a; Ma Citation2022; Mauro, Ardissono, and Lucenteforte Citation2020; Qiu et al. Citation2022a; Santos, Murrieta-Flores, and Martins Citation2018a; Citation2018b; Zhang et al. Citation2022). Consequently, in various applications closely related to Geographic Information Retrieval (GIR) and Geographic Information Systems (GIS), place name matching has become an inherent challenge (Hu et al. Citation2022b; Citation2022c; Hu et al. Citation2023a; Citation2023b; Qiu et al. Citation2022b; Qiu et al. Citation2023). These applications include the merging of digital gazetteers or datasets related to points of interest, the parsing of addresses in geocoding and map search services, and toponym resolution concerning textual content, digitized maps, and digital library resources (Berkhin et al. Citation2015; Comber and Arribas-Bel Citation2019; Li et al. Citation2020a; Citation2020b; Moura, Davis Jr, and Fonseca Citation2017; Santos, Anastácio, and Martins Citation2015). Various tasks within the realm of GIR and the broader field of GIS necessitate toponym matching, which involves aligning toponyms that possess a shared referent, such as instances where both ‘Wuhan’ and ‘Jiangcheng’ are employed to denote the city of Wuhan, China. Within the aforementioned array of tasks, toponym matching assumes a crucial role in mitigating the ambiguity inherent in the usage of different names to denote a single geographical location. It is worth noting, however, that alternative techniques are frequently investigated to address instances where identical names are employed for distinct places. Although the primary objective of toponym matching is to recognize various names that refer to the same entity, additional contextual details are frequently analyzed to resolve ambiguity when multiple instances of a particular place name are encountered.

Typical approaches to matching toponyms have predominantly relied on string matching techniques, assessing character similarities (Santos, Murrieta-Flores, and Martins Citation2018a; Winkler Citation1990). Noteworthy examples include function-based algorithms, full-word matching methods, and fuzzy query approaches (Li et al. Citation2021; Li et al. Citation2022; Li, Feng, and Chiu Citation2023; Moreau, Yvon, and Cappé Citation2008; Recchia and Louwerse Citation2013; Varol and Bayrak Citation2012). Furthermore, some approaches apply spatial geometry to place name matching, constructing models that incorporate address features such as distance, location, and other spatial geometric characteristics. However, this strategy necessitates numerous geometric operations, thereby resulting in relatively low execution efficiency (Santos, Murrieta-Flores, and Martins Citation2018a). Presently, the prevailing research trend in the domain of place name matching algorithms predominantly revolves around the utilization of deep learning models.

Recent advancements in the field have brought forth a remarkable way of combining several string similarity measures, namely supervised machine learning techniques such as deep learning. This methodology, as proposed by Santos et al. (Citation2018a), has demonstrated its effectiveness in addressing the intricacies of toponym matching. However, it is important to acknowledge that even the combined application of diverse string similarity metrics may fall short in comprehensively tackling certain arduous challenges inherent in this domain. For example, effectively managing toponym variants that are unofficial and/or historical, which may exhibit significant differences, presents a notable challenge. Similarly, managing transliterations involving different languages and scripts presents another complex facet. The primary reason for this is the constraints of current string similarity metrics, which rely on common character substrings to establish semantic similarity. As a consequence, these metrics inadequately capture the substitutions commonly observed in transliterations and ignore the evolution of language use across time. Meanwhile, deep learning has witnessed a surge in its utilization within the domain of GIS. Firstly, deep learning facilitates the automatic modeling of address features, circumventing the need for manually designed rules; consequently, the reliance on predefined heuristics is significantly reduced. Secondly, deep learning excels in extracting semantic information, rendering it well-suited for diverse address structures, particularly those characterized by irregularities. As a result, deep learning techniques prove to be highly effective in addressing the challenges posed by such varied address formats.

This paper proposes GeoBERTTM, an innovative deep learning model built on GeoBERT and designed specifically for toponym matching in Chinese natural language texts. The proposed deep learning approach entails a two-step process. Initially, the GeoBERT model is harnessed to train and acquire word vectors for address elements, thereby transforming input address records into their respective vector representations. Subsequently, the enhanced sequential inference model (ESIM), one of the state-of-the-art deep text-matching approaches, is leveraged to conduct local and global inference between the compared address records in vector format, determining whether they match. The model leverages the geographic pre-training capabilities of the GeoBERT framework. By extending a generalized recurrent neural network model, our approach integrates multiple features to enhance the accuracy and robustness of the toponym matching process. Significantly, this model integrates innovative improvements to proficiently address the inherent irregularities encountered in natural language texts.

Section 2 provides an extensive overview of fundamental concepts and prior research. It commences with a comprehensive introduction to widely adopted string similarity metrics, followed by an in-depth analysis of prior research addressing the challenge of toponym matching through machine learning and deep learning approaches. In Section 3, the proposed deep learning methodology is explicated, highlighting the details of the neural network architecture employed for assessing the similarity between pairs of toponyms. Section 4 outlines the experimental evaluation protocol, encompassing the datasets utilized, general experimental procedures, and the evaluation metrics employed; the obtained results are then presented and thoroughly discussed. Section 5 presents a detailed discussion. Section 6 offers a summary of the most noteworthy findings, along with potential directions for future research.

2. Related work

2.1. Previous research on string matching

Most of the previous studies on toponym matching have primarily depended on methodologies that revolve around calculating a similarity measure between the to-be-matched toponyms. Subsequently, a determination is made based on a predefined threshold applied to the resulting similarity score. Traditional approaches to toponym matching center around employing string similarity metrics like the Levenshtein edit distance (Levenshtein Citation1966) and the Jaro-Winkler metric (Winkler Citation1990), alongside the subsequent establishment of manual thresholds or tailored classifiers (Sun et al. Citation2013). These techniques primarily concentrate on assessing textual associations between address records, predominantly through simplistic word-for-word or character-for-character comparisons. Generally, these approaches can be categorized into three distinct groups: methods based on characters (e.g. Levenshtein Citation1966; Sun et al. Citation2013; Winkler Citation1990), methods based on vector space (e.g. Eidoon, Yazdani, and Oroumchian Citation2008; Li, Wang, and Mei Citation2010), and hybrid methods (e.g. Santos et al. 2017; Cheng, Liao, and Chen Citation2022). Character-based techniques primarily function by utilizing edit operations at the character level. Conversely, vector-space methods enable the conversion of strings into vector representations, on which similarity calculations are performed. Hybrid approaches ingeniously amalgamate both strategies to enhance the efficacy of matching names comprised of multiple tokens. Presently, deep learning has emerged as the prevailing method for toponym matching, superseding other alternatives.

Character-based methods involve the evaluation of character edit distances to measure similarity. Among these methods, the Levenshtein edit distance (Levenshtein Citation1966) stands out as a widely recognized and esteemed approach. Buckles, Buckley, and Petry (Citation1994) conducted a series of operations, namely stem extraction, alignment, and synonym substitution, on strings, subsequently employing fuzzy retrieval for address string matching at a predetermined precision level. Amir et al. (Citation2009) presented an effective algorithm that addresses natural rearrangement errors by accommodating cases of matching with content consistency despite incorrectly positioned characters. However, these methods typically involve character-level comparisons, which can become inefficient when handling extensive datasets. Consequently, they are often combined with other techniques to enhance their performance.

In the realm of vector-space methodologies, a commonly employed strategy entails computing the cosine similarity metric between representations constructed using character n-grams, which are consecutive sequences of n characters. Eidoon, Yazdani, and Oroumchian (Citation2008) developed an ontology model based on vector space representation and achieved ontology similarity estimation by comparing concept vectors; this approach was successfully applied to the 2005 ontology alignment assessment. In a related study, Li, Wang, and Mei (Citation2010) introduced a matching algorithm based on Term Frequency-Inverse Document Frequency (TF-IDF), utilizing the feature weight calculation method commonly employed in the vector space model. To address the challenge of matching and rectifying addresses across distinct standardized address databases, they proposed a matching algorithm called Term-Weighted Dissimilarity. However, it is worth noting that when handling extensive datasets, the vector space model can potentially lead to issues such as the curse of dimensionality or matrix sparsity, as each dimension represents a word or feature.
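
To make the vector-space idea concrete, the following is a minimal sketch (not tied to any of the cited systems) of computing the cosine similarity between two toponyms represented as character n-gram count vectors:

```python
from collections import Counter
import math

def char_ngrams(s, n=2):
    """Character n-grams of a string, e.g. 'Wuhan' -> ['Wu', 'uh', 'ha', 'an']."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def cosine_similarity(s1, s2, n=2):
    """Cosine similarity between character n-gram count vectors."""
    v1, v2 = Counter(char_ngrams(s1, n)), Counter(char_ngrams(s2, n))
    dot = sum(v1[g] * v2[g] for g in v1.keys() & v2.keys())
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("East Road 12", "East Rd 12"))
```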

Hybrid methodologies that integrate both of these approaches have emerged, offering a flexible treatment of word order and position while accommodating slight variations in word tokens. In their study, Santos et al. (2017) conducted experiments utilizing supervised machine learning to effectively merge multiple similarity metrics. This approach offers a notable advantage by eliminating the need for manual tuning of similarity thresholds. Similarly, Koumarelas et al. (Citation2018) proposed a method that enriches address matching through a combination of geocoding, reverse geocoding, and similarity metrics. By augmenting each record with comprehensive address information and selecting suitable similarity metrics for individual address attributes, they aimed to enhance classifier performance in terms of F-values. Cheng, Liao, and Chen (Citation2022) employed a multifaceted approach by integrating various similarity metrics, including string similarity, semantic similarity, and spatial similarity. This comprehensive strategy was pursued to enhance the accuracy of address matching.

While traditional string-based matching techniques have successfully addressed numerous challenges concerning toponymic matching in earlier stages, the current era of big data and the proliferation of heterogeneous data from multiple sources present new complexities. In this context, traditional methods often prove inadequate when confronted with non-standard or intricately structured data. Moreover, they fail to adequately capture toponymic variations caused by phonetic translations or alterations in language and culture over time (Santos et al. Citation2018a, Citation2018b; Qiu et al. Citation2022a; Citation2022b). Furthermore, although these methods have exhibited a certain level of effectiveness, they necessitate meticulous calibration of the similarity threshold. It has become evident that no single technique universally outperforms others.

2.2. Previous research on toponym matching based on machine learning

In recent years, advancements in machine learning have generated notable breakthroughs in natural language processing (NLP) across various tasks, including address matching, a task that relies on fundamental principles within the realm of NLP. The deep learning models that serve as the foundation for address matching comprise the long short-term memory (LSTM) architecture (Cheng, Dong, and Lapata Citation2016), the BERT framework (Devlin et al. Citation2018), and the ESIM approach (Chen et al. Citation2016). These cutting-edge frameworks have demonstrated significant potential in addressing the intricacies of address matching and hold promise for further advancements in this field.

In order to facilitate embedding processes, existing deep learning matching models require the extraction of semantic features related to addresses. This involves various techniques, such as utilizing word2vec to vectorize address text and capture semantic information (Lin et al. Citation2020), or utilizing encoder-decoder architectures to obtain a semantic vector representation for every address string (Shan et al. Citation2020). However, it is important to consider that these approaches may limit the comprehension of address semantics. To address this issue, Zhang et al. (Citation2020) utilized a BERT model to capture contextual information within addresses, enabling effective processing of address features. They further incorporated a conditional random field (CRF) model to model the constraint relationships between labels, facilitating accurate prediction of matching outcomes. Chen et al. (Citation2021) proposed an innovative address matching model, known as the Attention-BiLSTM-CNN network (ABLC), with a contrastive learning approach. By combining the contextual semantic learning capabilities of the bidirectional LSTM (BiLSTM), the feature extraction advantages of the convolutional neural network (CNN), and the dynamic weight assignment ability of attention mechanisms, the ABLC model significantly enhances matching performance compared to baseline models. Furthermore, Xu et al. (Citation2022) introduced a deep transfer learning approach to enhance the understanding of address semantics. Their method involves pre-training on an address corpus and enabling the address semantic model (ASM) to learn address context in an unsupervised manner, thereby improving its capability to comprehend address semantics effectively.

Santos et al. (2018) introduced an innovative approach for toponym matching, utilizing a deep neural network to discern whether pairs of toponyms correspond or not. The authors conducted an extensive evaluation of this method’s effectiveness, employing a substantial dataset sourced from the GeoNames gazetteer (https://www.geonames.org/). Remarkably, the findings demonstrate that the proposed approach yields notably superior results compared to individual similarity metrics employed in previous studies, as well as surpassing previous supervised machine learning methods for combining multiple metrics.

Supervised machine learning methods, especially those employing ensembles of decision trees, have demonstrated notable performance by effectively combining similarity metrics (Santos et al. 2017). Nevertheless, it is important to acknowledge that the complexity of Chinese addresses surpasses that of their English counterparts in terms of syntax and semantics. Unlike English, which relies on words and includes delimiters between them, Chinese addresses are character-based, lacking such delimiters. Moreover, due to rapid urbanization and unrestricted urban planning practices, address descriptions exhibit significant inconsistencies (Qin et al. Citation2016). For instance, a study investigating address discrepancies in Beijing identified 35 errors (Li et al. Citation2018), while Chongqing’s address data and model exhibited three distinct categories of mistakes. These challenges have necessitated the development of innovative approaches, as detailed in this article.

3. The proposed methodology

3.1. Overall framework

The overall framework of the methodology proposed in this paper is presented in Figure 1. This framework encompasses two distinct stages. Firstly, the BERT model undergoes pre-training on vast-scale geo-domain texts to acquire distributed embedding representations for characters (the resulting model is named GeoBERT). Subsequently, the text undergoes representation mapping using a character-to-feature mapping table, which includes Pinyin, Wubi, Radicals, and Strokes. The word2vec model is then utilized to capture contextual information and generate distributed embedding representations for each feature. In the second stage, semantic toponym matching training is conducted on the ESIM using the feature vectors generated by GeoBERT.

Figure 1. The overall framework of the methodology proposed in this paper, which includes two steps: obtaining vector representations of input toponyms, and a deep learning model for toponym pair matching. In Step 1, the vectors of toponym elements derived from the processed toponym corpus are trained using GeoBERT, and the labeled toponym dataset is transformed from text to vector format; in Step 2, the dataset is separated into three groups for training, development, and testing, and the ESIM is trained to evaluate the accuracy of the best-performing model.

3.2. Obtaining vector representations of toponym records

As mentioned in Section 1, the toponym pairs found in the annotated toponym dataset are exclusively given in textual format. Given that a toponym comprises a contiguous sequence of toponym elements, a common method for facilitating deep learning computations is to encode each component of the toponym using a distinct vector, thereby converting the toponym into its related vector representation (Figure 2). In this research, a toponym element comprises the name of a toponym entity (e.g. ‘Wuhan’), a feature of the toponym model (e.g. ‘District’), or their various combinations.

Figure 2. Creating an address’s vector representation by combining multiple features (e.g. Wubi, Radicals, Pinyin, and Strokes).

Word2vec facilitates the encoding of all conceivable toponym elements encompassed within the toponym dataset. Additionally, a fusion of word vectors obtained from word2vec can depict each toponym entry. However, it is crucial to acknowledge that word2vec captures the semantic significance of words according to their conventional contexts of usage. Consequently, it provides static representations of words and relies on training with a meticulously formatted text corpus. As a result, numerous colloquial expressions may not be adequately covered by these embeddings. In such instances, vernacular terms are often represented by a generic unknown-token embedding, leading to a potential loss of the precise semantics associated with the respective words.

In order to tackle this issue, we leveraged BERT, a neural network architecture, to train and obtain vector representations for toponym elements. BERT excels in its bidirectionality, as it learns the contextual information of a word by considering its neighboring words (both left and right). The existing BERT language model is trained on general-domain corpora such as Wikipedia and the Books corpus. However, when working on domain-specific tasks, it is important to recognize that the target domain dataset’s data distribution may differ from that of the BERT model. Therefore, we further conducted pre-training on the BERT-base language model, which we refer to as GeoBERT, utilizing a geographic domain corpus. The objective was to capture the contextual significance of toponyms and improve performance in the toponym matching task within the target domain. To enable the pre-training process for the BERT-base language model, we prepared a corpus consisting of 18,000,000 sentences extracted from social media texts and abstracts obtained from the training dataset. Prior to corpus preparation, we performed sentence segmentation and normalization using the Jieba library (https://github.com/fxsjy/jieba). Subsequently, we generated five different corpora through randomized processing for additional pre-training of the BERT model. Every generated document consists of tokens, segment IDs, masked labels, masked label positions, and a flag indicating the randomization of the next segment in the masked language model. GeoBERT provides a dynamic representation for each word by considering its usage within the corresponding sentence. This empowers the model to capture more profound semantic and syntactic implications of words in the target domain, thereby improving its performance on specific tasks within that domain.
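
As an illustration of this pre-training stage, the following is a minimal sketch of continuing the pre-training of a Chinese BERT-base model on a geographic corpus with the masked language model objective, using the Hugging Face transformers library; the file name geo_corpus.txt and all hyperparameters are illustrative assumptions, not the exact configuration used in this paper:

```python
# Continued masked-language-model pre-training to obtain a GeoBERT-style model.
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# One segmented sentence per line (e.g. produced with Jieba beforehand).
corpus = load_dataset("text", data_files={"train": "geo_corpus.txt"})["train"]
corpus = corpus.map(lambda x: tokenizer(x["text"], truncation=True, max_length=128),
                    batched=True, remove_columns=["text"])

# Randomly masks 15% of tokens, the standard masked language model objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="geobert", num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=corpus,
    data_collator=collator,
)
trainer.train()
model.save_pretrained("geobert")  # reused later to embed toponym elements
```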

Distinct from English, Chinese words are typically formed by characters, many of which can be further broken down into components like radicals. Characters and radicals possess abundant information and have the ability to convey semantic meanings of words. However, existing word embedding methods have not fully harnessed these resources. To address this limitation, multiple pre-trained word embeddings, including Pinyin, Wubi, Radicals, and Strokes, are employed to represent the words encompassed within a toponym. These pre-trained embeddings have undergone prior training and are capable of capturing the semantic essence of a word based on its typical co-occurrence patterns with other words.

Radicals have proven to be instrumental in extracting potential semantic relationships from unstructured textual data (Yin et al. Citation2016a, Citation2016b). This allows neural networks to acquire knowledge regarding the distinct boundaries of internal and external entities within the domain of geography by analyzing the structural composition of various Chinese characters. For instance, a character bearing the radical ‘氵’, which has both semantic and structural resemblance to ‘river’, exhibits a higher likelihood of being classified as such. To further refine the toponym classification process, we incorporate Wubi features that complement the radical features; both Radicals and Wubi draw upon the hieroglyphic structure of Chinese characters. Additionally, variations in pronunciation significantly contribute to the semantic expression of these characters, prompting the inclusion of character Pinyin features to enhance the sequence-based attributes (Yu et al. Citation2017). In the geographic domain, Chinese characters with similar complexities in stroke formation tend to share common category labels (Cao et al. Citation2018; Su and Lee Citation2017). In this study, we adopt the positional embedding concept from BERT to encode the stroke count corresponding to each input sequence character. Subsequently, these features are employed alongside the word2vec model to train and obtain the pertinent characteristics of the text. Finally, we concatenate the acquired features belonging to the four aforementioned categories to construct the ultimate vector for the embedding layer.
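
The following sketch illustrates how such a concatenated feature embedding could be assembled; the Wubi, radical, and stroke mapping tables shown are hypothetical excerpts, Pinyin comes from the pypinyin package, and the 25-dimensional channel size is an assumption for illustration only:

```python
# Building a per-character embedding from four feature channels.
import numpy as np
from gensim.models import Word2Vec
from pypinyin import lazy_pinyin

wubi_table = {"江": "iag", "汉": "ic"}      # hypothetical excerpts of the
radical_table = {"江": "氵", "汉": "氵"}     # character-to-feature mapping tables
stroke_table = {"江": 6, "汉": 5}

def char_features(ch):
    """Map one character to its four surface-feature tokens."""
    return [lazy_pinyin(ch)[0],
            wubi_table.get(ch, "<unk>"),
            radical_table.get(ch, "<unk>"),
            str(stroke_table.get(ch, 0))]

# Train one small word2vec model per feature channel over the toponym corpus.
corpus = ["江汉区", "汉口江滩"]
channels = []
for i in range(4):
    sentences = [[char_features(ch)[i] for ch in toponym] for toponym in corpus]
    channels.append(Word2Vec(sentences, vector_size=25, min_count=1, window=2))

def embed(toponym):
    """Concatenate the four feature embeddings per character (4 x 25 = 100 dims)."""
    return np.stack([np.concatenate([channels[i].wv[char_features(ch)[i]]
                                     for i in range(4)]) for ch in toponym])

print(embed("江汉区").shape)  # (3, 100)
```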

3.3. Deep semantic matching of toponym pairs

Following the acquisition of vector representations for pairs of toponyms, we utilized the ESIM (Chen et al. Citation2016) for semantic toponym matching. The ESIM is a well-established deep learning model commonly employed for text matching using interaction-based methodologies (Fan et al. Citation2017; Lin et al. Citation2020). In our study, we applied the ESIM to perform local inference between toponym pairs and subsequently aggregated this localized inference to generate an overall prediction. Throughout model training, this approach comprehensively considers interactions between related toponym items and their contexts in the two toponym records under comparison. An overview of the implementation of the ESIM can be found in Figure 3. Moreover, a comprehensive elucidation of each layer within the ESIM is provided in the subsequent sections.

Figure 3. An overview of the ESIM (Chen et al. Citation2016). The ESIM includes four layers: an input encoding layer for encoding the input toponym vectors and extracting higher-level representations of toponym records, a local inference modeling layer for making local inference on a toponym pair, an inference composition layer for making global inference between two compared toponym records, and a prediction layer for producing predictive results for toponym pairs using an MLP.

3.3.1. Input encoding layer

Let us consider that $S_a$ (consisting of $l_a$ elements) and $S_b$ (consisting of $l_b$ elements) are two toponyms from the labeled toponym dataset. In order to facilitate further analysis, we converted $S_a$ and $S_b$ into corresponding toponym vector sequences, referred to as $a = (a_1, \ldots, a_{l_a})$ and $b = (b_1, \ldots, b_{l_b})$, respectively.

A toponym in text format, being a distinctive form of sequential data, necessitates appropriate modeling to uncover semantic associations and interdependencies among its constituent address elements during the toponym parsing process. While an LSTM model (Hochreiter and Schmidhuber Citation1997) can capture geographic element dependencies following the reading order of the toponym text, it fails to consider the semantic associations among toponym elements in the reverse reading order. To address this limitation, this study adopts a BiLSTM network model to extract potential semantic associations within address text. The BiLSTM network model comprises two LSTM layers, establishing connections with both the input and output layers. By training the model on input toponym character vectors in forward and backward sequences, it utilizes hidden states and gating units to capture contextual dependencies. Afterwards, the ultimate hidden states are calculated by merging the outputs of the forward and backward LSTM networks to produce a new encoding vector $\bar{a}_i$ or $\bar{b}_j$, yielding character vectors arranged in both forward and backward sequences. Following this process, the score for each character in each label is obtained through concatenation and arithmetic operations.
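
A minimal PyTorch sketch of this input encoding layer is given below, assuming the concatenated 100-dimensional feature vectors from Section 3.2 as input; the hidden size of 300 follows the setting later selected in Section 4.6:

```python
# Input encoding layer: a BiLSTM over the per-character feature vectors.
import torch
import torch.nn as nn

class InputEncoder(nn.Module):
    def __init__(self, input_dim=100, hidden_dim=300):
        super().__init__()
        # The bidirectional LSTM reads the toponym in both directions.
        self.bilstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, x):
        # x: (batch, seq_len, input_dim) -> (batch, seq_len, 2 * hidden_dim),
        # forward and backward hidden states concatenated per character.
        out, _ = self.bilstm(x)
        return out

encoder = InputEncoder()
a = torch.randn(8, 12, 100)   # a batch of 8 toponyms, 12 characters each
a_bar = encoder(a)            # encoded representations, i.e. the \bar{a}_i
print(a_bar.shape)            # torch.Size([8, 12, 600])
```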

3.3.2. Local inference modeling layer

Within this layer, we leverage a modified decomposable attention model to conduct local inference on a pair of toponyms, encompassing three pivotal stages. First, we compute the unnormalized attention weight matrix $e_{ij} = \bar{a}_i^{\top} \bar{b}_j$, which captures the local inference between the two compared toponym records. Subsequently, we normalize the attention weights through the softmax function and use them to compute a weighted sum of the $\bar{b}_j$, generating the output $\tilde{a}_i$; a parallel procedure in the reverse direction yields $\tilde{b}_j$. Lastly, the local inference of $S_a$ with respect to $S_b$ is aggregated by concatenating $\bar{a}$, $\tilde{a}$, their difference, and their element-wise product, resulting in $m_a = [\bar{a}; \tilde{a}; \bar{a} - \tilde{a}; \bar{a} \odot \tilde{a}]$; $m_b$ is obtained analogously by the reverse process.
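
Continuing the PyTorch sketch above, the local inference step can be written as follows, where a_bar and b_bar denote the BiLSTM encodings of the two compared toponyms:

```python
# Local inference modeling: soft alignment plus the ESIM enhancement.
import torch
import torch.nn.functional as F

def local_inference(a_bar, b_bar):
    # e[i, j] = a_bar_i . b_bar_j : unnormalized attention weights.
    e = torch.bmm(a_bar, b_bar.transpose(1, 2))
    # Soft alignment: each a_i attends over all b_j, and vice versa.
    a_tilde = torch.bmm(F.softmax(e, dim=2), b_bar)
    b_tilde = torch.bmm(F.softmax(e, dim=1).transpose(1, 2), a_bar)
    # Enhancement: concatenate value, aligned value, difference, element-wise product.
    m_a = torch.cat([a_bar, a_tilde, a_bar - a_tilde, a_bar * a_tilde], dim=-1)
    m_b = torch.cat([b_bar, b_tilde, b_bar - b_tilde, b_bar * b_tilde], dim=-1)
    return m_a, m_b
```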

3.3.3. Inference composition layer

Within this layer, we leverage the prior local inferences of two compared toponym records to facilitate global inference. This process involves two crucial steps. Firstly, the BiLSTM model is used to derive higher representations of $m_a$ and $m_b$, resulting in $v_{a,i}$ and $v_{b,j}$ as their respective outputs. Secondly, the $v_{a,i}$ and $v_{b,j}$ vectors are subjected to both maximum and average pooling procedures. The outcomes of these pooling methods are concatenated together, effectively summarizing the local inferences. Ultimately, a final vector $v_{\mathrm{final}}$ of fixed length is obtained, encapsulating the consolidated information from the pooling step.
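
A sketch of this composition step, continuing the illustrative dimensions used above (the enhanced features m_a and m_b are 4 x 600 = 2400-dimensional per character):

```python
# Inference composition: a second BiLSTM re-reads m_a and m_b, then average
# and max pooling summarize each toponym into a fixed-length vector.
import torch
import torch.nn as nn

composer = nn.LSTM(input_size=2400, hidden_size=300, batch_first=True,
                   bidirectional=True)

def compose(m_a, m_b):
    v_a, _ = composer(m_a)
    v_b, _ = composer(m_b)
    pooled = [v_a.mean(dim=1), v_a.max(dim=1).values,
              v_b.mean(dim=1), v_b.max(dim=1).values]
    # v_final has fixed length (4 x 600 = 2400) regardless of toponym length.
    return torch.cat(pooled, dim=-1)
```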

3.3.4. Prediction layer

In this layer, an MLP is employed to generate predictive outcomes for toponym pairs, where the labels are assigned as either 0 or 1. The MLP utilized in this study comprises three fully connected layers, equipped with ReLU, Tanh, and Softmax activation functions, respectively.
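
A corresponding sketch of the prediction layer, with the three fully connected layers and activations named above (the layer widths are illustrative and follow the dimensions of the composition sketch):

```python
# Prediction layer: map v_final to match / non-match probabilities.
import torch.nn as nn

predictor = nn.Sequential(
    nn.Linear(2400, 300), nn.ReLU(),
    nn.Linear(300, 300), nn.Tanh(),
    nn.Linear(300, 2), nn.Softmax(dim=-1),  # P(non-match), P(match)
)
```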

4. Experimental results and discussion

In this section, we present the dataset employed to bolster the experimental analyses, outline the experimental setup implemented, and thoroughly discuss the resultant findings. The dataset plays a pivotal role in substantiating the validity and efficacy of our proposed methodologies, thereby facilitating comprehensive evaluations. By describing the experimental setup, we aim to provide readers with a clear understanding of the systematic procedures and parameters adopted throughout the study. Finally, the obtained results are critically analyzed and interpreted, shedding light on the implications and significance of our research findings.

4.1. Dataset

Our experiments rely on three datasets, each containing roughly 100,000 pairs of toponyms. The first dataset (named Alidataset) is from the CCKS2021 Chinese Address Correlation Academic Review Task (https://tianchi.aliyun.com/competition/entrance/531901/information), provided by the Alibaba DAMO Academy. The dataset comprises a collection of standardized addresses, addresses requiring matching, and corresponding labels. Notably, the address pairs within this dataset deviate from manually constructed conventional address pairs in several key aspects. Specifically, the address text exhibits variations, wherein no predetermined rewritten word list is used to construct heteronymous same-location addresses. Furthermore, the address specifications may not precisely align across different localities. Consequently, this dataset places elevated expectations on the place name matching model in terms of its generalizability. By harnessing this unique dataset, subsequent experiments can intricately assess the model’s performance while effectively verifying its applicability in real-world scenarios. The utilization of this dataset enables comprehensive evaluations that encompass a wide range of address variations and challenges, thereby enriching the model’s overall robustness and practical feasibility.

The second original dataset is available from the Shenzhen Address Database (named SZdataset) (doi: 10.5281/zenodo.3477007) (Lin et al. Citation2020). The corpus utilized in this study encompasses an extensive collection of Chinese address records, totaling 57,253,694 entries, which were updated throughout 2018. This comprehensive dataset effectively covers nearly all address data pertaining to the region of Shenzhen, China. To ensure the reliability and accuracy of the corpus, a series of rigorous data cleaning procedures were implemented, encompassing three fundamental steps. Firstly, any duplicate address records were identified and subsequently eliminated from the corpus. Secondly, extraneous elements such as omitted letters and unique symbols (e.g. ‘.’, ‘`’, and ‘/’) were removed to streamline the dataset further. Lastly, efforts were made to rectify instances where characters within the address records were incorrectly written. As a result of these meticulous data cleaning procedures, an exemplary dataset consisting of 84,474 address pairs and their corresponding labels was constructed.

We used the following steps to create a manually annotated dataset named HBdataset (see Table 1). We crawled the Hubei POI data from the web and then pre-processed these data. The detailed construction process of this dataset was as follows.

Table 1. The labeled address dataset examples of several data rows from HBdataset.

Step 1: Following the web data crawling process, it becomes essential to perform thorough data cleaning, which primarily encompasses two critical stages: noise processing and removal of excessively short address data (a data-preparation sketch covering Steps 1–3 follows Step 3 below). Noise processing involves the elimination of irrelevant characters and special symbols, such as spaces, ‘@’, ‘#’, and external links that do not contribute to the intended purpose of the data. This step effectively filters out extraneous noise present in the dataset. Additionally, eliminating excessively short address data entails evaluating the string length and establishing a predetermined threshold value to discard data entries that fall below this threshold. The address data obtained subsequent to the data cleaning procedure retains its primary characteristics, thereby mitigating the adverse impact of invalid information on the matching outcomes.

Step 2: A crucial aspect that necessitates attention involves the conversion of traditional Chinese characters within the crawled web address data. As these characters may not be contained within the existing vocabulary, they can potentially introduce semantic errors during the subsequent text vectorization process. To address this concern, the HanziConv library (https://pypi.org/project/hanziconv/) is employed to convert traditional Chinese text into simplified text, ensuring compatibility for subsequent vectorization processes.

Step 3: In order to facilitate the implementation of a supervised address matching algorithm, the generation of appropriately labeled training and test datasets becomes imperative. These datasets are composed of destination addresses (address1), addresses requiring matching (address2), and corresponding matching result labels. This systematic process ensures the availability of reliable and annotated data for algorithmic training and evaluation purposes.
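
The sketch below illustrates Steps 1–3 end to end; the noise patterns, the length threshold MIN_LEN, and the file layout are illustrative assumptions rather than the exact pipeline used here, while the HanziConv call performs the traditional-to-simplified conversion described in Step 2:

```python
# Data preparation for the HBdataset (illustrative sketch of Steps 1-3).
import csv
import re
from hanziconv import HanziConv

MIN_LEN = 6  # hypothetical threshold for "excessively short" addresses

def clean_address(raw):
    """Step 1: strip links, spaces, '@', '#'; discard excessively short strings."""
    s = re.sub(r"https?://\S+", "", raw)
    s = re.sub(r"[\s@#]", "", s)
    if len(s) < MIN_LEN:
        return None
    # Step 2: normalize traditional characters to simplified ones.
    return HanziConv.toSimplified(s)

# Step 3: write labeled pairs (address1, address2, label) for training/testing.
pairs = [("湖北省武汉市江汉区XX路1号", "武漢市江漢區XX路1號", 1)]
with open("hb_pairs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["address1", "address2", "label"])
    for a1, a2, label in pairs:
        c1, c2 = clean_address(a1), clean_address(a2)
        if c1 and c2:
            writer.writerow([c1, c2, label])
```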

Tables 2–4 show various statistical attributes pertinent to the annotated toponym datasets. The length difference metric quantifies the disparity in character count between pairs of toponyms. The Levenshtein distance (Levenshtein Citation1966) acts as a metric for quantifying the similarity between two toponym strings; a smaller distance value indicates a higher level of resemblance between pairs of toponyms. Furthermore, the Jaccard similarity coefficient (Jaccard Citation1908) is another vital measure for evaluating toponym similarity; a larger Jaccard similarity coefficient indicates fewer disparities between two toponym records. Also, following the metric proposed by Santos et al. (2018), Figures 4–6 exhibit the distribution of toponym length, measured in terms of both character count and word count. Moreover, Figure 7 portrays the distribution of the disparity in character count between matching and non-matching pairs of toponyms, which serves as an indicator of the level of difficulty encountered in the toponym matching task. Notably, pairs with a substantial difference in character count are expected to present higher challenges in classification.
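
A minimal sketch of how these three per-pair statistics can be computed, using the python-Levenshtein package for the edit distance; computing the Jaccard coefficient over character sets is one common choice and an assumption here:

```python
# Per-pair statistics: length difference, edit distance, Jaccard coefficient.
import Levenshtein

def pair_statistics(t1, t2):
    length_difference = abs(len(t1) - len(t2))
    edit_distance = Levenshtein.distance(t1, t2)
    s1, s2 = set(t1), set(t2)
    jaccard = len(s1 & s2) / len(s1 | s2)
    return length_difference, edit_distance, jaccard

print(pair_statistics("武汉市江汉区", "武汉江汉区"))  # (1, 1, 0.8)
```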

Figure 4. The distribution of toponym lengths within the Alidataset.

Figure 5. The distribution of toponym lengths within the SZdataset.

Figure 6. The distribution of toponym lengths within the HBdataset.

Figure 7. Discrepancies observed in the character counts between matching and non-matching pairs across the three datasets.

Table 2. Statistical characteristics of the labeled address dataset from Alidataset.

Table 3. Statistical characteristics of the labeled address dataset from SZdataset.

Table 4. Statistical characteristics of the labeled address dataset from HBdataset.

4.2. Baselines

In order to assess the performance of our proposed model in a rigorous manner, we conducted an empirical comparison with models presented in ten prior approaches, thus establishing a comprehensive benchmark. To ensure fair comparisons, we utilized the publicly accessible source code provided by the aforementioned research endeavors. Moreover, we adopted the default parameter configurations as reported in those papers, ensuring consistency in the evaluation process. The comparison models employed in this study are succinctly summarized as follows:

  • ABCNN uses a general Attention-Based Convolutional Neural Network to determine whether a pair of toponyms matches or not (Yin et al. 2016).

  • ALBERT, as proposed by Lan et al. (Citation2019), introduces two innovative parameter-reduction techniques for BERT (Devlin et al. Citation2018). These techniques aim to mitigate memory consumption while simultaneously enhancing the efficiency of training procedures.

  • BERT is a novel language representation model introduced by Devlin et al. (Citation2018). BERT effectively captures contextual information from both preceding and subsequent words in a sequence, thereby enabling enhanced understanding and representation of textual data.

  • BIMPM represents a bilateral multi-perspective matching model based on the ‘matching-aggregation’ framework (Wang, Hamza, and Florian Citation2017).

  • Decomposable Attention is a straightforward attention-based methodology for natural language inference that exhibits inherent parallelizability (Parikh et al. Citation2016).

  • SiaGRU model is a variant of a bi-directional LSTM network for labeling data consisting of variable-length sequences (Huang et al. Citation2013).

  • RoBERTa, a refined iteration of the BERT model, demonstrates superior performance compared to subsequent post-BERT methods, as highlighted by Liu et al. (Citation2019).

  • XLNet, as proposed by Yang et al. (Citation2019), introduces a generalized autoregressive pretraining method that offers two key advancements in language modeling.

  • DistilBERT, as presented by Sanh et al. (Citation2019), offers a proficient approach to pre-training a compact and versatile language representation model. This model can subsequently be fine-tuned to achieve commendable performance across a diverse array of tasks, akin to its larger counterparts.

  • ESIM is a model that integrates the application of BiLSTM and attention mechanisms (Chen et al. Citation2016).

4.3. Evaluation metrics

Within this study, we have incorporated accuracy, precision (P), recall (R), and F1 as evaluation metrics to validate the performance of the proposed model. Accuracy measures the number of correctly classified samples as a percentage of the total. Precision calculates the percentage of correctly identified matched names (noted as True Positives, TP) among all the matched names predicted by the model, which combines both TP and False Positives (FP). Recall measures the percentage of correctly matched names amongst all ground-truth matches, which is the combination of TP and False Negatives (FN). F1-score is the harmonic mean of precision and recall, providing a comprehensive metric to evaluate model performance. Additionally, we have conducted a comprehensive analysis of cases wherein the method generated erroneous results, complementing the aforementioned metrics (Santos et al. Citation2018a).

These metrics afford a comprehensive evaluation framework, enabling a thorough assessment of the proposed method’s performance and shedding light on the specific areas where improvements are warranted.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
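
In code, Equations (1)–(4) reduce to a few lines; the sketch below computes them directly from the confusion-matrix counts:

```python
# Compute accuracy, precision, recall, and F1 from binary labels/predictions.
def evaluate(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(evaluate([1, 0, 1, 1], [1, 0, 0, 1]))  # (0.75, 1.0, 0.666..., 0.8)
```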

To enhance the reliability of our experiments, we employed a twofold cross-validation methodology using the mentioned datasets. This method entailed dividing the toponym pairs into two distinct subsets, ensuring an equitable inclusion of matching and nonmatching pairs. The aim of this division was to effectively evaluate the performance of our proposed method while mitigating potential biases and ensuring generalizability.
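
One way to realize such a stratified twofold split is scikit-learn’s StratifiedKFold, sketched below with toy pairs and labels (the paper does not state its exact tooling, so this is an illustrative assumption):

```python
# Twofold cross-validation, stratified so each fold keeps an equitable share
# of matching (1) and non-matching (0) pairs.
from sklearn.model_selection import StratifiedKFold

pairs = [("addr_a1", "addr_b1"), ("addr_a2", "addr_b2"),
         ("addr_a3", "addr_b3"), ("addr_a4", "addr_b4")]
labels = [1, 0, 1, 0]

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(pairs, labels)):
    print(f"fold {fold}: train={train_idx}, test={test_idx}")
```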

4.4. Experiment configuration

In this study, the hyperparameters for the proposed model are detailed in Table 5. In order to investigate the most effective combination of hyperparameters for the proposed model, a systematic tuning procedure was conducted, adjusting specific hyperparameters within their feasible value ranges (Bergstra and Bengio Citation2012). Following previous studies (Chen et al. Citation2016), these hyperparameters were selected based on their significant impact on the predictive accuracy of deep learning models in various scenarios. By evaluating the performance of different hyperparameter settings on the development set, we aimed to identify the configuration that maximizes the model’s predictive accuracy and generalization capability.

Table 5. Main hyperparameter settings for the proposed model in this paper.

4.5. The obtained results

To verify the proposed approach on the three toponym datasets, a series of initial experiments were carried out. These experiments compare the performance of (1) distinct string similarity metrics, (2) supervised learning techniques that integrate these various string similarity metrics, and (3) the deep learning method. In addition to accuracy, precision, recall, and F1, we also provide the average processing time for applying each method to sets of 1000 records.

Table 6 presents the experimental results obtained from the Ali dataset and compares them with the ten other deep learning methods discussed in Section 4.2, namely ABCNN, ALBERT, BERT, BIMPM, Decomposable Attention, SiaGRU, RoBERTa, XLNet, DistilBERT, and ESIM. The results, as shown in Table 6, indicate that our proposed method outperforms all other methods across all four metrics, achieving accuracy, precision, recall, and F1 values of 0.815, 0.945, 0.889, and 0.916, respectively. Furthermore, we also include the average runtime for each method on 1000 records, which amounts to 0.2978 s for our proposed model. In comparison, the model employing ESIM achieves accuracy, precision, recall, and F1 values of 0.6802, 0.7754, 0.6802, and 0.7247, respectively. This performance difference can primarily be attributed to the fact that our proposed model is trained with word embedding representations that encompass various word forms, namely Pinyin, Wubi, Radicals, and Strokes. As a result, it effectively captures the semantic features of words.

Table 6. Experimental results obtained with the different methods under consideration on the Ali dataset.

Table 7 presents the experimental results obtained from the SZdataset and compares them with the ten other deep learning methods discussed in Section 4.2. A thorough comparison was conducted, and as indicated in Table 7, while the accuracy, precision, recall, and F1 values for models such as ALBERT, BERT, BIMPM, RoBERTa, XLNet, and ESIM surpass 0.89, all of them fall short in performance compared to the GeoBERTTM model proposed in this paper. Our proposed model outperforms them, achieving accuracy, precision, recall, and F1 values of 0.978, 0.98, 0.967, and 0.973, respectively. The simple reason is that the pre-trained embeddings are specifically designed to accurately capture the semantic essence of the words. Additionally, the incorporation of the ESIM model enables the capture of local features between texts through the LSTM and attention mechanisms, resulting in more precise place name matching.

Table 7. Experimental results obtained with the different methods under consideration on the SZdataset.

Taking inspiration from Lin et al. (Citation2020), we assessed the accuracy of our deep learning method for toponym matching. To achieve this, we conducted a comparative analysis between our approach and several existing address matching methods, focusing on predictive accuracy on the test set. For our deep learning method, we utilized the optimal combination of hyperparameters described in the previous section. Initially, we compared the proposed method with string similarity-based toponym matching methods, which measure the relevance between two toponym records using the Levenshtein distance, the Jaccard similarity coefficient, and the Jaro similarity. Furthermore, we employed random forest (RF) and support vector machine (SVM) classifiers to determine whether the address records matched. Additionally, we evaluated our method against a machine learning-based toponym matching approach that employs word2vec to convert address records into vectors and machine learning classifiers for predictions. The toponym matching accuracy results for these methods are presented in Table 8.

Table 8. Experimental results obtained by the proposed method on the HNDataset.

Table 8 provides compelling evidence that the utilization of string similarity-based methodologies for address matching can yield a commendable level of predictive accuracy, particularly when employing the Jaccard similarity coefficient in conjunction with certain classifiers (S3 and S4). Furthermore, it was observed that, with fixed string similarity metrics, the implementation of RF as the classifier resulted in a notable surge in predictive accuracy (S1, S3, and S5). Notably, among all the string similarity-based methods, the Jaccard similarity coefficient + SVM demonstrated high precision, achieving precision, recall, and F1 values of 0.911, 0.912, and 0.911, respectively (S4). Nevertheless, our novel deep learning-based address matching approach showcased superior performance when compared to the S3 method. The precision, recall, and F1 scores attained by GeoBERTTM on the test set reached an impressive 0.965. Consequently, our proposed method surpassed the conventional text-matching techniques in terms of matching accuracy.

In conclusion, an extensive comparison was conducted between the direct implementation of word2vec embeddings for classification (S9 and S10) and our proposed method. While the precision of S10 demonstrated a noteworthy value of 0.92, it is crucial to highlight that upon integrating ESIM into address matching, all evaluation metrics achieved an impressive score of 0.96. These outcomes emphasize the significant enhancements in predictive accuracy on the test set, brought about by the deep text-matching model GeoBERTTM.

Table 9 presents the experimental results obtained from the HBdataset and compares them with the ten other deep learning methods discussed in Section 4.2. A comprehensive comparison was conducted to evaluate their performance. As depicted in Table 9, among the ten place name matching models, the GeoBERTTM model proposed in this paper demonstrates the highest performance. It achieves accuracy, precision, recall, and F1 values of 0.977, 0.977, 0.965, and 0.97, respectively, surpassing all other models apart from the ABCNN model by more than 2%. The obtained experimental results unequivocally demonstrate that the model presented in this paper exhibits superior performance on our self-constructed HBdataset, thereby confirming the effectiveness and validity of the dataset.

Table 9. Experimental results obtained with the different methods under consideration on the HBdataset.

4.6. Parameter sensitivity analysis

We investigate the impacts of several key hyperparameters in GeoBERTTM: the learning rate l, the number of hidden nodes n, the mini-batch size k, and the number of epochs m.

Figure 8 first presents the impacts of the learning rate. The performance of GeoBERTTM improves as the learning rate increases. With l = 0.0005, the F1 score consistently hovered around 0.97; we found that setting l = 0.0005 strikes a balance between matching performance and training time.

Figure 8. Impacts of learning rate on the HBdataset. P = Precision, R = Recall, and F1 is the harmonic mean of the precision and recall.

As depicted in Figure 9, the F1 score on the development set remained consistently above 0.97 when adjusting the number of hidden nodes (n) between 100 and 400. Notably, the model attained the highest predictive accuracy on the development set when n = 300. However, as also indicated in Figure 9, increasing the number of hidden nodes to n = 350 resulted in a decline in predictive accuracy on the development set. This suggests that a larger number of hidden nodes (350 or 400) introduces training challenges and may cause overfitting. Consequently, we have chosen to set the number of hidden nodes to n = 300.

Figure 9. Impacts of number of hidden nodes on the HBdataset. P = Precision, R = Recall, and F1 is the harmonic mean of the precision and recall.

According to Figure 10, the performance of the model exhibits an upward-then-downward trend when varying the mini-batch size from 20 to 50 in increments of 5. The F1 scores on the development set hover slightly above 0.96 when the mini-batch size is either 25 or 50. However, when the mini-batch size is set to 60, the F1 scores display a relatively large fluctuation, indicating that this size is excessively large. Furthermore, Figure 10 reveals that the prediction accuracy on the development set is at its lowest when the mini-batch size is 25, suggesting that this value is too small. Consequently, we opted for a mini-batch size of 50 in subsequent experiments.

Figure 10. Impacts of mini-batch size on the HBdataset. P = Precision, R = Recall, and F1 is the harmonic mean of the precision and recall.

The impacts of the number of epochs m on the performance of GeoBERTTM are presented in Figure 11. We can observe that increasing the number of epochs does provide performance improvements for toponym matching as m increases. GeoBERTTM achieves its best performance when m is 50 on the HBdataset. However, the performance of GeoBERTTM drops with a larger m, and a larger m also incurs longer training times. We therefore set m = 50 for the HBdataset.

Figure 11. Impacts of epoch on the HBdataset. P = Precision, R = Recall, and F1 is the harmonic mean of the precision and recall.

4.7. Case study and error analysis on the HBdataset

Lastly, the effectiveness of the proposed neural network architecture is showcased through Table 10. This table exhibits instances of record pairs, encompassing both matching and nonmatching pairs, which were correctly or incorrectly classified by the model. While the table encompasses a limited number of examples, an extensive manual analysis further substantiated the superior performance of the proposed neural network architecture in such scenarios.

Table 10. A set of illustrative error examples for the results with the proposed approach on HBdataset.

We focused on the analysis of matching errors and summarized them. The mismatches can be classified into seven categories:

  1. 44% of the misclassifications were due to different toponym prefix ranges. The prefixes of some toponyms are given at the district or city level, while others are given at the road level, which leads to matching errors.

  2. 17% of the errors are due to missing prefix information, in which case the model may not be able to accurately understand the content of the toponym; for example, it cannot determine that toponyms in different areas that share the same name do not match.

  3. 15% of the misclassifications occur in cases of detail differences; differences in district name, door number, or building number can lead to matching errors.

  4. 14% of the errors are caused by large differences in string length.

  5. 3% of the errors are caused by failing to identify containment relationships between toponyms; the most common examples are a shopping mall and the shops within it, or a community and a specific building or gateway within it.

  6. 3% of the errors are caused by toponyms that bear different names for the same place, where the same location may be named differently owing to factors such as history and application scenarios. In such cases, the model may treat them as mismatched toponym pairs.

  7. 4% of the errors were due to colloquialisms, i.e. informal or colloquial expressions in a toponym that the model cannot accurately understand.

5. Discussion

This research offers a new approach for automatically matching addresses from natural language texts, leveraging a deep neural network with a geographic pre-training model to classify pairs of addresses as either matching or nonmatching. From an intellectual standpoint, this is the first attempt to enhance domain-specific address matching performance by combining a deep learning-based algorithm with a variety of embeddings (Pinyin, Wubi, Radicals, and Strokes). The proposed deep learning-based solution can decrease the amount of human intervention necessary during the address matching process. The adopted embeddings provide rich semantics from the computational linguistics area and identify informative words in a sentence, allowing for a more in-depth comprehension of the text and improved capture of domain-specific aspects. From a practical standpoint, applying the proposed address matching technique could aid in streamlining the geocoding of urban data and advancing the geospatial administration of urban systems. As a result, integrating the proposed method with other data, information, or knowledge, or with other software systems already in place, should be straightforward. For instance, a module to match a queried address with one in the address database could easily be added to a number of applications related to geographical information retrieval and the geographical information sciences. This would enable the queried address to be converted into a geographical coordinate and, consequently, located on a map.

Three main limitations of the work are acknowledged, each pointing to a direction for future work. First, we do not consider the problem of duplicate place names in this paper, because the address database used is limited to a single province (Hubei Province), where place-name ambiguity is negligible. However, if our method were applied to a higher-level address database (e.g. a national address database), the model could not distinguish, during word-vector training, two place names that share the same address elements but refer to different locations. Second, when training the address element vectors, we did not take into account the hierarchical relationships among them; that is, we treated address elements at different hierarchical levels as equally important and assigned word-vector values based only on word frequencies. In future work, we therefore plan to adopt a word embedding approach that incorporates domain ontologies and richer place-name knowledge when converting address elements into vectors, which may help resolve both issues. Third, owing to the data-hungry nature of deep learning, data availability and quality are unavoidable concerns for large and complex models. It is common knowledge in the deep learning field that larger datasets yield better generality and performance; however, large-scale, high-quality datasets entail significant labor costs, so few-shot or zero-shot learning needs to be considered in the future. Fine-tuned language models can act as zero-shot or few-shot learners, meaning they can be applied directly to certain downstream tasks with little or no further training, because high-capacity language models better capture the meaning of text. This is also demonstrated in this paper by the results of using BERT to enhance the model's capabilities. Therefore, incorporating very large models such as GPT-3 and GPT-4 may lead to a new round of performance improvement.
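Returning to the second limitation, the following is a minimal sketch of how address-element vectors could be weighted by hierarchical level instead of being treated as equally important. The levels, weights, and the weighted_address_vector helper are illustrative assumptions, not part of the current model.

```python
import numpy as np

# Illustrative level-dependent weights: finer-grained elements count more.
LEVEL_WEIGHTS = {"province": 0.5, "city": 0.8, "district": 1.0,
                 "road": 1.5, "house_number": 2.0}

def weighted_address_vector(elements):
    """elements: list of (level, vector) pairs for one parsed address."""
    weights = np.array([LEVEL_WEIGHTS[lvl] for lvl, _ in elements])
    vectors = np.stack([vec for _, vec in elements])
    # Weighted average of the element vectors, normalized by total weight.
    return (weights[:, None] * vectors).sum(axis=0) / weights.sum()

# Example with random 4-d vectors for a two-element address.
rng = np.random.default_rng(0)
addr = [("city", rng.normal(size=4)), ("road", rng.normal(size=4))]
print(weighted_address_vector(addr))
```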

6. Conclusions and future work

In this study, we propose a novel deep learning architecture that demonstrates efficacy in the realm of semantic toponym matching. Our approach encompasses two key steps. Firstly, we subject the BERT model to pre-training on an extensive corpus of geo-domain texts, which enables the acquisition of distributed embedding representations for characters, thereby resulting in GeoBERT. Subsequently, we utilize word2vec to convert the input toponym records, enriched with Pinyin, Wubi, Radicals, and Strokes features, into vector format. These vectors are then fed into the ESIM, a deep text-matching model, to compute the semantic similarity between compared toponym records and ascertain their degree of match. To validate our proposed method, we conducted experiments using three distinct toponym databases, including a carefully constructed toponym corpus. The experimental results demonstrate that the ESIM achieves optimal performance when configured with 300 hidden nodes, a learning rate of 0.0005, and a minibatch size of 50. Furthermore, our deep learning approach outperforms existing toponym matching methods in terms of matching accuracy, particularly when dealing with unstructured queried toponyms. Notably, our method excels at capturing the intricate character transliterations encountered in certain pairs of toponyms.
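As a companion to this summary, the following is a compact PyTorch sketch of an ESIM-style matcher using the reported hyperparameters (300 hidden nodes, learning rate 0.0005, minibatch size of 50). The embedding inputs stand in for the GeoBERT/word2vec feature vectors; class and tensor names are illustrative, and the sketch omits masking, padding, and dropout for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ESIM(nn.Module):
    """ESIM-style matcher: encode, soft-align, enhance, compose, classify."""
    def __init__(self, emb_dim=300, hidden=300, num_classes=2):
        super().__init__()
        self.encode = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.compose = nn.LSTM(8 * hidden, hidden, bidirectional=True, batch_first=True)
        self.classify = nn.Sequential(
            nn.Linear(8 * hidden, hidden), nn.Tanh(), nn.Linear(hidden, num_classes))

    def forward(self, a, b):
        a_bar, _ = self.encode(a)                    # (batch, len_a, 2h)
        b_bar, _ = self.encode(b)                    # (batch, len_b, 2h)
        e = torch.bmm(a_bar, b_bar.transpose(1, 2))  # alignment scores
        a_tilde = torch.bmm(F.softmax(e, dim=2), b_bar)                   # b aligned to a
        b_tilde = torch.bmm(F.softmax(e, dim=1).transpose(1, 2), a_bar)   # a aligned to b
        # Enhancement: concatenate encodings, alignments, differences, products.
        m_a = torch.cat([a_bar, a_tilde, a_bar - a_tilde, a_bar * a_tilde], dim=-1)
        m_b = torch.cat([b_bar, b_tilde, b_bar - b_tilde, b_bar * b_tilde], dim=-1)
        v_a, _ = self.compose(m_a)
        v_b, _ = self.compose(m_b)
        # Average and max pooling over each sequence, then classify.
        v = torch.cat([v_a.mean(1), v_a.max(1).values,
                       v_b.mean(1), v_b.max(1).values], dim=-1)
        return self.classify(v)

model = ESIM()
opt = torch.optim.Adam(model.parameters(), lr=0.0005)  # learning rate from the paper
logits = model(torch.randn(50, 12, 300), torch.randn(50, 9, 300))  # minibatch of 50
print(logits.shape)  # torch.Size([50, 2])
```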

Although the paper presents interesting findings, it also opens numerous avenues for future research. For instance, it would be worthwhile to explore advanced optimization techniques for training deep neural networks and systematic approaches for fine-tuning hyperparameters. Additionally, experimenting with diverse model architectures and knowledge representations, including models such as ChatGPT that draw on recent advances in textual entailment and related NLP tasks, holds promise. In future work, we intend to introduce a strategy that assigns varying weights to toponym element vectors based on their hierarchical structure, thereby further improving the accuracy of toponym matching.
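As one example of such systematic hyperparameter tuning, the sketch below applies simple random search over a small search space; the ranges and the train_and_eval stub are illustrative assumptions rather than the configuration used in this study.

```python
import random

def train_and_eval(config):
    """Stand-in for training the matcher and returning a validation score."""
    return random.random()  # replace with a real training run

search_space = {
    "hidden": [100, 200, 300, 400],
    "lr": [1e-4, 5e-4, 1e-3],
    "batch_size": [32, 50, 64],
}

best = None
for _ in range(20):
    # Sample one value per hyperparameter and evaluate the configuration.
    config = {k: random.choice(v) for k, v in search_space.items()}
    score = train_and_eval(config)
    if best is None or score > best[0]:
        best = (score, config)
print("best config:", best[1], "val score:", round(best[0], 3))
```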

Acknowledgments

We would like to express our great appreciation to the editors and two anonymous reviewers for constructive comments that helped improve the manuscript.

Disclosure statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability statements

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Additional information

Funding

This study was financially supported by the National Key R&D Program of China (No. 2022YFB3904200, 2022YFF0711601), the Natural Science Foundation of China (No. 42301492), the Open Fund of Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering (No. 2022SDSJ04), and the Open Fund of Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources (No. KF-2022-07-014).

References

  • Alsudais, A., W. Alotaibi, and F. Alomary. 2022. “Similarities Between Arabic Dialects: Investigating Geographical Proximity.” Information Processing & Management 59 (1): 102770. https://doi.org/10.1016/j.ipm.2021.102770
  • Amir, A., Y. Aumann, G. Benson, A. Levy, O. Lipsky, E. Porat, S. Skiena, and U. Vishne. 2009. “Pattern Matching with Address Errors: Rearrangement Distances.” Journal of Computer and System Sciences 75 (6): 359–370. https://doi.org/10.1016/j.jcss.2009.03.001
  • Bergstra, J., and Y. Bengio. 2012. “Random Search for Hyper-Parameter Optimization.” Journal of Machine Learning Research 13 (2): 281–305.
  • Berkhin, P., M. R. Evans, F. Teodorescu, W. Wu, and D. Yankov. 2015. “A New Approach to Geocoding: BingGC.” In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, edited by Mohamed Ali and Yan Huang, 1–10. New York: Association for Computing Machinery.
  • Buckles, B., J. Buckley, and F. E. Petry. 1994. “Architecture of FAME: Fuzzy Address Matching Environment.” In Proceedings of 1994 IEEE 3rd International Fuzzy Systems Conference, edited by Nicole McFarlane, 308–312. Orlando, FL, USA: IEEE.
  • Cao, S., W. Lu, J. Zhou, and X. Li. 2018. “cw2vec: Learning Chinese Word Embeddings with Stroke n-Gram Information.” In Proceedings of the AAAI Conference on Artificial Intelligence, 5053–5061. Palo Alto, CA: AAAI Press.
  • Chen, J., J. Chen, X. She, J. Mao, and G. Chen. 2021. “Deep Contrast Learning Approach for Address Semantic Matching.” Applied Sciences 11 (16): 7608. https://doi.org/10.3390/app11167608
  • Chen, Q., X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen. 2016. “Enhanced LSTM for Natural Language Inference.” arXiv preprint arXiv:1609.06038.
  • Cheng, J., L. Dong, and M. Lapata. 2016. “Long Short-term Memory-networks for Machine Reading.” arXiv preprint arXiv:1601.06733.
  • Cheng, R., J. Liao, and J. Chen. 2022. “Quickly Locating POIs in Large Datasets from Descriptions Based on Improved Address Matching and Compact Qualitative Representations.” Transactions in GIS 26 (1): 129–154. https://doi.org/10.1111/tgis.12838
  • Comber, S., and D. Arribas-Bel. 2019. “Machine Learning Innovations in Address Matching: A Practical Comparison of word2vec and CRFs.” Transactions in GIS 23 (2): 334–348. https://doi.org/10.1111/tgis.12522
  • Devlin, J., M. W. Chang, K. Lee, and K. Toutanova. 2018. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv preprint arXiv:1810.04805.
  • Eidoon, Z., N. Yazdani, and F. Oroumchian. 2008. “Ontology Matching Using Vector Space.” In Advances in Information Retrieval: 30th European Conference on IR Research, edited by Craig Macdonald, Iadh Ounis, Vassilis Plachouras, and Ian Ruthven, 472–481. Berlin Heidelberg: Springer.
  • Fan, Y., L. Pang, J. Hou, J. Guo, Y. Lan, and X. Cheng. 2017. “Matchzoo: A Toolkit for Deep Text Matching.” arXiv preprint arXiv:1707.07270.
  • Hochreiter, S., and J. Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation 9 (8): 1735–1780.
  • Hu, X., H. S. Al-Olimat, J. Kersten, M. Wiegmann, F. Klan, Y. Sun, and H. Fan. 2022b. “GazPNE: Annotation-Free Deep Learning for Place Name Extraction from Microblogs Leveraging Gazetteer and Synthetic Data by Rules.” International Journal of Geographical Information Science 36 (2): 310–337. https://doi.org/10.1080/13658816.2021.1947507
  • Hu, X., Y. Hu, B. Resch, and J. Kersten. 2023a. “Geographic Information Extraction from Texts (GeoExT).” In European Conference on Information Retrieval, edited by Jaap Kamps and Lorraine Goeuriot, 398–404. Cham: Springer Nature Switzerland.
  • Hu, X., Y. Sun, J. Kersten, Z. Zhou, F. Klan, and H. Fan. 2023b. “How Can Voting Mechanisms Improve the Robustness and Generalizability of Toponym Disambiguation?” International Journal of Applied Earth Observation and Geoinformation 117: 103191. https://doi.org/10.1016/j.jag.2023.103191
  • Hu, X., Z. Zhou, H. Li, Y. Hu, F. Gu, J. Kersten, H. Fan, and F. Klan. 2022a. “Location Reference Recognition from Texts: A Survey and Comparison.” arXiv preprint arXiv:2207.01683.
  • Hu, X., Z. Zhou, Y. Sun, J. Kersten, F. Klan, H. Fan, and M. Wiegmann. 2022c. “GazPNE2: A General Place Name Extractor for Microblogs Fusing Gazetteers and Pretrained Transformer Models.” IEEE Internet of Things Journal 9 (17): 16259–16271. https://doi.org/10.1109/JIOT.2022.3150967
  • Huang, P. S., X. He, J. Gao, L. Deng, A. Acero, and L. Heck. 2013. “Learning Deep Structured Semantic Models for Web Search Using Clickthrough Data.” In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, edited by Qi He and Arun Iyengar, 2333–2338. New York: Association for Computing Machinery.
  • Jaccard, P. 1908. “Nouvelles Recherches sur la Distribution Florale.” Bulletin de la Société Vaudoise des Sciences Naturelles 44: 223–270.
  • Koumarelas, I., A. Kroschk, C. Mosley, and F. Naumann. 2018. “Experience: Enhancing Address Matching with Geocoding and Similarity Measure Selection.” Journal of Data and Information Quality (JDIQ) 10 (2): 1–16. https://doi.org/10.1145/3232852
  • Lan, Z., M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. 2019. “Albert: A Lite Bert for Self-supervised Learning of Language Representations.” arXiv preprint arXiv:1909.11942.
  • Levenshtein, V. I. 1966. “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals.” Soviet Physics Doklady 10 (8): 707–710.
  • Li, J., B. Chiu, S. Feng, and H. Wang. 2020a. “Few-shot Named Entity Recognition via Meta-Learning.” IEEE Transactions on Knowledge and Data Engineering 34 (9): 4245–4256. https://doi.org/10.1109/TKDE.2020.3038670
  • Li, J., S. Feng, and B. Chiu. 2023. “Few-shot Relation Extraction with Dual Graph Neural Network Interaction.” IEEE Transactions on Neural Networks and Learning Systems 1–13.
  • Li, J., P. Han, X. Ren, J. Hu, L. Chen, and S. Shang. 2021. “Sequence Labeling with Meta-Learning.” IEEE Transactions on Knowledge and Data Engineering 35 (3): 3072–3086.
  • Li, F., Y. Lu, X. Mao, J. Duan, and X. Liu. 2022. “Multi-task Deep Learning Model Based on Hierarchical Relations of Address Elements for Semantic Address Matching.” Neural Computing and Applications 34 (11): 8919–8931. https://doi.org/10.1007/s00521-022-06914-1
  • Li, J., S. Shang, and L. Chen. 2020b. “Domain Generalization for Named Entity Boundary Detection via Metalearning.” IEEE Transactions on Neural Networks and Learning Systems 32 (9): 3819–3830. https://doi.org/10.1109/TNNLS.2020.3015912
  • Li, L., W. Wang, B. He, and Y. Zhang. 2018. “A Hybrid Method for Chinese Address Segmentation.” International Journal of Geographical Information Science 32 (1): 30–48. https://doi.org/10.1080/13658816.2017.1379084
  • Li, D., S. Wang, and Z. Mei. 2010. “Approximate Address Matching.” In 2010 International Conference on P2P, Parallel, Grid, Cloud and Internet Computing, edited by Leopoldo G. Franquelo, 264–269. Fukuoka, Japan: IEEE.
  • Lin, Y., M. Kang, Y. Wu, Q. Du, and T. Liu. 2020. “A Deep Learning Architecture for Semantic Address Matching.” International Journal of Geographical Information Science 34 (3): 559–576.
  • Liu, Y., M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. 2019. “Roberta: A Robustly Optimized Bert Pretraining Approach.” arXiv preprint arXiv:1907.11692.
  • Ma, X. 2022. “Knowledge Graph Construction and Application in Geosciences: A Review.” Computers & Geosciences 161: 105082. https://doi.org/10.1016/j.cageo.2022.105082
  • Mauro, N., L. Ardissono, and M. Lucenteforte. 2020. “Faceted Search of Heterogeneous Geographic Information for Dynamic map Projection.” Information Processing & Management 57 (4): 102257. https://doi.org/10.1016/j.ipm.2020.102257
  • Moreau, E., F. Yvon, and O. Cappé. 2008. “Robust Similarity Measures for Named Entities Matching.” In Proceedings of the 22nd International Conference on Computational Linguistics, edited by Donia Scott and Hans Uszkoreit, 593–600. Manchester, UK: Association for Computational Linguistics.
  • Moura, T. H., C. A. Davis Jr, and F. T. Fonseca. 2017. “Reference Data Enhancement for Geographic Information Retrieval Using Linked Data.” Transactions in GIS 21 (4): 683–700. https://doi.org/10.1111/tgis.12238
  • Parikh, A. P., O. Täckström, D. Das, and J. Uszkoreit. 2016. “A Decomposable Attention Model for Natural Language Inference.” arXiv preprint arXiv:1606.01933.
  • Qin, T., F. Ren, T. Hu, J. Liu, R. Li, and Q. Du. 2016. “Using an Optimized Chinese Address Matching Method to Develop a Geocoding Service: A Case Study of Shenzhen, China.” ISPRS International Journal of Geo-Information 5 (5): 65. https://doi.org/10.3390/ijgi5050065.
  • Qiu, Q., Z. Xie, K. Ma, Z. Chen, and L. Tao. 2022b. “Spatially Oriented Convolutional Neural Network for Spatial Relation Extraction from Natural Language Texts.” Transactions in GIS 26 (2): 839–866. https://doi.org/10.1111/tgis.12887
  • Qiu, Q., Z. Xie, K. Ma, L. Tao, and S. Zheng. 2023. “NeuroSPE: A Neuro-net Spatial Relation Extractor for Natural Language Text Fusing Gazetteers and Pretrained Models.” Transactions in GIS 27 (5): 1526–1549. https://doi.org/10.1111/tgis.13086
  • Qiu, Q., Z. Xie, S. Wang, Y. Zhu, H. Lv, and K. Sun. 2022a. “ChineseTR: A Weakly Supervised Toponym Recognition Architecture Based on Automatic Training Data Generator and Deep Neural Network.” Transactions in GIS 26 (3): 1256–1279. https://doi.org/10.1111/tgis.12902
  • Recchia, G., and M. Louwerse. 2013. “A Comparison of String Similarity Measures for Toponym Matching.”
  • Sanh, V., L. Debut, J. Chaumond, and T. Wolf. 2019. “DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter.” arXiv preprint arXiv:1910.01108.
  • Santos, J., I. Anastácio, and B. Martins. 2015. “Using Machine Learning Methods for Disambiguating Place References in Textual Documents.” GeoJournal 80 (3): 375–392. https://doi.org/10.1007/s10708-014-9553-y
  • Santos, R., P. Murrieta-Flores, P. Calado, and B. Martins. 2018b. “Toponym Matching Through Deep Neural Networks.” International Journal of Geographical Information Science 32 (2): 324–348. https://doi.org/10.1080/13658816.2017.1390119
  • Santos, R., P. Murrieta-Flores, and B. Martins. 2018a. “Learning to Combine Multiple String Similarity Metrics for Effective Toponym Matching.” International Journal of Digital Earth 11 (9): 913–938. https://doi.org/10.1080/17538947.2017.1371253
  • Shan, S., Z. Li, Q. Yang, A. Liu, L. Zhao, G. Liu, and Z. Chen. 2020. “Geographical Address Representation Learning for Address Matching.” World Wide Web 23 (3): 2005–2022. https://doi.org/10.1007/s11280-020-00782-2
  • Su, T. R., and H. Y. Lee. 2017. “Learning Chinese Word Representations from Glyphs of Characters.” arXiv preprint arXiv:1708.04755.
  • Sun, Z., A. G. Qiu, J. Zhao, F. Zhang, Y. Zhao, and L. Wang. 2013. “Technology of Fuzzy Chinese-Geocoding Method.” In 2013 International Conference on Information Science and Cloud Computing, edited by W. Dale Blair, 7–12. Guangzhou, China: IEEE.
  • Varol, C., and C. Bayrak. 2012. “Hybrid Matching Algorithm for Personal Names.” Journal of Data and Information Quality (JDIQ) 3 (4): 1–18. https://doi.org/10.1145/2348828.2348830
  • Wang, Z., W. Hamza, and R. Florian. 2017. “Bilateral Multi-perspective Matching for Natural Language Sentences.” arXiv preprint arXiv:1702.03814.
  • Winkler, W. E. 1990. “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage.”
  • Xu, L., R. Mao, C. Zhang, Y. Wang, X. Zheng, X. Xue, and F. Xia. 2022. “Deep Transfer Learning Model for Semantic Address Matching.” Applied Sciences 12 (19): 10110. https://doi.org/10.3390/app121910110
  • Yang, Z., Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le. 2019. “Xlnet: Generalized Autoregressive Pretraining for Language Understanding.” Advances in Neural Information Processing Systems 32.
  • Yin, W., H. Schütze, B. Xiang, and B. Zhou. 2016a. “Abcnn: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs.” Transactions of the Association for Computational Linguistics 4: 259–272. https://doi.org/10.1162/tacl_a_00097
  • Yin, R., Q. Wang, P. Li, R. Li, and B. Wang. 2016b. “Multi-granularity Chinese Word Embedding.” In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, edited by Jian Su and Kevin Duh, 981–986. Austin, Texas: Association for Computational Linguistics.
  • Yu, J., X. Jian, H. Xin, and Y. Song. 2017. “Joint Embeddings of Chinese Words, Characters, and Fine-Grained Subcharacter Components.” In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, edited by Martha Palmer, Rebecca Hwa, and Sebastian Riedel, 286–291. Copenhagen, Denmark: Association for Computational Linguistics.
  • Zhang, X., Y. Huang, C. Zhang, and P. Ye. 2022. “Geoscience Knowledge Graph (GeoKG): Development, Construction and Challenges.” Transactions in GIS 26 (6): 2480–2494. https://doi.org/10.1111/tgis.12985
  • Zhang, H., F. Ren, H. Li, R. Yang, S. Zhang, and Q. Du. 2020. “Recognition Method of new Address Elements in Chinese Address Matching Based on Deep Learning.” ISPRS International Journal of Geo-Information 9 (12): 745. https://doi.org/10.3390/ijgi9120745