Development of an AI language model for extracting address information
Project duration: 1 year, 2 months
Brief description
The aim of the project is to investigate which approaches are suitable for automatically extracting the components of an address from text input. AI language models from the field of deep learning are of particular interest. PTA examines the state of the art and identifies candidate methods based on neural networks. Using synthetic test data sets and metrics, PTA evaluates various approaches and architectures to determine which of them are fit for customer use.
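The description does not say which metrics were used. As one plausible illustration, the following minimal sketch computes token-level precision, recall and F1 per address tag. The function name `per_tag_f1`, the tag names and the sample sequences are assumptions made for this example and are not taken from the project.

```python
from collections import Counter


def per_tag_f1(gold, predicted):
    """Token-level precision, recall and F1 per tag.

    `gold` and `predicted` are lists of tag sequences of equal length,
    one tag per input token (an assumed data layout).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_seq, pred_seq in zip(gold, predicted):
        for g, p in zip(gold_seq, pred_seq):
            if g == p:
                tp[g] += 1          # tag predicted correctly
            else:
                fp[p] += 1          # predicted tag p, but it was wrong
                fn[g] += 1          # gold tag g was missed

    scores = {}
    for tag in set(tp) | set(fp) | set(fn):
        predicted_total = tp[tag] + fp[tag]
        gold_total = tp[tag] + fn[tag]
        precision = tp[tag] / predicted_total if predicted_total else 0.0
        recall = tp[tag] / gold_total if gold_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[tag] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Illustrative data: the second token is a name but was tagged as street.
gold = [["name", "name", "street", "zip", "city"]]
pred = [["name", "street", "street", "zip", "city"]]
scores = per_tag_f1(gold, pred)
print(scores["street"])
```

Reporting scores per tag rather than a single accuracy figure shows which address components a model confuses most often, which is useful when comparing architectures.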
Supplement
To ensure data protection, PTA generates synthetic recipient addresses from a street section file and various name lists (first name, surname, company name). To make the data set realistic, the address components are arranged randomly according to standard patterns, and spelling errors are injected at random: for example, letters are removed or swapped with a certain probability. PTA uses the Python framework PyTorch Sequence-to-Sequence (Seq2Seq) to develop language models and benchmarks them against the previously generated synthetic test data set. Models based on the Seq2Seq architecture essentially consist of two components, an encoder and a decoder. The encoder's task is to understand the input text, that is, to turn it into an internal representation; the decoder's task is to annotate the components of the input text with the corresponding tags (name, street, city, …).
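The generation procedure described above (random arrangement by pattern, plus random removal or swapping of letters) can be sketched roughly as follows. This is a minimal illustration, not PTA's implementation: the name lists, street entries, patterns, tag names, and the functions `add_typo` and `make_sample` are all hypothetical stand-ins, and restricting typos to alphabetic tokens is a simplifying assumption.

```python
import random

# Hypothetical stand-ins for the name lists and the street section file.
FIRST_NAMES = ["Anna", "Jonas", "Maria"]
SURNAMES = ["Schmidt", "Weber", "Fischer"]
STREETS = [
    ("Hauptstrasse", "60311", "Frankfurt"),
    ("Bahnhofweg", "80331", "Muenchen"),
]

# Assumed arrangement patterns: each one is an ordering of component tags.
PATTERNS = [
    ["first_name", "surname", "street", "house_number", "zip", "city"],
    ["surname", "first_name", "zip", "city", "street", "house_number"],
]


def add_typo(word, rng, p=0.1):
    """With probability p, remove one letter or swap two adjacent letters."""
    # Assumption: only alphabetic tokens of length >= 3 receive typos.
    if len(word) < 3 or not word.isalpha() or rng.random() >= p:
        return word
    i = rng.randrange(len(word) - 1)
    if rng.random() < 0.5:
        return word[:i] + word[i + 1:]                      # remove letter i
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]  # swap i and i+1


def make_sample(rng, typo_prob=0.1):
    """Return (tokens, tags) for one synthetic, possibly noisy address."""
    street, zip_code, city = rng.choice(STREETS)
    parts = {
        "first_name": rng.choice(FIRST_NAMES),
        "surname": rng.choice(SURNAMES),
        "street": street,
        "house_number": str(rng.randint(1, 199)),
        "zip": zip_code,
        "city": city,
    }
    pattern = rng.choice(PATTERNS)
    tokens = [add_typo(parts[tag], rng, typo_prob) for tag in pattern]
    return tokens, list(pattern)


rng = random.Random(42)  # fixed seed so the data set is reproducible
tokens, tags = make_sample(rng)
print(" ".join(tokens))
print(tags)
```

Because the generator knows which component each token came from, every sample carries its gold tags automatically, which is what makes the synthetic data usable as a benchmark for a tagging model.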
Subject description
A central component of the customer's business model is the processing of address data. Recipient addresses are transmitted to the customer via various input channels and interfaces. It is not uncommon for parts of a recipient address (title, first name, surname, company name, street, house number, house number suffix, zip code, town and district) to be incorrect or, for example, mixed up. The usual causes are incorrect entries during the ordering process or a lack of address validation on the part of the customer's client. Incorrect addresses often lead to consignments being sorted incorrectly and therefore not reaching the intended destination. They also make route planning during delivery more difficult, because route planning relies on the geocoordinates of an address, and these can usually be determined correctly only if the individual components of the address have been recognized correctly.