Technology

This library is written in Scala and can therefore be used directly from Java. SBT is used as the build tool. The library currently depends on only two other libraries: Apache Lucene (Java) and uPickle (Scala). Up to version 1.1.0, the spray-json library was used instead of uPickle.

Java libraries are easier to use from Scala than the other way around. The Scala collections in particular are, in my view, not very intuitive to use from Java. I did not want to use the Java collections in the Scala interface either, because that would lose the beauty of Scala. Since the library will probably be used mostly from Java, I decided to implement a converter for the main collections in the public interface. With this converter the collections can easily be converted to their Java counterparts, so both worlds can be used and connected. If you program in Java, the class CollectionConv in the package esc.utils will be your friend.

The library does not handle exceptions itself. By design, exceptions are passed through unchanged and must be caught and processed by the application that uses the library, so clean exception handling in your application is important. The library also deliberately does not write log files; you have to route its output into your own logging environment.
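As a minimal sketch of what application-side handling could look like, the following wrapper runs a call into the library, logs any failure through java.util.logging, and returns an empty result instead. The class SafeCall and its method run are hypothetical names introduced for this illustration; they are not part of the library. Because Scala does not declare checked exceptions, the sketch catches Exception broadly.

```java
import java.util.Optional;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical application-side wrapper; none of these names come from the library.
final class SafeCall {
    private static final Logger LOG = Logger.getLogger("name-matching");

    private SafeCall() {}

    // Runs a library call, logs any exception in the application's own logger,
    // and returns an empty Optional instead of letting the failure propagate.
    static <T> Optional<T> run(String description, Supplier<T> libraryCall) {
        try {
            return Optional.ofNullable(libraryCall.get());
        } catch (Exception e) {
            // Scala code can throw checked exceptions without declaring them,
            // so the catch is deliberately broad.
            LOG.log(Level.SEVERE, "Library call failed: " + description, e);
            return Optional.empty();
        }
    }
}
```

Whether to swallow, wrap, or rethrow the exception is an application decision; the sketch only shows one place where the library's exceptions can be intercepted and logged.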

Architecture

The solution is logically divided into different packages:

  • Application: Contains the simple command-line application
  • Commons: Contains the case classes as data objects
  • Configuration: Contains the case class with the configuration for the solution
  • Normalization: Contains the case classes for the name normalization
  • Similarity: Contains the case classes calculating the similarity
  • Index: Contains the case classes for indexing names and searching
  • Utils: Contains the helper classes

The normalization of names is separate from the calculation of similarity. Normalization is computationally intensive, but its result, a normalized name, can be persisted and reused for any later comparison.
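The idea of normalizing once and reusing the result can be sketched as follows. The class NormalizedNameStore is hypothetical, and its normalize method is a deliberately simple stand-in (strip diacritics, lower-case, collapse whitespace), not the library's normalization algorithm. An in-memory map stands in for whatever persistence the application uses.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: the normalization below is a simple stand-in,
// not the library's algorithm.
final class NormalizedNameStore {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // Stand-in normalizer: strips diacritics, lower-cases, collapses whitespace.
    static String normalize(String rawName) {
        String decomposed = Normalizer.normalize(rawName, Normalizer.Form.NFD);
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");
        return withoutMarks.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
    }

    // Normalizes a name on first use and reuses the stored result afterwards.
    String normalizedOf(String rawName) {
        return cache.computeIfAbsent(rawName, NormalizedNameStore::normalize);
    }

    // Number of distinct raw names normalized so far.
    int size() {
        return cache.size();
    }
}
```

Looking up the same raw name a second time returns the stored value without running the normalizer again, which is where the saving comes from when the real normalization is expensive.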

Performance

Both the normalization and the calculation of the similarity are computationally intensive. There are therefore two ways to use the library: you can calculate the similarity directly 1:1, or you can use the findPerson/Organization function of the Indexer/Finder. In the first case, each name must be compared with every other name, which can take a long time with hundreds of thousands of names: each name has to be normalized anew for every comparison, and every pair really has to be compared. If you index the names you want to search in advance, you gain two performance advantages. First, each indexed name only needs to be normalized once. Second, the Finder retrieves only a subset of candidate names from the index, so not every name has to be compared with every other. In most cases it is therefore advisable to use the library via indexing.
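To illustrate why an index reduces the number of comparisons, here is a small sketch of candidate selection. The class TokenIndex is hypothetical and uses a plain token-to-names map; the library itself builds on Apache Lucene, and how its Finder actually selects candidates is not shown here. The point is only that a query is compared against the returned candidates rather than against every indexed name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative stand-in for an index: maps each name token to the names containing it.
final class TokenIndex {
    private final Map<String, Set<String>> namesByToken = new HashMap<>();

    // Splits a name into lower-case tokens on any run of non-letter, non-digit characters.
    private static String[] tokens(String name) {
        return name.toLowerCase(Locale.ROOT).trim().split("[^\\p{L}\\p{N}]+");
    }

    // Registers a name under each of its tokens.
    void add(String name) {
        for (String token : tokens(name)) {
            if (!token.isEmpty()) {
                namesByToken.computeIfAbsent(token, t -> new LinkedHashSet<>()).add(name);
            }
        }
    }

    // Returns only the indexed names that share at least one token with the query.
    List<String> candidates(String query) {
        Set<String> result = new LinkedHashSet<>();
        for (String token : tokens(query)) {
            result.addAll(namesByToken.getOrDefault(token, Collections.emptySet()));
        }
        return new ArrayList<>(result);
    }
}
```

With "Bill Gates", "Hu Jintao" and "UBS (Schweiz) AG" indexed, the query "William Henry Gates" yields a single candidate, so only one similarity calculation is needed instead of three. Exact token matching is a simplification; a real index would need keys that also tolerate spelling variants.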

The search via the index can be parallelized, which improves performance further.
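One possible way to parallelize on the application side is to run independent queries concurrently against an index that has already been built. The sketch below uses Java parallel streams; ParallelSearch and the search parameter are hypothetical, and the sketch assumes that the lookup passed in is safe to call from several threads at once.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper: 'search' stands for any thread-safe lookup against a prebuilt index.
final class ParallelSearch {
    private ParallelSearch() {}

    // Runs one search per distinct query on the common fork-join pool
    // and collects the hits per query.
    static Map<String, List<String>> searchAll(List<String> queries,
                                               Function<String, List<String>> search) {
        return queries.parallelStream()
                .distinct()
                .collect(Collectors.toConcurrentMap(Function.identity(), search));
    }
}
```

The speed-up depends on the number of available cores and on how expensive a single search is; for very small query batches the sequential version may be just as fast.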

Roadmap

Please find the roadmap in the wiki of the GitHub repository.

Match or no match? A few examples

The following examples were produced with the default settings. The library offers several settings for defining the degree of similarity at which two names count as a hit.

#   Name A                   Name B                     Match  No match
1   Hans-Peter Müller        Hanspeter Müller-Meyer
2   Wladimir Jewtuschenkow   Vladimir Yevtushenkov
3   Christian Meier          Christine Meier
4   Daniela Meyer            Daniel Meier
5   Hu Jintao                Hu Chintao
6   Bill Gates               William Henry Gates
7   Marlone Corti            Marlone Conti
8   UBS (Schweiz) AG         UBS (Switzerland) Ltd.
9   Schneider Treuhand AG    Treuhand Schnyder AG
10  Microsoft (Schweiz) LLC  Microspot (Schweiz) GmbH