Framework for Syntactic String Similarity Measures
Najlah Gali, Radu Mariescu-Istodor, Damien Hostettler and Pasi Fränti
Expert Systems with Applications, 129 (1), 169-185, September 2019.

This page demonstrates the string similarity measures discussed in the paper above. A Java toolkit containing all the measures is also available here. The toolkit provides multiple measures for computing the syntactic similarity of strings. In addition, it can compute soft similarity of strings by combining character-level and token-level functions. For comparison, semantic soft measures using Word2Vec are also supported. For the Word2Vec measures to work, the enclosed Python Word2Vec service must first be started on the local machine, and a suitable language model is required. We experimented with the Google News dataset available here.
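To make the idea of combining the two levels concrete, here is a minimal sketch of one possible soft measure. It is an illustration under stated assumptions, not the toolkit's actual API: the class name `SoftSimilaritySketch` and all method names are hypothetical, the character-level function is a normalized Levenshtein similarity, and the token-level function is a Monge-Elkan-style average of best matches over whitespace-separated, lowercased tokens.

```java
/** Hypothetical sketch of a soft similarity measure; not the toolkit's API. */
final class SoftSimilaritySketch {

    /** Levenshtein edit distance, computed with two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }

    /** Character-level similarity in [0, 1]: 1 - distance / longer length. */
    static double charSimilarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    /** Splits on whitespace after trimming and lowercasing (an assumption). */
    static String[] tokens(String s) {
        String t = s.trim().toLowerCase();
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    /**
     * Token-level combination: each token of a is paired with its most
     * similar token of b, and the best scores are averaged over a's tokens.
     */
    static double softSimilarity(String a, String b) {
        String[] ta = tokens(a);
        String[] tb = tokens(b);
        if (ta.length == 0 || tb.length == 0) {
            return ta.length == tb.length ? 1.0 : 0.0;
        }
        double sum = 0.0;
        for (String x : ta) {
            double best = 0.0;
            for (String y : tb) best = Math.max(best, charSimilarity(x, y));
            sum += best;
        }
        return sum / ta.length;
    }

    public static void main(String[] args) {
        System.out.println(softSimilarity("cafe manta", "manta cafe"));
        System.out.println(softSimilarity("cafe manta", "caffe manta"));
    }
}
```

Because the average runs over the tokens of the first string only, this sketch is not symmetric in general; a symmetric variant could average the scores of both directions. Token order does not affect the result, while small spelling differences inside a token lower it only slightly.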

The toolkit is written in Java and uses two libraries for computing string similarity: SecondString and SimMetrics. The most interesting functions from these libraries were adapted to work together. We also implemented a few other methods, namely Longest Common Substring, Hamming, Damerau-Levenshtein, Monge-Elkan, Manhattan, Euclidean, Cosine and Chaudhuri.
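For readers unfamiliar with the character-level methods named above, the following sketch shows textbook versions of three of them. It is not the toolkit's code: the class `ExtraMeasuresSketch` and its method names are hypothetical, and the Damerau-Levenshtein method implements the restricted (optimal string alignment) variant, which may differ from the variant used in the toolkit.

```java
/** Hypothetical textbook sketches; not taken from the toolkit. */
final class ExtraMeasuresSketch {

    /** Hamming distance: count of differing positions in equal-length strings. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("strings must have equal length");
        }
        int d = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) d++;
        }
        return d;
    }

    /** Length of the longest common contiguous substring (dynamic programming). */
    static int longestCommonSubstring(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    curr[j] = prev[j - 1] + 1; // extend the run ending at (i-1, j-1)
                    best = Math.max(best, curr[j]);
                }
            }
            prev = curr;
        }
        return best;
    }

    /** Damerau-Levenshtein distance, restricted (optimal string alignment) variant. */
    static int damerauLevenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                // Transposition of two adjacent characters counts as one edit.
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }
}
```

These methods return distances or lengths rather than similarities. One common way to obtain a similarity in [0, 1] is to divide by the length of the longer string and, for distances, subtract the result from 1.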

Below is an interactive demo of the similarity toolkit. Enter two strings to compute their similarity using any combination of token and character measures. Alternatively, similarities can be computed in batch: drag a text file containing one string per line into the specified area, and the similarity between all string pairs will be computed. Large files will take a long time to process.
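The batch mode can be approximated offline with a sketch like the one below. It is an assumption about how such a computation might be organized, not the demo's implementation: the class `BatchSimilaritySketch` and its methods are hypothetical, blank lines are skipped, and the similarity function passed in is assumed to be symmetric so that each unordered pair is evaluated only once.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.ToDoubleBiFunction;
import java.util.stream.Collectors;

/** Hypothetical sketch of all-pairs batch processing; not the demo's code. */
final class BatchSimilaritySketch {

    /** Reads a UTF-8 text file with one string per line, skipping blank lines. */
    static List<String> readStrings(Path file) throws IOException {
        return Files.readAllLines(file).stream()
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    /**
     * Fills an n x n matrix of similarities. The measure is assumed to be
     * symmetric, so each unordered pair is evaluated once and mirrored.
     */
    static double[][] allPairs(List<String> strings,
                               ToDoubleBiFunction<String, String> sim) {
        int n = strings.size();
        double[][] m = new double[n][n];
        for (int i = 0; i < n; i++) {
            m[i][i] = sim.applyAsDouble(strings.get(i), strings.get(i));
            for (int j = i + 1; j < n; j++) {
                double s = sim.applyAsDouble(strings.get(i), strings.get(j));
                m[i][j] = s;
                m[j][i] = s;
            }
        }
        return m;
    }
}
```

A file with n strings yields n(n-1)/2 distinct pairs, so the work grows quadratically with the number of lines, which is why large files take long to process. Any similarity function can be plugged in, for example `allPairs(readStrings(path), (x, y) -> x.equals(y) ? 1.0 : 0.0)` for exact matching.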

DRAG FILE HERE

A :
B :
Unit similarity :
Group matching :