Optimal Matching Distances Between Categorical Sequences: Distortion and Inferences by Permutation

Juan P. Zuluaga

Abstract

Sequence data (ordered lists of categorical states) are common in the social sciences, genetics, and computational linguistics.

For exploration and inference on sets of sequences, a measure of dissimilarity between sequences allows the data to be analyzed with techniques such as clustering, multidimensional scaling, and distance-based regression. Sequences can be placed on a map where similar sequences lie close together and dissimilar ones lie far apart. Such patterns of dispersion and concentration can then be related to covariates. For example, do the employment trajectories of men and women tend to form separate clusters?

Optimal Matching (OM) distances have been proposed as a measure of dissimilarity between sequences. Assuming that sequences are empirical realizations of latent random objects, this thesis explores how well OM distances fit the original distances between the latent objects that generated the sequences, and the geometrical nature of the resulting distortions.
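As a rough illustration of what an OM distance measures, the sketch below computes the cheapest way to turn one sequence into another using substitutions and insertions/deletions (indels). It is a minimal dynamic-programming version with a single constant substitution cost. The function name `om_distance`, the cost values (substitution 2, indel 1), and the example trajectories (E = employed, U = unemployed) are assumptions chosen for illustration, not values taken from the thesis.

```python
def om_distance(a, b, sub_cost=2.0, indel_cost=1.0):
    """Minimum total cost of editing sequence a into sequence b.

    Assumed costs (illustrative only): one constant substitution cost
    and one constant insertion/deletion cost.
    """
    n, m = len(a), len(b)
    # prev[j] = cost of turning the first i-1 states of a into the first j of b
    prev = [j * indel_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel_cost] + [0.0] * m
        for j in range(1, m + 1):
            match_or_sub = prev[j - 1] + (0.0 if a[i - 1] == b[j - 1] else sub_cost)
            delete = prev[j] + indel_cost      # drop a[i-1]
            insert = curr[j - 1] + indel_cost  # add b[j-1]
            curr[j] = min(match_or_sub, delete, insert)
        prev = curr
    return prev[m]


# Hypothetical employment trajectories: E = employed, U = unemployed.
print(om_distance("EEUU", "EEUU"))  # 0.0
print(om_distance("EEUU", "EEEU"))  # 2.0 (one substitution)
print(om_distance("EU", "E"))       # 1.0 (one deletion)
```

With these assumed costs a substitution costs exactly as much as a deletion followed by an insertion, so the two kinds of edit are interchangeable in the second example.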

Simulations show that raw OM dissimilarities are not an exact mirror of true distances and show systematic distortions. Common values for OM substitution and insertion/deletion costs produce dissimilarities that are metric, but not Euclidean. On the other hand, distances can be easily transformed to be Euclidean.
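The distinction between metric and Euclidean can be seen in a small constructed example, sketched below. The four toy sequences, the costs (substitution 2, indel 1), and all function names are assumptions made for illustration. The dissimilarities satisfy the triangle inequality, yet the Gram matrix built from them has a negative quadratic form, so no configuration of points in Euclidean space reproduces them. Taking square roots is one commonly used transformation; it happens to make this particular example Euclidean, and the abstract does not say which transformation the thesis uses.

```python
from itertools import permutations
from math import sqrt


def om_distance(a, b, sub_cost=2.0, indel_cost=1.0):
    """Edit cost between sequences under assumed, illustrative costs."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel_cost] + [0.0] * len(b)
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (0.0 if a[i - 1] == b[j - 1] else sub_cost)
            curr[j] = min(sub, prev[j] + indel_cost, curr[j - 1] + indel_cost)
        prev = curr
    return prev[len(b)]


def satisfies_triangle_inequality(D):
    n = len(D)
    return all(D[i][k] <= D[i][j] + D[j][k]
               for i, j, k in permutations(range(n), 3))


def gram_matrix(D, ref=0):
    """Inner products implied by D, taking item `ref` as the origin.

    D is Euclidean exactly when this matrix is positive semidefinite.
    """
    idx = [i for i in range(len(D)) if i != ref]
    return [[(D[ref][i] ** 2 + D[ref][j] ** 2 - D[i][j] ** 2) / 2 for j in idx]
            for i in idx]


def quadratic_form(G, v):
    return sum(v[i] * G[i][j] * v[j]
               for i in range(len(v)) for j in range(len(v)))


# Toy sequences: "X" is 1 away from the others, which are 2 apart from each other.
seqs = ["X", "XA", "XB", "XC"]
D = [[om_distance(s, t) for t in seqs] for s in seqs]

print(satisfies_triangle_inequality(D))  # True: D is a metric here
G = gram_matrix(D)
q = quadratic_form(G, [1.0, 1.0, 1.0])
print(q)  # -3.0: a negative value means G is not positive semidefinite

# Square-root transform: the Gram matrix becomes (numerically) the identity.
D_sqrt = [[sqrt(d) for d in row] for row in D]
G_sqrt = gram_matrix(D_sqrt)
```

A single negative quadratic form is enough to rule out a Euclidean representation; confirming that a matrix is Euclidean in general requires checking all eigenvalues of the Gram matrix, which this sketch does not do.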

If differing values of a covariate lead to different latent random objects, and thus to different sequences, are there tests with enough power to detect such variability amid the natural random variation between sequences? Such tests should be robust enough to cope with the non-Euclidean nature of OM distances.

Several statistical tests (Permutational MANOVA, MRPP, Mantel's correlation, t-tests, and median tests) were compared for statistical power in detecting associations between inter-item dissimilarities and a categorical explanatory variable. This thesis shows analytically that, under simple conditions, the first four tests are mathematically equivalent. Simulations confirmed that these tests had the same power. Tests are less powerful with longer sequences.
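The logic of inference by permutation can be sketched as follows. The statistic used here is the pseudo-F ratio commonly associated with Permutational MANOVA, computed directly from a dissimilarity matrix; the function names, the group labels, and the toy data are assumptions for illustration and are not taken from the thesis. Group labels are shuffled repeatedly, and the p-value is the share of shuffles whose statistic is at least as large as the observed one.

```python
import random


def pseudo_f(D, labels):
    """Pseudo-F statistic computed from pairwise dissimilarities."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(D[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in groups:
        members = [i for i in range(n) if labels[i] == g]
        ss_g = sum(D[i][j] ** 2
                   for k, i in enumerate(members) for j in members[k + 1:])
        ss_within += ss_g / len(members)
    ss_among = ss_total - ss_within
    a = len(groups)
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permutation_test(D, labels, n_perm=999, seed=0):
    """Return the observed statistic and its permutation p-value."""
    rng = random.Random(seed)
    f_obs = pseudo_f(D, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between labels and distances
        if pseudo_f(D, shuffled) >= f_obs:
            hits += 1
    # Counting the observed labelling itself keeps the p-value above zero.
    return f_obs, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): twelve items on a line, in two well-separated groups.
positions = [0, 1, 2, 3, 4, 5, 20, 21, 22, 23, 24, 25]
labels = ["men"] * 6 + ["women"] * 6
D = [[abs(p - q) for q in positions] for p in positions]

f_obs, p_value = permutation_test(D, labels)
print(f_obs, p_value)
```

Because the test only needs the dissimilarity matrix and the labels, it makes no assumption that the dissimilarities are Euclidean. In practice `D` would hold OM dissimilarities between observed sequences rather than the toy values used here.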