Applied Statistics: M.S.
Department of Mathematics and Statistics
College of Science and Engineering
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
Sequence data (ordered sets of categorical states) are very common in the social sciences, genetics, and computational linguistics.
For exploration and inference on sets of sequences, a measure of dissimilarity among sequences allows the data to be analyzed by techniques such as clustering, multidimensional scaling, and distance-based regression analysis. Sequences can be placed on a map where similar sequences are close together and dissimilar ones are far apart. Such patterns of dispersion and concentration can then be related to other covariates. For example, do the employment trajectories of men and women tend to form separate clusters?
Optimal Matching (OM) distances have been proposed as a measure of dissimilarity between sequences. Assuming that sequences are empirical realizations of latent random objects, this thesis explores how well OM distances fit the original distances between the latent objects that generated the sequences, and the geometrical nature of the resulting distortions.
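As an illustration of what an OM distance measures, the following is a minimal sketch of the usual dynamic-programming computation: the cheapest total cost of substitutions, insertions, and deletions that turns one sequence into another. The function name `om_distance`, the constant substitution cost of 2, the insertion/deletion cost of 1, and the employment-state codes are assumptions made for this example, not values taken from the thesis.

```python
def om_distance(a, b, sub_cost=2.0, indel_cost=1.0):
    """Optimal Matching distance between two sequences of categorical states.

    Returns the minimum total cost of substitutions, insertions and
    deletions needed to turn `a` into `b`. The default costs are
    illustrative assumptions, not values from the thesis.
    """
    n, m = len(a), len(b)
    # prev[j]: cost of aligning the first i-1 states of a with the first j of b.
    # For i = 0 this is j insertions.
    prev = [j * indel_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        # Aligning i states of a with an empty prefix of b takes i deletions.
        curr = [i * indel_cost] + [0.0] * m
        for j in range(1, m + 1):
            same = a[i - 1] == b[j - 1]
            substitute = prev[j - 1] + (0.0 if same else sub_cost)
            delete = prev[j] + indel_cost
            insert = curr[j - 1] + indel_cost
            curr[j] = min(substitute, delete, insert)
        prev = curr
    return prev[m]


# Hypothetical employment trajectories: E = employed, U = unemployed.
d_close = om_distance("EEUU", "EEEU")
d_far = om_distance("EEEE", "UUUU")
```

Here `d_close` is smaller than `d_far`, since the first pair differs in one position and the second in all four.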
Simulations show that raw OM dissimilarities are not an exact mirror of the true distances and show systematic distortions. Common values for the OM substitution and insertion/deletion costs produce dissimilarities that are metric but not Euclidean. However, these dissimilarities can easily be transformed to be Euclidean.
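The difference between metric and Euclidean can be checked numerically. The sketch below tests the triangle inequality directly and tests Euclideanity through the eigenvalues of the Gower-centered matrix of squared dissimilarities, which are all non-negative exactly when the dissimilarities can be embedded in a Euclidean space. The additive correction shown (adding a constant to the squared off-diagonal dissimilarities, often attributed to Lingoes) is one standard choice assumed here for illustration; the abstract does not say which transformation the thesis uses. The function names and the four-sequence example are likewise assumptions.

```python
import numpy as np


def gower_eigenvalues(d):
    """Eigenvalues of B = -1/2 * J * D^2 * J, with J the centering matrix."""
    d = np.asarray(d, dtype=float)
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    return np.linalg.eigvalsh(b)  # ascending order


def is_metric(d, tol=1e-9):
    """Check the triangle inequality d[i,k] <= d[i,j] + d[j,k] for all i, j, k."""
    d = np.asarray(d, dtype=float)
    return bool(np.all(d[:, None, :] <= d[:, :, None] + d[None, :, :] + tol))


def is_euclidean(d, tol=1e-9):
    """Dissimilarities are Euclidean iff the centered matrix has no negative eigenvalue."""
    return bool(gower_eigenvalues(d).min() >= -tol)


def additive_correction(d):
    """Assumed correction: d'_ij = sqrt(d_ij^2 + 2c) for i != j,
    with c equal to the magnitude of the most negative eigenvalue."""
    d = np.asarray(d, dtype=float)
    c = max(0.0, -gower_eigenvalues(d).min())
    corrected = np.sqrt(d ** 2 + 2.0 * c)
    np.fill_diagonal(corrected, 0.0)
    return corrected


# OM distances among the hypothetical sequences "A", "AB", "AC", "AD",
# assuming insertion/deletion cost 1 and substitution cost 2.
D = np.array([
    [0.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 2.0, 2.0],
    [1.0, 2.0, 0.0, 2.0],
    [1.0, 2.0, 2.0, 0.0],
])

metric_ok = is_metric(D)
euclidean_before = is_euclidean(D)
euclidean_after = is_euclidean(additive_correction(D))
```

In this small example the triangle inequality holds, yet no point can sit at distance 1 from three points that are mutually 2 apart, so the raw matrix is metric without being Euclidean.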
If differing values of a covariate lead to different latent random objects, and thus to different sequences, are there tests with enough power to detect such variability against the natural random variation between sequences? Such tests should be robust enough to cope with the non-Euclidean nature of OM distances.
Several statistical tests (permutational MANOVA, MRPP, Mantel’s correlation, t-tests, and median tests) were compared for statistical power in detecting associations between inter-item dissimilarities and a categorical explanatory variable. This thesis shows analytically that, under simple conditions, the first four tests are mathematically equivalent. Simulations confirmed that the tests had the same power. Tests are less powerful with longer sequences.
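To make the idea of a permutation test on dissimilarities concrete, here is a minimal sketch of one of the tests named above, permutational MANOVA, in its usual formulation: a pseudo-F ratio built from sums of squared dissimilarities within and among groups, with a p-value obtained by reshuffling the group labels. The function names, the number of permutations, and the toy data are assumptions for illustration and do not reproduce the thesis's simulations.

```python
import random


def pseudo_f(d, labels):
    """Pseudo-F statistic from a square dissimilarity matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    # Total sum of squares: squared dissimilarities over all pairs, divided by n.
    ss_total = sum(d[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: same quantity computed inside each group.
    ss_within = 0.0
    for g in groups:
        idx = [i for i in range(n) if labels[i] == g]
        pairs = sum(d[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:])
        ss_within += pairs / len(idx)
    ss_among = ss_total - ss_within
    a = len(groups)
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permutation_test(d, labels, n_perm=999, seed=0):
    """Return the observed pseudo-F and its permutation p-value."""
    rng = random.Random(seed)
    observed = pseudo_f(d, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance so ties are not lost to floating-point rounding.
        if pseudo_f(d, shuffled) >= observed - 1e-12:
            hits += 1
    # The observed labelling counts as one of the permutations.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: eight items on a line, forming two well-separated groups.
positions = [0, 1, 2, 3, 10, 11, 12, 13]
groups = ["m", "m", "m", "m", "f", "f", "f", "f"]
dist = [[abs(x - y) for y in positions] for x in positions]

f_obs, p_value = permutation_test(dist, groups, n_perm=199, seed=1)
```

Because the test uses only the dissimilarities and the labels, it makes no assumption that the dissimilarities are Euclidean, which is the property the preceding paragraph asks for.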
Zuluaga, Juan P., "Optimal Matching Distances between Categorical Sequences: Distortion and Inferences by Permutation" (2013). Culminating Projects in Applied Statistics. 8.