Date of Award
12-2013
Culminating Project Type
Thesis
Degree Name
Applied Statistics: M.S.
Department
Department of Mathematics and Statistics
College
College of Science and Engineering
First Advisor
Hui Xu
Second Advisor
Shiju Zhang
Third Advisor
John Kulas
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
Abstract
Sequence data (ordered sets of categorical states) are a very common type of data in the social sciences, genetics, and computational linguistics.
For exploration and inference on sets of sequences, a measure of dissimilarity among sequences allows the data to be analyzed with techniques such as clustering, multidimensional scaling, and distance-based regression analysis. Sequences can be placed on a map where similar sequences are close together and dissimilar ones are far apart. Such patterns of dispersion and concentration can then be related to other covariates. For example, do the employment trajectories of men and women tend to form separate clusters?
Optimal Matching (OM) distances have been proposed as a measure of dissimilarity between sequences. Assuming that sequences are empirical realizations of latent random objects, this thesis explores how good the fit is between OM distances and original distances between the latent objects that generated the sequences, and the geometrical nature of such distortions.
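As an illustration of the dissimilarity discussed above, the following is a minimal sketch of an OM distance computed by dynamic programming. It assumes a single constant substitution cost and a single insertion/deletion (indel) cost; the function name `om_distance`, the default costs (2 and 1), and the state labels in the usage note are illustrative assumptions, not values taken from the thesis.

```python
def om_distance(seq_a, seq_b, sub_cost=2.0, indel_cost=1.0):
    """Minimum total cost of the substitutions, insertions and deletions
    needed to turn seq_a into seq_b (constant costs assumed)."""
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = cheapest way to turn the first i states of seq_a
    # into the first j states of seq_b
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * indel_cost  # delete i states
    for j in range(1, m + 1):
        cost[0][j] = j * indel_cost  # insert j states
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = seq_a[i - 1] == seq_b[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + (0.0 if same else sub_cost),  # match or substitute
                cost[i - 1][j] + indel_cost,                       # delete from seq_a
                cost[i][j - 1] + indel_cost,                       # insert into seq_a
            )
    return cost[n][m]
```

With hypothetical employment states `E` (employed) and `U` (unemployed), `om_distance("EEUU", "EEEU")` compares two four-period trajectories that differ in one period. Real applications often use state-dependent substitution costs, which this sketch does not cover.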
Simulations show that raw OM dissimilarities are not an exact mirror of true distances and show systematic distortions. Common values for OM substitution and insertion/deletion costs produce dissimilarities that are metric, but not Euclidean. On the other hand, distances can be easily transformed to be Euclidean.
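The distinction between metric and Euclidean dissimilarities can be made concrete with a small check. The sketch below tests the triangle inequality and then inspects the smallest eigenvalue of the Gower-centered matrix of squared distances, which is nonnegative (up to rounding) exactly when the points can be embedded in Euclidean space. The four-sequence example (`A`, `AA`, `AB`, `AC`, with indel cost 1 and substitution cost 2), the helper names, and the square-root transformation are assumptions chosen for illustration; the abstract does not state which transformation the thesis uses.

```python
import math

def is_metric(d, tol=1e-12):
    """True if the triangle inequality holds for every triple of items."""
    n = len(d)
    return all(d[i][j] <= d[i][k] + d[k][j] + tol
               for i in range(n) for j in range(n) for k in range(n))

def gower_center(d):
    """Double-center the squared distances: B = -1/2 * J * D^2 * J."""
    n = len(d)
    sq = [[d[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]  # row means (equal to column means by symmetry)
    grand = sum(row) / n
    return [[-0.5 * (sq[i][j] - row[i] - row[j] + grand) for j in range(n)]
            for i in range(n)]

def symmetric_eigenvalues(a, sweeps=100, tol=1e-20):
    """Eigenvalues of a small symmetric matrix by cyclic Jacobi rotations."""
    n = len(a)
    a = [r[:] for r in a]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # angle that zeroes the (p, q) entry
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))

def min_gower_eigenvalue(d):
    """Negative value means the dissimilarities are not Euclidean."""
    return symmetric_eigenvalues(gower_center(d))[0]

# OM distances among the sequences "A", "AA", "AB", "AC"
# (assumed costs: indel = 1, substitution = 2), worked out by hand.
d_raw = [[0, 1, 1, 1],
         [1, 0, 2, 2],
         [1, 2, 0, 2],
         [1, 2, 2, 0]]
# One common transformation (an assumption here): take square roots.
d_sqrt = [[math.sqrt(x) for x in row] for row in d_raw]
```

In this toy case `d_raw` satisfies the triangle inequality but `min_gower_eigenvalue(d_raw)` is negative (about -0.25), so no Euclidean configuration reproduces it. For `d_sqrt` the smallest eigenvalue is zero up to rounding, so the transformed distances are Euclidean in this example.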
If differing values of a covariate lead to different latent random objects, and thus to different sequences, are there tests with enough power to detect that variability amid the natural random variation between sequences? Such tests should be robust enough to cope with the non-Euclidean nature of OM distances.
A number of statistical tests (Permutational MANOVA, MRPP, Mantel’s correlation, t-tests, and median tests) were compared for statistical power on associations between inter-item dissimilarities and a categorical explanatory variable. This thesis shows analytically that, under simple conditions, the first four tests are mathematically equivalent. Simulations confirmed that the tests had the same power. Tests are less powerful with longer sequences.
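To show what a permutation test on a dissimilarity matrix looks like, the following is a minimal sketch of a one-factor permutational MANOVA: it computes a pseudo-F ratio from between-group and within-group sums of squared distances, then reshuffles the group labels to build a reference distribution. The function names, the default of 999 permutations, and the fixed seed are illustrative assumptions, and the sketch is not the simulation code used in the thesis.

```python
import random

def pseudo_f(dist, labels):
    """Pseudo-F ratio computed directly from pairwise distances."""
    n = len(labels)
    groups = sorted(set(labels))
    # Total sum of squares: squared distances over all pairs, divided by n
    ss_total = sum(dist[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: same quantity inside each group
    ss_within = 0.0
    for g in groups:
        idx = [i for i in range(n) if labels[i] == g]
        ss_within += sum(dist[i][j] ** 2
                         for a, i in enumerate(idx) for j in idx[a + 1:]) / len(idx)
    ss_among = ss_total - ss_within
    # Assumes ss_within > 0 (not all within-group distances are zero)
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))

def permutation_test(dist, labels, n_perm=999, seed=0):
    """Return the observed pseudo-F and its permutation p-value."""
    rng = random.Random(seed)
    f_obs = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between labels and distances
        if pseudo_f(dist, shuffled) >= f_obs - 1e-12:
            hits += 1
    # Count the observed labeling as one of the permutations
    return f_obs, (hits + 1) / (n_perm + 1)
```

Here `dist` would be a symmetric matrix of OM distances between sequences and `labels` the covariate value for each sequence (for example, a hypothetical `"m"`/`"f"` coding). Because only the labels are permuted, the procedure does not require the distances to be Euclidean.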
Recommended Citation
Zuluaga, Juan P., "Optimal Matching Distances between Categorical Sequences: Distortion and Inferences by Permutation" (2013). Culminating Projects in Applied Statistics. 8.
https://repository.stcloudstate.edu/stat_etds/8
Comments/Acknowledgements
I would like to thank a number of people who provided essential support for the completion of this thesis: my advisor, Dr. Hui Xu, for his encouragement, interest, and patience; the members of my thesis committee; the faculty of the Statistics department in general; the Business Computing Research Laboratory (BCRL); the Office of Research and Sponsored Programs; Dr. Robert Johnson at Precollege Programs; and the library of St. Cloud State University, especially Inter-Library Loan.
To Tina, thank you for your love and support all these years.