Date of Award

12-2013

Culminating Project Type

Thesis

Degree Name

Applied Statistics: M.S.

Department

Department of Mathematics and Statistics

College

College of Science and Engineering

First Advisor

Hui Xu

Second Advisor

Shiju Zhang

Third Advisor

John Kulas

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Abstract

Sequence data (an ordered set of categorical states) is a very common type of data in Social Sciences, Genetics and Computational Linguistics.

For exploration and inference of sets of sequences, having a measure of dissimilarity among sequences would allow the data to be analyzed by techniques like clustering, multidimensional scaling analysis and distance-based regression analysis. Sequences can be placed in a map where similar sequences are close together and dissimilar ones are far apart. Such patterns of dispersion and concentration could be related to other covariates. For example, do the employment trajectories of men and women tend to form separate clusters?

Optimal Matching (OM) distances have been proposed as a measure of dissimilarity between sequences. Assuming that sequences are empirical realizations of latent random objects, this thesis explores how good the fit is between OM distances and original distances between the latent objects that generated the sequences, and the geometrical nature of such distortions.
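As an illustration of what an OM distance is, the following minimal Python sketch computes the cheapest sequence of substitutions, insertions and deletions that turns one sequence into another, using the standard dynamic-programming recurrence for edit distances. The function name `om_distance` and the default costs (substitution 2, insertion/deletion 1) are assumptions chosen for the example; the abstract does not state which cost values the thesis uses.

```python
def om_distance(a, b, sub_cost=2.0, indel_cost=1.0):
    """Optimal Matching distance between two sequences of categorical states.

    The costs are illustrative defaults, not values taken from the thesis.
    """
    n, m = len(a), len(b)
    # prev[j] holds the cost of turning the first i-1 states of a
    # into the first j states of b; row 0 uses insertions only.
    prev = [j * indel_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel_cost] + [0.0] * m  # column 0 uses deletions only
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            curr[j] = min(
                prev[j - 1] + sub,         # match or substitute
                prev[j] + indel_cost,      # delete a[i-1]
                curr[j - 1] + indel_cost,  # insert b[j-1]
            )
        prev = curr
    return prev[m]


# Two hypothetical employment trajectories (E = employed, U = unemployed).
print(om_distance("EEUU", "EEEU"))  # 2.0: one substitution
print(om_distance("EUE", "UEU"))    # 2.0: one deletion plus one insertion
```

With these costs a substitution is never cheaper than a deletion followed by an insertion, so time-shifted sequences such as "EUE" and "UEU" come out closer than a position-by-position comparison would suggest.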

Simulations show that raw OM dissimilarities are not an exact mirror of true distances and show systematic distortions. Common values for OM substitution and insertion/deletion costs produce dissimilarities that are metric, but not Euclidean. On the other hand, distances can be easily transformed to be Euclidean.
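The difference between "metric" and "Euclidean" can be checked numerically. The sketch below is an illustration under stated assumptions, not the procedure used in the thesis: `is_metric` tests the triangle inequality, and `is_euclidean` applies the classical criterion that a distance matrix is Euclidean exactly when its doubly centered matrix of squared distances is positive semidefinite. The function names, the tolerance, and the two small example matrices are hypothetical.

```python
import numpy as np


def is_metric(d, tol=1e-9):
    """True if d is symmetric, non-negative, zero on the diagonal,
    and satisfies d[i, k] <= d[i, j] + d[j, k] for all i, j, k."""
    d = np.asarray(d, dtype=float)
    basic = (np.allclose(d, d.T)
             and np.allclose(np.diag(d), 0.0)
             and bool((d >= 0).all()))
    # Axes are (i, j, k): left side is d[i, k], right side d[i, j] + d[j, k].
    triangle = d[:, None, :] <= d[:, :, None] + d[None, :, :] + tol
    return bool(basic and triangle.all())


def is_euclidean(d, tol=1e-9):
    """True if the points can be embedded in a Euclidean space with
    exactly these distances (doubly centered Gram matrix is PSD)."""
    d = np.asarray(d, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    return bool(np.linalg.eigvalsh(gram).min() >= -tol)


# One hub at distance 1 from three items that are 2 apart from each other.
star = [[0, 1, 1, 1],
        [1, 0, 2, 2],
        [1, 2, 0, 2],
        [1, 2, 2, 0]]
# Three points on a line at positions 0, 1 and 3.
line = [[0, 1, 3],
        [1, 0, 2],
        [3, 2, 0]]

print(is_metric(star), is_euclidean(star))  # True False
print(is_metric(line), is_euclidean(line))  # True True
```

The `star` matrix obeys the triangle inequality, yet no point in any Euclidean space lies at distance 1 from all three corners of an equilateral triangle with side 2, so it is metric but not Euclidean.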

If differing values of a covariate lead to different latent random objects and thus different sequences, are there tests with enough power to detect such variability amid the natural random variation between sequences? Such tests should be robust enough to cope with the non-Euclidean nature of OM distances.

A number of statistical tests (Permutational MANOVA, MRPP, Mantel’s correlation, and t-tests and median tests) were compared for statistical power, on associations between inter-item dissimilarities and a categorical explanatory variable. This thesis shows analytically that under simple conditions, the first four tests are mathematically equivalent. Simulations confirmed that tests had the same power. Tests are less powerful with longer sequences.
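To make the idea of a permutation test on dissimilarities concrete, here is a minimal sketch of a permutational MANOVA. It assumes the commonly used pseudo-F statistic built from squared pairwise distances and a p-value obtained by shuffling group labels; whether the thesis uses exactly this formulation is not stated in the abstract. The names `pseudo_f` and `permutation_test`, the number of permutations, the seed, and the toy data are all hypothetical.

```python
import random


def pseudo_f(dist, labels):
    """Pseudo-F ratio computed from a square distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    # Total sum of squares: squared distances over all pairs, divided by n.
    ss_total = sum(dist[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: same quantity inside each group.
    ss_within = 0.0
    for g in groups:
        idx = [i for i in range(n) if labels[i] == g]
        ss_within += sum(dist[i][j] ** 2
                         for k, i in enumerate(idx) for j in idx[k + 1:]) / len(idx)
    ss_among = ss_total - ss_within
    # Assumes ss_within > 0, i.e. not all within-group distances are zero.
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))


def permutation_test(dist, labels, n_perm=999, seed=0):
    """Return the observed pseudo-F and its permutation p-value."""
    rng = random.Random(seed)
    f_obs = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between labels and distances
        if pseudo_f(dist, shuffled) >= f_obs - 1e-12:
            hits += 1
    # The observed labelling counts as one of the permutations.
    return f_obs, (hits + 1) / (n_perm + 1)


# Toy data: eight items on a line, two well-separated groups of four.
points = [0, 1, 2, 0.5, 10, 11, 12, 10.5]
dist = [[abs(a - b) for b in points] for a in points]
labels = ["m"] * 4 + ["w"] * 4
f_obs, p_value = permutation_test(dist, labels)
print(f_obs, p_value)
```

In practice `dist` would hold the OM dissimilarities between sequences. Because the p-value comes from relabelling rather than from a distributional assumption, the same code runs unchanged on dissimilarities that are not Euclidean.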

Comments/Acknowledgements

I would like to thank a number of people who provided essential support for the completion of this thesis: my advisor, Dr. Hui Xu, for his encouragement, interest, and patience; the members of my thesis committee; the faculty of the Statistics department in general; the Business Computing Research Laboratory (BCRL); the Office of Research and Sponsored Programs; Dr. Robert Johnson at Precollege Programs; the library of St. Cloud State University, especially Inter-Library Loan.

To Tina, thank you for your love and support all these years.

Recommended Citation

Zuluaga, Juan P., "Optimal Matching Distances between Categorical Sequences: Distortion and Inferences by Permutation" (2013). Culminating Projects in Applied Statistics. 8.
https://repository.stcloudstate.edu/stat_etds/8

Included in

Applied Statistics Commons

The Repository @ St. Cloud State

Open Access Knowledge and Scholarship

Culminating Projects in Applied Statistics

Optimal Matching Distances between Categorical Sequences: Distortion and Inferences by Permutation
