The Repository @ St. Cloud State

Open Access Knowledge and Scholarship

Date of Award

12-2013

Culminating Project Type

Thesis

Degree Name

Applied Statistics: M.S.

Department

Department of Mathematics and Statistics

College

College of Science and Engineering

First Advisor

Hui Xu

Second Advisor

Shiju Zhang

Third Advisor

John Kulas

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Abstract

Sequence data (an ordered set of categorical states) is a very common type of data in Social Sciences, Genetics and Computational Linguistics.

For exploration and inference on sets of sequences, a measure of dissimilarity among sequences would allow the data to be analyzed by techniques such as clustering, multidimensional scaling analysis and distance-based regression analysis. Sequences can be placed in a map where similar sequences are close together and dissimilar ones are far apart. Such patterns of dispersion and concentration could be related to other covariates. For example, do the employment trajectories of men and women tend to form separate clusters?

Optimal Matching (OM) distances have been proposed as a measure of dissimilarity between sequences. Assuming that sequences are empirical realizations of latent random objects, this thesis explores how good the fit is between OM distances and original distances between the latent objects that generated the sequences, and the geometrical nature of such distortions.
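As an illustration of what an OM distance computes, the following is a minimal sketch in Python. It assumes a single constant substitution cost and a single insertion/deletion (indel) cost; the function name `om_distance` and the default cost values are illustrative assumptions, not taken from the thesis.

```python
def om_distance(a, b, sub_cost=2.0, indel_cost=1.0):
    """Optimal matching distance between two state sequences.

    Assumes one constant substitution cost and one indel cost.
    The defaults are illustrative, not values from the thesis.
    """
    n, m = len(a), len(b)
    # prev[j] holds the cheapest cost of turning a[:i-1] into b[:j].
    prev = [j * indel_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel_cost] + [0.0] * m
        for j in range(1, m + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            curr[j] = min(
                prev[j - 1] + sub,         # match or substitute
                prev[j] + indel_cost,      # delete a[i-1]
                curr[j - 1] + indel_cost,  # insert b[j-1]
            )
        prev = curr
    return prev[m]


# Hypothetical employment trajectories: E = employed, U = unemployed.
print(om_distance("EEUU", "EEEU"))
```

With these costs, a substitution costs the same as one deletion plus one insertion, so the two hypothetical trajectories above are at distance 2.0.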

Simulations show that raw OM dissimilarities are not an exact mirror of true distances and show systematic distortions. Common values for OM substitution and insertion/deletion costs produce dissimilarities that are metric, but not Euclidean. On the other hand, distances can be easily transformed to be Euclidean.
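The abstract does not say which transformation is meant. As a sketch of one standard option, the Python code below checks whether a dissimilarity matrix is Euclidean by inspecting the eigenvalues of its Gower-centered matrix, and applies the Lingoes additive-constant correction when it is not. The function names are hypothetical, and the choice of the Lingoes correction is an assumption, not necessarily the transformation used in the thesis.

```python
import numpy as np


def gower_eigenvalues(d):
    """Eigenvalues of the Gower-centered matrix -1/2 * J D^2 J."""
    d = np.asarray(d, dtype=float)
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    g = -0.5 * j @ (d ** 2) @ j
    return np.linalg.eigvalsh(g)


def is_euclidean(d, tol=1e-9):
    """A dissimilarity matrix is Euclidean iff no eigenvalue is negative."""
    return gower_eigenvalues(d).min() >= -tol


def lingoes_correction(d):
    """Add 2*c to every squared off-diagonal distance, where c is the
    absolute value of the most negative Gower eigenvalue."""
    d = np.asarray(d, dtype=float)
    c = max(0.0, -gower_eigenvalues(d).min())
    corrected = np.sqrt(d ** 2 + 2.0 * c)
    np.fill_diagonal(corrected, 0.0)
    return corrected


# A metric that is not Euclidean: one "center" at distance 1 from three
# items that are at distance 2 from each other. It arises, for instance,
# from the sequences A, AB, AC, AD with indel cost 1 and substitution cost 2.
D = np.array([
    [0, 1, 1, 1],
    [1, 0, 2, 2],
    [1, 2, 0, 2],
    [1, 2, 2, 0],
])
print(is_euclidean(D), is_euclidean(lingoes_correction(D)))
```

The example matrix satisfies the triangle inequality, yet no point can lie at distance 1 from all three vertices of an equilateral triangle of side 2, so it cannot be embedded in any Euclidean space until it is corrected.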

If differing values of a covariate lead to different latent random objects and thus different sequences, are there tests with enough power to detect such variability amid the natural random variation between sequences? Such tests should be robust enough to cope with the non-Euclidean nature of OM distances.

A number of statistical tests (Permutational MANOVA, MRPP, Mantel’s correlation, t-tests, and median tests) were compared for statistical power on associations between inter-item dissimilarities and a categorical explanatory variable. This thesis shows analytically that under simple conditions, the first four tests are mathematically equivalent. Simulations confirmed that the tests had the same power. Tests are less powerful with longer sequences.
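To make the idea of a distance-based permutation test concrete, here is a minimal Python sketch of a permutational MANOVA on a dissimilarity matrix with a categorical grouping. It is a generic textbook-style formulation under stated assumptions, not the implementation used in the thesis; the function names, the number of permutations and the toy data are all illustrative.

```python
import numpy as np


def pseudo_f(d, labels):
    """Pseudo-F statistic computed from pairwise dissimilarities only."""
    d2 = np.asarray(d, dtype=float) ** 2
    labels = np.asarray(labels)
    n = len(labels)
    groups = np.unique(labels)
    # Total sum of squares: squared distances over all pairs, divided by n.
    ss_total = d2[np.triu_indices(n, k=1)].sum() / n
    # Within-group sum of squares, accumulated group by group.
    ss_within = 0.0
    for g in groups:
        idx = np.flatnonzero(labels == g)
        sub = d2[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    ss_between = ss_total - ss_within
    a = len(groups)
    return (ss_between / (a - 1)) / (ss_within / (n - a))


def permanova_pvalue(d, labels, n_perm=999, seed=0):
    """Permutation p-value: shuffle the labels, keep the distances fixed."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = pseudo_f(d, labels)
    hits = sum(
        pseudo_f(d, rng.permutation(labels)) >= observed
        for _ in range(n_perm)
    )
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): two well-separated groups of items on a line.
x = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 10.0, 10.1, 10.2, 10.3, 10.4])
toy_d = np.abs(x[:, None] - x[None, :])
toy_labels = ["a"] * 5 + ["b"] * 5
f_obs, p_value = permanova_pvalue(toy_d, toy_labels)
```

Because the statistic uses only the pairwise dissimilarities, the same code can be run on any dissimilarity matrix, including a non-Euclidean one; the permutation step supplies the reference distribution instead of a parametric F distribution.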

Comments/Acknowledgements

I would like to thank a number of people who provided essential support for the completion of this thesis: my advisor, Dr. Hui Xu, for his encouragement, interest, and patience; the members of my thesis committee; the faculty of the Statistics department in general; the Business Computing Research Laboratory (BCRL); the Office of Research and Sponsored Programs; Dr. Robert Johnson at Precollege Programs; the library of St. Cloud State University, especially Inter-Library Loan.

To Tina, thank you for your love and support all these years.
