Date of Award
5-2006
Culminating Project Type
Thesis
Degree Name
Computer Science: M.S.
Department
Computer Science and Information Technology
College
School of Science and Engineering
First Advisor
Ramnath Sarnath
Second Advisor
Bryant Julstrom
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
Keywords and Subject Headings
LCS, GA, Genetic Algorithm, Longest Common Subsequence, Computer Science, Technology
Abstract
The Longest Common Subsequence (LCS) problem is a known NP-complete problem: find the longest subsequence (a series of characters occurring in the same order, although not necessarily consecutively) shared by any number of strings. The LCS is not necessarily unique for a given set of strings; however, its length is. The computationally difficult version of the problem occurs when the number of strings and the LCS length are not fixed. The problem has a number of applications, from content searching to file difference listings. No single solution fits all situations, and the available deterministic solutions are written for a set number of strings and/or fixed string lengths. This remains an area of considerable active research.
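The subsequence definition above can be made concrete with a short Python sketch. The helper names `is_subsequence` and `is_common_subsequence` are illustrative assumptions and do not come from the thesis; the block only checks whether a candidate is a common subsequence and does not search for the longest one.

```python
from typing import Sequence


def is_subsequence(candidate: str, text: str) -> bool:
    """Return True if candidate's characters occur in text in the same order.

    The characters do not have to be consecutive in text.
    """
    remaining = iter(text)
    # "ch in remaining" consumes the iterator up to and including the first
    # match, so each later character must be found after the previous one.
    return all(ch in remaining for ch in candidate)


def is_common_subsequence(candidate: str, strings: Sequence[str]) -> bool:
    """Return True if candidate is a subsequence of every string in strings."""
    return all(is_subsequence(candidate, s) for s in strings)
```

For the two strings "ab" and "ba", both "a" and "b" are common subsequences of length 1, while "ab" is not common to both. This is a small instance of an LCS that is not unique even though its length is.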
For this project, a genetic algorithm (GA) was developed to find the LCS. Its performance was compared to that of the dynamic programming algorithm (DPA) using 84 test instances. The test instances consisted of three strings of lengths 100, 200, 400, 800, 1600, 3200, and 6400 with a known LCS of length 10%, 50%, 90%, and 100% of the string length, for a total of 28 instances. All 28 of these instances were created for three types of strings: a binary alphabet (|Σ| = 2), a "DNA" alphabet (|Σ| = 4), and an English alphabet (|Σ| = 26). The DPA always finds the LCS. The GA was set up to run until the best solution reached the length of the known LCS. The algorithms were compared on the CPU time needed to find the LCS. Since the GA is not deterministic, it was run 30 times on each test instance. The best and mean times were measured, along with the standard deviation of the run times.
The GA performed nearly as well as the DPA on shorter instances (length up to 400). For strings of length 800, the time required by the DPA increased dramatically and was considerably longer than the GA time. The DPA failed to run for strings longer than 800. The GA time increased much more slowly than the DPA time as the string length grew, and the GA times for the longest strings were still reasonable. The GA time was not affected by alphabet size. Except for the test instances where the LCS was 100% of the string length, the GA time increased as the length of the LCS increased. When the LCS is 100% of the string length, there are no poor starting solutions, and the algorithm only needs to grow the length of the solution. The GA was also run on a test instance containing four strings of different sizes.
In addition, the GA returns the LCS string along with its length. The DPA needs a second trip through its storage structure to extract the LCS string. The GA can handle any number of strings and any string length. The strings in the set do not need to be the same length, since the program bases its solution size on the shortest string in the set.
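The "second trip" through the DPA's storage structure can be illustrated with a minimal Python sketch of the textbook dynamic program for three strings. This is an assumption about the general technique, not the thesis's actual implementation; the function name `lcs3` and the table layout are hypothetical.

```python
def lcs3(a: str, b: str, c: str) -> str:
    """Return one longest common subsequence of three strings.

    Textbook dynamic program; uses O(len(a) * len(b) * len(c)) memory.
    """
    n, m, p = len(a), len(b), len(c)
    # table[i][j][k] holds the LCS length of the prefixes a[:i], b[:j], c[:k].
    table = [[[0] * (p + 1) for _ in range(m + 1)] for _ in range(n + 1)]

    # First pass: fill the table with LCS lengths.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(1, p + 1):
                if a[i - 1] == b[j - 1] == c[k - 1]:
                    table[i][j][k] = table[i - 1][j - 1][k - 1] + 1
                else:
                    table[i][j][k] = max(
                        table[i - 1][j][k],
                        table[i][j - 1][k],
                        table[i][j][k - 1],
                    )

    # Second pass: walk back from the far corner to recover the string itself.
    chars = []
    i, j, k = n, m, p
    while i > 0 and j > 0 and k > 0:
        if a[i - 1] == b[j - 1] == c[k - 1]:
            chars.append(a[i - 1])
            i, j, k = i - 1, j - 1, k - 1
        elif table[i - 1][j][k] == table[i][j][k]:
            i -= 1
        elif table[i][j - 1][k] == table[i][j][k]:
            j -= 1
        else:
            k -= 1
    return "".join(reversed(chars))
```

In this sketch the table for three strings of length 800 has 801^3, roughly 5.1 × 10^8, entries. Cubic growth of this kind is one plausible reason a three-string DP becomes impractical as strings grow, although the abstract does not state why the DPA failed beyond length 800.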
Recommended Citation
Hinkemeyer, Brenda, "A Genetic Algorithm for the Longest Common Subsequence Problem" (2006). Culminating Projects in Computer Science and Information Technology. 50.
https://repository.stcloudstate.edu/csit_etds/50
Comments/Acknowledgements
My deepest gratitude goes to Tom Weitzel and Alex Milowski for pushing me back into school. They coached me throughout the entire process as well. I am also grateful to my husband, Keith E. Hinkemeyer, and my children, David and Thomas, for their support; in particular, for the opportunity to pursue this degree full-time. I would like to acknowledge the encouragement and support I received from Dr. Bryant Julstrom and Dr. Jayantha Herath at St. Cloud State University. I also cannot forget Meghan, Kyra, Panda, Rye, Riika, Crazy, and Denali, who put up with a lack of attention over the past couple of years, and dogs love to get attention.