The Repository @ St. Cloud State

Open Access Knowledge and Scholarship

Date of Award


Culminating Project Type

Starred Paper

Degree Name

Computer Science: M.S.


Computer Science and Information Technology


School of Science and Engineering

First Advisor

Donald Hamnes

Second Advisor

Jie Meichsner

Third Advisor

Dennis Guster

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Keywords and Subject Headings

Performance, Comparison, Hadoop, Single Machine, External Sort, Optimizing


With the arrival of the big data era, Hadoop, developed by Doug Cutting and Mike Cafarella, was presented in 2005 [1] and opened a new chapter in the history of cloud computing. The Hadoop Distributed File System (HDFS) is one of the most fundamental layers in Hadoop. HDFS performance struggles to meet demand because the volume of data keeps growing and the rate of that growth is itself accelerating. Various new distributed file systems (DFSs) have been published in an attempt to solve this issue. The core bottleneck is the metadata service layer in HDFS, and most of the new DFSs focus on improving the metadata service. Most of this work centers on the big data problem. A small or medium-sized company, however, may not have much data. Does such a company need to build a distributed system to handle its data? Its data will, of course, keep growing, so at what point does a distributed system become the better choice for managing it? This paper attempts to address this question by comparing the performance of a distributed computation with that of a serial computation.


I would like to thank my advisor, Dr. Hamnes, for offering many valuable suggestions on my work. Without his guidance and assistance, my progress could not have gone so smoothly. I would also like to express my deep appreciation and indebtedness to the committee members, Dr. Meichsner and Dr. Guster, who contributed their time and energy to revising my paper and providing insightful suggestions. My sincere appreciation also goes to Martin Smith, who provided great support in helping me set up the hardware and system environment for the laboratory work. Finally, I thank my family, Hailei and Monica, who gave me tremendous support throughout this research.

ExternalSortCode.pdf (1443 kB)
External Java code for lab experiment


