The Repository @ St. Cloud State

Open Access Knowledge and Scholarship

Date of Award


Culminating Project Type

Starred Paper

Degree Name

Information Assurance: M.S.


Information Assurance and Information Systems


Herberger School of Business

First Advisor

Dennis Guster

Second Advisor

Susantha Herath

Third Advisor

David Robinson

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Keywords and Subject Headings

MapReduce, Computation, optimization, Java


Today, the volume of data being generated is extremely large and is growing faster than computational speeds can keep up with. Storing and processing data in the traditional way, on a single machine, is therefore no longer practical and can take a huge amount of time. As a result, we need a different and better way to process data, such as distributing it over large computing clusters.

Hadoop is a framework that allows the distributed processing of large data sets. It is an open source application available under the Apache License. It is designed to scale up from a single server to thousands of machines, each of which provides local computation and storage.

The literature indicates that processing Big Data in a reasonable time frame can be a challenging task. One of the most promising platforms is the concept of Exascale computing. This paper created a testbed based on recommendations for Big Data within the Exascale architecture. The testbed featured a three-node Hadoop distributed file system. Data from Twitter logs was stored in both the Hadoop file system and a traditional MySQL database, and the Hadoop file system consistently outperformed the MySQL database. Further research uses larger data sets and more complex queries to truly assess the capabilities of distributed file systems. This research also addresses optimizing the number of processing nodes and the intercommunication paths in the underlying infrastructure of the distributed file system. The Apache HIVE data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Finally, the paper explains how to install and launch Hadoop and HIVE and how to configure the rules in a Hadoop ecosystem, and it presents a few use cases to check performance.


I would first like to thank my Starred Paper advisor, Dr. Dennis Guster of the Department of Information Systems at Saint Cloud State University. The door to Prof. Guster’s office was always open whenever I ran into a trouble spot or had a question about my research. He consistently allowed this paper to be my own work but steered me in the right direction whenever he thought I needed it.

I would also like to show my gratitude to Dr. Susantha Herath, Chair of the Department of Information Systems, and Dr. David H. Robinson, Professor in the Statistics Department, for sharing their pearls of wisdom with me during this research. I am immensely grateful for their comments on an earlier version of the manuscript, although any errors are my own and should not tarnish the reputations of these esteemed persons.