Date of Award
5-2015
Culminating Project Type
Starred Paper
Degree Name
Information Assurance: M.S.
Department
Information Assurance and Information Systems
College
Herberger Business School
First Advisor
Dennis Guster
Second Advisor
Susantha Herath
Third Advisor
David Robinson
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
Keywords and Subject Headings
MapReduce, computation, optimization, Java
Abstract
Today, the amount of data generated is extremely large and is growing faster than computational speeds can keep up with. Storing or processing that data in the traditional way, on a single machine, is therefore no longer practical and can take an enormous amount of time. As a result, we need a better way to process data, such as distributing it over large computing clusters.
Hadoop is a framework that allows the distributed processing of large data sets. It is an open source application available under the Apache License, and it is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
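As a concrete illustration of this programming model (a sketch following the standard Apache Hadoop tutorial, not code from the paper itself), the canonical word-count MapReduce job below counts word occurrences across a distributed data set: each mapper processes the input split stored locally on its own node, and reducers aggregate the partial counts.

    // Canonical Hadoop word-count job: mappers run on the nodes that
    // store each input split; reducers sum the per-word counts.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every token in its local input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sums the counts emitted for each word by all mappers.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Compiled against the Hadoop client libraries, such a job would typically be packaged as a JAR and submitted with hadoop jar wordcount.jar WordCount <input> <output>.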
The literature indicates that processing Big Data within a reasonable time frame can be a challenging task. One of the most promising approaches is the concept of exascale computing. This paper describes a testbed, built following recommendations for Big Data within the exascale architecture, that featured three nodes running the Hadoop Distributed File System (HDFS). Data from Twitter logs was stored in both the Hadoop file system and a traditional MySQL database, and the Hadoop file system consistently outperformed the MySQL database. Further research using larger data sets and more complex queries is needed to truly assess the capabilities of distributed file systems. This research also addresses optimizing the number of processing nodes and the intercommunication paths in the underlying infrastructure of the distributed file system.
As hive.apache.org states, the Apache HIVE data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. The paper concludes with an explanation of how to install and launch Hadoop and HIVE, how to configure a Hadoop ecosystem, and a few use cases that check its performance.
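To illustrate what such a use case might look like in HiveQL (the table name, schema, and HDFS path below are hypothetical, not taken from the paper), HIVE lets ordinary SQL statements operate directly on files already stored in HDFS:

    -- Hypothetical external table over tweet logs stored in HDFS;
    -- the schema and the /data/twitter path are assumed for illustration.
    CREATE EXTERNAL TABLE IF NOT EXISTS tweets (
      user_id    BIGINT,
      created_at STRING,
      text       STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/twitter';

    -- HIVE compiles this query into distributed jobs that run across
    -- the cluster rather than on a single machine.
    SELECT user_id, COUNT(*) AS tweet_count
    FROM tweets
    GROUP BY user_id
    ORDER BY tweet_count DESC
    LIMIT 10;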
Recommended Citation
Sultana, Afreen, "Using Hadoop to Support Big Data Analysis: Design and Performance Characteristics" (2015). Culminating Projects in Information Assurance. 27.
https://repository.stcloudstate.edu/msia_etds/27
Comments/Acknowledgements
I would first like to thank my Starred Paper advisor, Dr. Dennis Guster of the Department of Information Systems at St. Cloud State University. The door to Prof. Guster's office was always open whenever I ran into a trouble spot or had a question about my research. He consistently allowed this paper to be my own work but steered me in the right direction whenever he thought I needed it.
I would also like to show my gratitude to Dr. Susantha Herath, Chair of the Department of Information Systems, and Dr. David H. Robinson, Professor in the Statistics Department, for sharing their pearls of wisdom with me during this research. I am immensely grateful for their comments on an earlier version of the manuscript, although any errors are my own and should not tarnish the reputations of these esteemed persons.