Culminating Project Title
Date of Award
Culminating Project Type
Information Assurance: M.S.
Information Assurance and Information Systems
Herberger School of Business
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
Keywords and Subject Headings
MapReduce, Computation, optimization, Java
Today, the amount of data generated is extremely large and is growing faster than computational speeds can keep up with. Therefore, using the traditional ways or we can say using a single machine to store or process data can no longer be beneficial and can take a huge amount of time. As a result, we need a different and better way to process data such as having data distributed over large computing clusters.
Hadoop is a framework that allows the distributed processing of large data sets. Hadoop is an open source application available under the Apache License. It is designed to scale up from a single server to thousands of machines, where each machine can perform computations locally and store them.
The literature indicates that processing Big Data in a reasonable time frame can be a challenging task. One of the most promising platforms is a concept of Exascale computing. This paper created a testbed based on recommendations for Big Data within the Exascale architecture. This testbed featured three nodes, Hadoop distributed file system. Data from Twitter logs was stored in both the Hadoop file system as well as a traditional MySQL database. The Hadoop file system consistently outperformed the MySQL database. The further research uses larger data sets and more complex queries to truly assess the capabilities of distributed file systems. This research also addresses optimizing the number of processing nodes and the intercommunication paths in the underlying infrastructure of the distributed file system.
HIVE.apache.org states that the Apache HIVE data warehouse software facilitates reading, writing, and managing large datasets residing in distributes storage using SQL. At the end, there is an explanation of how to install and launch Hadoop and HIVE, how to configure the rules in a Hadoop ecosystem and the few use cases to check the performance.
Sultana, Afreen, "Using Hadoop to Support Big Data Analysis: Design and Performance Characteristics" (2015). Culminating Projects in Information Assurance. 27.