The Repository @ St. Cloud State

Open Access Knowledge and Scholarship

Date of Award


Culminating Project Type

Starred Paper

Degree Name

Information Assurance: M.S.


Information Assurance and Information Systems


Herberger School of Business

First Advisor

Dennis Guster

Second Advisor

Lynn Collen

Third Advisor

Keith Ewing

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Keywords and Subject Headings

hadoop, java, distributed, sqoop, hive, mysql


The volume of information generated every day is growing exponentially, and this data must be stored and analyzed. Traditional data warehousing solutions are expensive. Apache Hadoop is a popular open source data store that uses MapReduce concepts to create a distributed database architecture. This paper presents a performance analysis project that compares Apache Hive, which is built on top of Apache Hadoop, with a traditional database such as MySQL. Hive supports HiveQL (Hive Query Language), a SQL-like declarative language whose queries are translated into MapReduce jobs. These jobs are then executed on Hadoop. Hive also has a system catalog, the Metastore, which is used to index data components. The Hadoop framework was extended to include a duplicate detection system that helps manage multiple copies of the same data at the file level. The Java Server Pages and Java Servlet frameworks were used to build a Java web application that gives clients a web interface for accessing and analyzing large data sets stored in Apache Hive or MySQL databases.
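As an illustration of what file-level duplicate detection can look like, the following is a minimal sketch and not the paper's implementation. It assumes duplicates are defined as files with identical content, identifies them by comparing SHA-256 digests, and reads from the local file system through java.nio rather than from HDFS. The class name DuplicateFinder and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not taken from the paper.
// Assumption: two files are duplicates when their SHA-256 digests match.
class DuplicateFinder {

    // Streams the file through SHA-256 and returns the digest as lowercase hex.
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), digest)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // Reading drives the digest update; nothing else to do here.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Groups files by digest and keeps only groups with two or more members.
    static Map<String, List<Path>> findDuplicates(List<Path> files)
            throws IOException, NoSuchAlgorithmException {
        Map<String, List<Path>> byDigest = new LinkedHashMap<>();
        for (Path file : files) {
            byDigest.computeIfAbsent(sha256Hex(file), k -> new ArrayList<>()).add(file);
        }
        byDigest.values().removeIf(group -> group.size() < 2);
        return byDigest;
    }
}
```

Calling DuplicateFinder.findDuplicates with a list of paths returns a map from digest to the files that share it. A version working against Hadoop would read file contents through the HDFS client instead of java.nio, which this sketch does not attempt.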


This research on designing and implementing Java web applications that interact with data stored in a distributed file system was undertaken using resources provided by the Business Computing Research Laboratory of St. Cloud State University. Data used for the analyses came from the St. Cloud State University library.