The Repository @ St. Cloud State

Open Access Knowledge and Scholarship

Date of Award

10-2016

Culminating Project Type

Starred Paper

Degree Name

Information Assurance: M.S.

Department

Information Assurance and Information Systems

College

Herberger School of Business

First Advisor

Dennis Guster

Second Advisor

Susantha Herath

Third Advisor

Balasubramanian Kasi

Creative Commons License

Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Abstract

The storage and processing of data are major issues in information technology today. Every organization has been rapidly growing data day by day, and it becomes tough for the information systems to process and respond to the various queries required of them. Banking is one such industry which needs to handle millions of data records each time. Utilizing Hadoop as a solution is one way to handle these records more effectively and in less time. From this Proof of Concept (POC), the time difference between executing queries will take much less compared to the existing database system. The growth of data challenges cutting-edge companies like Google, Yahoo, Amazon, Microsoft and many more like them. They need to go through the terabytes and even petabytes of data to figure out issues regarding these websites which are popular among people. The tools they had at the time were not equipped to cope with this issue. Then Google presented MapReduce, a system they had used to cope with this issue. The majority of companies were facing the same issue as Google, so they did not want to develop another system like Google developed, and this system was suitable for all of them. After some time, this system became open source for all of them, and many companies appreciated this effort. That system was named as Hadoop, and today it is major part of the computing world. Due to its efficiency, many more companies are going to rely on Hadoop, and they are going to establish this system in their companies. Hadoop is used for running huge distributed programs so its simplicity and accessibility give it an edge over writing and running distributed programs. Any good programmer can create his own Hadoop instance in minutes, and it is also very cheap to create. Hadoop is moreover, very scalable and robust. Due to Hadoop’s features, it is getting very popular in the academic and industrial world. MapReduce is a model of data processing and in this model, data can easily be scalable over multiple systems. In this model, two terms are used for data processing, and those are mappers and reducers. Sometimes it is nontrivial to decompose the data application into mappers and into reducers. However, once you write an application in the MapReduce format then scaling of that application to run over many hundreds of systems is not a big issue. Some minor changes may still be required to take place, however due to its efficiency and scalability, programmers are attracted towards MapReduce like a bear towards honey. According to experts, this era is an era of development of unbelievable things, and these developments require large systems with larger data storage in them to cope with the immense storage issues. Hadoop plays an effective role to cope with this issue with its scalability and many more striking features. Hadoop is also an astonishing development. There is a challenge that must be fulfilled and that is how the existing data will move to the Hadoop infrastructure, when the existing data infrastructure is based on traditional relational database and Structured Query Language (SQL). Meanwhile there is the concept of Hive. Hive provides a dialect of SQL named as Hive Query Language to fulfil the query of data storage in the cluster of Hadoop instances. Hive does not work as a database, instead it is bound to the limitations imposed by the constraints of Hadoop. The most surprising limitation is that it cannot provide record level updates, such as insert and delete. You can only make new tables, or you can perform queries to output results to files. Hive also does not provide transactional data.

Share

COinS