The Repository @ St. Cloud State

Open Access Knowledge and Scholarship

Date of Award


Culminating Project Type

Starred Paper

Degree Name

Information Assurance: M.S.


Information Assurance and Information Systems


Herberger School of Business

First Advisor

Dennis Guster

Second Advisor

Susantha Herath

Third Advisor

Balasubramanian Kasi

Creative Commons License

Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.


The exponential development of data initially exhibited difficulties for prominent organizations, for example, Google, Yahoo, Amazon, Microsoft, Facebook, Twitter and so forth. The size of the information that needs to be handled by cloud applications is developing significantly quicker than storage capacity. This development requires new systems for managing and breaking down data. The term "Big Data" is used to address large volumes of unstructured (or semi-structured) and structured data that gets created from different applications, messages, weblogs, and online networking.

Big Data is data whose size, variety and uncertainty require new supplementary models, procedures, algorithms, and research to manage and extract value and concealed learning from it. To process more information efficiently and skillfully, for analysis parallelism is utilized. To deal with the unstructured and semi-structured information NoSQL database has been presented. Hadoop better serves the Big Data analysis requirements. It is intended to scale up starting from a single server to a large cluster of machines, which has a high level of adaptation to internal failure.

Many business and research institutes such as Facebook, Yahoo, Google, and so on had an expanding need to import, store, and analyze dynamic semi-structured data and its metadata. Also, significant development of semi-structured data inside expansive web-based organizations has prompted the formation of NoSQL data collections for flexible sorting and MapReduce for adaptable parallel analysis. They assessed, used and altered Hadoop, the most popular open source execution of MapReduce, for tending to the necessities of various valid analytics problems. These institutes are also utilizing MongoDB, and a report situated NoSQL store. In any case, there is a limited comprehension of the execution trade-offs of using these two innovations. This paper assesses the execution, versatility, and adaptation to an internal failure of utilizing MongoDB and Hadoop, towards the objective of recognizing the correct programming condition for logical data analytics and research. Lately, an expanding number of organizations have developed diverse, distinctive kinds of non-relational databases (such as MongoDB, Cassandra, Hypertable, HBase/ Hadoop, CouchDB and so on), generally referred to as NoSQL databases. The enormous amount of information generated requires an effective system to analyze the data in various scenarios, under various breaking points. In this paper, the objective is to find the break-even point of both Hadoop/Pig and MongoDB and develop a robust environment for data analytics.