Date of Award
5-2017
Culminating Project Type
Starred Paper
Degree Name
Computer Science: M.S.
Department
Computer Science and Information Technology
College
School of Science and Engineering
First Advisor
Jie H. Meichsner
Second Advisor
Omar Al-Azzam
Third Advisor
Dennis C. Guster
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
Keywords and Subject Headings
unstructured data, web server log files, Apache Pig, HDFS
Abstract
Data extraction and analysis have recently received significant attention due to the evolution of social media and the large volume of data available in unstructured form. Hadoop and MapReduce have been widely used to process and analyze large amounts of data. In this paper, Apache Pig, a high-level platform that runs on top of Hadoop for analyzing large volumes of data, is used to analyze unstructured log files and extract information. Web server log files are analyzed, and meaningful information is extracted and converted from unstructured form to structured form within the Apache Pig framework. The main purpose of this paper is to extract, transform, and load unstructured data in the Apache Pig framework and to analyze the data and its performance in local mode as well as MapReduce mode. This paper further explains in brief the different steps required to analyze unstructured web server log files in Apache Pig. This paper also compares the efficiency of processing a large volume of data in MapReduce mode and in local mode.
Recommended Citation
Niraula, Neeta, "Web Log Data Analysis: Converting Unstructured Web Log Data into Structured Data Using Apache Pig" (2017). Culminating Projects in Computer Science and Information Technology. 19.
https://repository.stcloudstate.edu/csit_etds/19
Comments/Acknowledgements
I would like to express my sincere gratitude to Dr. Dennis C. Guster, Professor, Department of Information Systems, for allowing me to undertake this work. I am grateful to my advisor and supervisor, Professor Dr. Jie H. Meichsner, Department of Computer Science and Information Technology, for her continuous guidance, advice, effort, and invaluable suggestions throughout the research. I am also grateful to my supervisor Dr. Omar Al-Azzam, Professor of Computer Science and Information Technology, for providing logistical support and valuable suggestions that helped me carry out my research successfully. I would also like to thank the lab consultants of the Department of Information Systems for helping me carry out my research, and my friends in Computer Science for their help throughout the study. Lastly, I would like to express my sincere appreciation to my family, especially my husband, for encouraging and supporting me throughout the study.