Publication Date

Spring 2015

Document Type

Project Summary

Degree Name

Master of Science


Computer Science

First Advisor

(Clare) Xueqing Tang, Ph.D.

Second Advisor

Soon-Ok Park, Ph.D.

Third Advisor

Kong-Cheng Wong, Ph.D.


Data is large and vast, with more data coming into the system every day. Summarization analytics are all about grouping similar data together and then performing an operation such as calculating a statistic, building an index, or just simply counting.

Filtering is more about understanding a smaller piece of your data, such as all records generated from a particular user, or the top ten most used verbs in a corpus of text. In short, filtering allows you to apply a microscope to your data. It can also be considered a form of search.

Hadoop allows us to modify the way data is loaded on disk in two major ways: configuring how contiguous chunks of input are generated from blocks in, and configuring how records appear in the map phase. The two classes you’ll be playing with to do this are Record Reader and Input Format. These work with the Hadoop MapReduce framework in a very similar way to how mappers and Reducers are plugged in.

This is about the analytics side of Hadoop or MapReduce. Computation in Hadoop MapReduce is performed in parallel, automatically, with a simple abstraction for developers that obviate complex synchronization and network programming. Unlike many other distributed data processing systems, Hadoop runs the user-provided processing logic on the machine where the data lives rather than dragging the data across the network; a huge win for performance.

At Q&A sites such as Experts exchange service developed and the number of users grew from thousands to millions, storing, processing, and managing all the incoming data became increasingly challenging.

There were several reasons for adopting Hadoop:

  • The distributed file system provided redundant backups for the data stored on it at no extra cost.
  • Scalability was simplified through the ability to add cheap, commodity hardware when required.
  • Hadoop provided a flexible framework for running distributed computing algorithms with a relatively easy learning curve.

Hadoop can be used to form core backend batch and near real-time computing infrastructures. It can also be used to store and archive massive datasets.


Co-authored capstone with authors listed in alphabetical order by OPUS staff.