Headline

Data-parallel processing with Hadoop

Motivation

Companies are processed in parallel manner according to the MapReduce programming model. To this end, primitive data of companies, departments, and employees can be stored in files of fixed-size records in a distributed file system.

Illustration

While data is stored in files, the records can be de-serialized into objects for the convenience of MapReduce functionality. For instance, an object type for employees is designed as follows:

<source lang="java"> class Employee implements WritableComparable<Employee> { private Text name; private Text address; private DoubleWritable salary; private Text company; // getters, setters and omitted public void readFields(DataInput in) throws IOException { name = new Text(); name.readFields(in); address = new Text(); address.readFields(in); ... } } </source>

That is, there are properties for name, address, and salary---as usual. In addition, there is a property for the the company so that the company of each employee is immediately known without any traversal or state-based effort. Objects are populated from records on file through a readFields method that is required for any deserializable type.

A MapReduce computation consists of a mapper and a reducer. The essential methods of these components, i.e., methods map (extraction) and reduce (aggregation) are shown below:

<source lang="java"> protected void map( Text key, Employee value, Context context) throws ... { context.write( value.getCompany(), value.getSalary()); } </source>

<source lang="java"> protected void reduce( Text key, Iterable<...> values, Context context) throws ... { double total = 0; for(DoubleWritable value: values) total += value.get(); context.write(key, new DoubleWritable(total)); } </source>

That is, the map method constructs an intermediate key-value pair from each employee such that the company of an employee (say, the company name) serves as key and the employee's salary serves as value. In this manner, the MapReduce framework will correctly group together all salaries per company. Hence, the reduce method simply iterates over all salaries, grouped by key, and sums them up by a trivial aggregation loop so that a pair of the company key with the total of salaries is written to the output file.

Architecture

Package org.softlang.company hosts the object model for Feature:Hierarchical_company. Package org.softlang.operations hosts designated classes with static methods for the MapReduce jobs Feature:Total and Feature:Cut. Some boilerplate code for closed serialization is implemented in the class org.softlang.company.Company (see methods readObject and writeObject). Package org.softlang.tests hosts JUnit tests; see below.

Usage

  • The implementation is provided as an Eclipse project.
  • Hence, open the project with Eclipse; this will also build the project.
  • The default settings runs Hadoop on your local machine. For distributed setup see below.
  • There are JUnit tests available as the package org.softlang.tests.
    • Run class Serialization with JUnit to create and serialize an example Company.
    • Run class Basics with JUnit to exercise basic features.

Distributed Setup

An official release for Hadoop can be downloaded here: http://hadoop.apache.org/common/releases.html The Hadoop Wiki describes how to set up Hadoop in local (http://hadoop.apache.org/common/docs/current/single node setup.html) and distributed mode (http://hadoop.apache.org/common/docs/current/cluster setup.html).

These configurations require jobs being run via the command line using a jar file. The jar can be build from our Eclipse project by simply specifying the class OperationRunner as main class.

Using Hadoop in distributed mode under Eclipse is not that trivial. Instructions can be found here: http://wiki.apache.org/hadoop/EclipseEnvironment

Possible issues

Running Hadoop under Windows requires Cygwin being installed. Instructions can be found here: http://wiki.apache.org/hadoop/GettingStartedWithHadoop

Hadoop also requires ssh to access all machines (including localhost) without password prompt. Instructions how to do this can be found for example here: http://www.maths.qmul.ac.uk/~dhruba/tips and tricks/node2.html

Metadata


There are no revisions for this page.

User contributions

    This user never has never made submissions.

    User edits

    Syntax for editing wiki

    For you are available next options:

    will make text bold.

    will make text italic.

    will make text underlined.

    will make text striked.

    will allow you to paste code headline into the page.

    will allow you to link into the page.

    will allow you to paste code with syntax highlight into the page. You will need to define used programming language.

    will allow you to paste image into the page.

    is list with bullets.

    is list with numbers.

    will allow your to insert slideshare presentation into the page. You need to copy link to presentation and insert it as parameter in this tag.

    will allow your to insert youtube video into the page. You need to copy link to youtube page with video and insert it as parameter in this tag.

    will allow your to insert code snippets from @worker.

    Syntax for editing wiki

    For you are available next options:

    will make text bold.

    will make text italic.

    will make text underlined.

    will make text striked.

    will allow you to paste code headline into the page.

    will allow you to link into the page.

    will allow you to paste code with syntax highlight into the page. You will need to define used programming language.

    will allow you to paste image into the page.

    is list with bullets.

    is list with numbers.

    will allow your to insert slideshare presentation into the page. You need to copy link to presentation and insert it as parameter in this tag.

    will allow your to insert youtube video into the page. You need to copy link to youtube page with video and insert it as parameter in this tag.

    will allow your to insert code snippets from @worker.