hadoop

Companies are processed in parallel manner according to the MapReduce programming model. To this end, primitive data of companies, departments, and employees can be stored in files of fixed-size records in a distributed file system.

Illustration

While data is stored in files, the records can be de-serialized into objects for the convenience of MapReduce functionality. For instance, an object type for employees is designed as follows:

<source lang="java"> class Employee implements WritableComparable<Employee> { private Text name; private Text address; private DoubleWritable salary; private Text company; // getters, setters and omitted public void readFields(DataInput in) throws IOException { name = new Text(); name.readFields(in); address = new Text(); address.readFields(in); ... } } </source>

That is, there are properties for name, address, and salary---as usual. In addition, there is a property for the the company so that the company of each employee is immediately known without any traversal or state-based effort. Objects are populated from records on file through a readFields method that is required for any deserializable type.

A MapReduce computation consists of a mapper and a reducer. The essential methods of these components, i.e., methods map (extraction) and reduce (aggregation) are shown below:

<source lang="java"> protected void map( Text key, Employee value, Context context) throws ... { context.write( value.getCompany(), value.getSalary()); } </source>

<source lang="java"> protected void reduce( Text key, Iterable<...> values, Context context) throws ... { double total = 0; for(DoubleWritable value: values) total += value.get(); context.write(key, new DoubleWritable(total)); } </source>

That is, the map method constructs an intermediate key-value pair from each employee such that the company of an employee (say, the company name) serves as key and the employee's salary serves as value. In this manner, the MapReduce framework will correctly group together all salaries per company. Hence, the reduce method simply iterates over all salaries, grouped by key, and sums them up by a trivial aggregation loop so that a pair of the company key with the total of salaries is written to the output file.

Architecture

Package org.softlang.company hosts the object model for Feature:Hierarchical_company. Package org.softlang.operations hosts designated classes with static methods for the MapReduce jobs Feature:Total and Feature:Cut. Some boilerplate code for closed serialization is implemented in the class org.softlang.company.Company (see methods readObject and writeObject). Package org.softlang.tests hosts JUnit tests; see below.

Usage

The implementation is provided as an Eclipse project.
Hence, open the project with Eclipse; this will also build the project.
The default settings runs Hadoop on your local machine. For distributed setup see below.
There are JUnit tests available as the package org.softlang.tests.
- Run class Serialization with JUnit to create and serialize an example Company.
- Run class Basics with JUnit to exercise basic features.

Distributed Setup

An official release for Hadoop can be downloaded here: http://hadoop.apache.org/common/releases.html The Hadoop Wiki describes how to set up Hadoop in local (http://hadoop.apache.org/common/docs/current/single node setup.html) and distributed mode (http://hadoop.apache.org/common/docs/current/cluster setup.html).

These configurations require jobs being run via the command line using a jar file. The jar can be build from our Eclipse project by simply specifying the class OperationRunner as main class.

Using Hadoop in distributed mode under Eclipse is not that trivial. Instructions can be found here: http://wiki.apache.org/hadoop/EclipseEnvironment

Possible issues

Running Hadoop under Windows requires Cygwin being installed. Instructions can be found here: http://wiki.apache.org/hadoop/GettingStartedWithHadoop

Hadoop also requires ssh to access all machines (including localhost) without password prompt. Instructions how to do this can be found for example here: http://www.maths.qmul.ac.uk/~dhruba/tips and tricks/node2.html