Hadoop


Description

From the Hadoop wiki: The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Alpha Cluster Status

Currently, the Hadoop Cluster is in the alpha-testing phase. As such, Research Computing can provide only limited support for the Hadoop software.

Version

  • Cloudera Manager 5.5.1
  • Hadoop 2.6.0

Authorized Users

Platforms

  • CIRCE Hadoop Cluster

Running Hadoop on the CIRCE Hadoop Cluster

Example Hadoop Job

First, you will need a Hadoop-aware application to run within the Hadoop environment. For this example, we will use the WordCount Java program cited in the Hadoop documentation. Please follow the steps below to obtain and run WordCount.java against The Project Gutenberg EBook of Ulysses, by James Joyce:
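The WordCount source itself is not reproduced on this page, but a minimal sketch of what such a program looks like is shown below, following the standard Hadoop MapReduce tutorial. The package and class name (org.myorg.WordCount) are assumed to match the class invoked in step 6; the jar actually provided at /testing/wordcount.jar may differ in detail.

package org.myorg;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every token in each input line
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts for each word
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    // args[0] = input path in HDFS, args[1] = output directory in HDFS
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}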

1) Once granted access to the Hadoop-Users group, connect to CIRCE and then connect via SSH to the Hadoop primary node, hadoop.rc.usf.edu:

[user@login0 ~]$ ssh hadoop.rc.usf.edu
[user@wh-520-1-2 ~]$ 

2) In your home directory, create a sub-directory called “hadoop_test”, and then change into that directory:

[user@wh-520-1-2 ~]$ mkdir ~/hadoop_test
[user@wh-520-1-2 ~]$ cd ~/hadoop_test

3) Obtain a copy of the input file (pg4300.txt) from http://www.gutenberg.org/cache/epub/4300/pg4300.txt

[user@wh-520-1-2 ~]$ wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt

4) Copy the input file (pg4300.txt) from your ~/hadoop_test directory into the HDFS filesystem:

[user@wh-520-1-2 ~]$ hadoop fs -put ./pg4300.txt pg4300.txt

5) Copy the WordCount .jar file from HDFS to your ~/hadoop_test directory:

[user@wh-520-1-2 ~]$ hadoop fs -get /testing/wordcount.jar ~/hadoop_test/wordcount.jar

6) Run the Hadoop job, specifying the jar file, the main class, the input file, and the output directory:

[user@wh-520-1-2 ~]$ hadoop jar ./wordcount.jar org.myorg.WordCount pg4300.txt output

7) Copy the output files from the HDFS filesystem to your ~/hadoop_test directory:

[user@wh-520-1-2 ~]$ hadoop fs -get output hadoop-output-pg4300

With the above run, the result data will be stored in the files ~/hadoop_test/hadoop-output-pg4300/part-*. A sample of the output is below:

Andalusian      2
Anderson        1
Anderson's      2
Anderson,       1
Andrew  3
Andrew's        1
Andrew, 1
Andrews,        1
Andrews.        2
Andromeda       1
Andy,   1
Anemic  1
Angels  1

Documentation

Home Page, User Guides, and Manuals

More Job Information

See the following for more detailed job submission information:

Reporting Bugs

Report bugs with Hadoop to the IT Help Desk: rc-help@usf.edu