Hadoop
Description
From the Hadoop wiki: The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Alpha Cluster Status
Currently, the Hadoop Cluster is in the alpha-testing phase. As such, Research Computing can provide only limited support for the Hadoop software.
Version
- Cloudera Manager 5.5.1
- Hadoop 2.6.0
Authorized Users
- Members of the Hadoop User’s Group
- Access is granted by requesting to “Join” the Hadoop User’s Group
Platforms
CIRCE
Hadoop Cluster
Running Hadoop on the CIRCE Hadoop Cluster
Example Hadoop Job
First, you will need a Hadoop-aware application to run within the Hadoop environment. For this example, we will use the WordCount Java program cited in the Hadoop documentation. Please follow the steps below to obtain and run WordCount.java against The Project Gutenberg EBook of Ulysses, by James Joyce:
1) Once granted access to the Hadoop User’s Group, connect to CIRCE and then connect via SSH to the Hadoop primary node, hadoop.rc.usf.edu:
[user@login0 ~]$ ssh hadoop.rc.usf.edu
[user@wh-520-1-2 ~]$
2) In your home directory, create a sub-directory called “hadoop_test”, and then change into that directory:
[user@wh-520-1-2 ~]$ mkdir ~/hadoop_test
[user@wh-520-1-2 ~]$ cd ~/hadoop_test
3) Obtain a copy of the input file (pg4300.txt) from http://www.gutenberg.org/cache/epub/4300/pg4300.txt:
[user@wh-520-1-2 ~]$ wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt
4) Copy the input file (pg4300.txt) from your home directory into the HDFS filesystem:
[user@wh-520-1-2 ~]$ hadoop fs -put ./pg4300.txt pg4300.txt
5) Copy the WordCount .jar file from HDFS to your ~/hadoop_test directory:
[user@wh-520-1-2 ~]$ hadoop fs -get /testing/wordcount.jar ~/hadoop_test/wordcount.jar
6) Run the Hadoop job, specifying the jar file, main class, input file, and output directory:
[user@wh-520-1-2 ~]$ hadoop jar ./wordcount.jar org.myorg.WordCount pg4300.txt output
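The WordCount job invoked above maps each word in the input to a count of 1, then reduces by summing the counts for each word. As an illustration only (this is not the Java code the cluster runs), the same logic can be sketched locally in Python; note that, like the sample output further below, tokens are split on whitespace only, so punctuation stays attached to words:

```python
from collections import Counter


def map_phase(lines):
    # Map step: emit a (word, 1) pair for every whitespace-delimited
    # token, mirroring WordCount's simple tokenization.
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_phase(pairs):
    # Reduce step: sum the counts emitted for each distinct word.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts


# Tiny stand-in for the pg4300.txt input.
text = ["Andrew met Anderson,", "Andrew and Andrew walked"]
counts = reduce_phase(map_phase(text))
print(counts["Andrew"])      # 3
print(counts["Anderson,"])   # 1 -- trailing comma is part of the token
```

In the real job, the map and reduce phases run in parallel across the cluster, with Hadoop shuffling the intermediate (word, count) pairs between them.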
7) Copy the output files from the HDFS filesystem to your ~/hadoop_test directory:
[user@wh-520-1-2 ~]$ hadoop fs -get output hadoop-output-pg4300
With the above run, the result data is stored in the files ~/hadoop_test/hadoop-output-pg4300/part-*. A sample of the output is below:
Andalusian 2
Anderson 1
Anderson's 2
Anderson, 1
Andrew 3
Andrew's 1
Andrew, 1
Andrews, 1
Andrews. 2
Andromeda 1
Andy, 1
Anemic 1
Angels 1
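Each reducer writes its own part-* file, so a word's total may be split across several files when a job runs with more than one reducer. As a sketch (assuming Hadoop's default tab-separated "word<TAB>count" output format), the part files can be merged locally with Python:

```python
from collections import Counter
from pathlib import Path


def merge_parts(output_dir):
    # Sum the per-reducer counts across every part-* file in the
    # retrieved output directory. Each line is "word<TAB>count".
    total = Counter()
    for part in sorted(Path(output_dir).glob("part-*")):
        for line in part.read_text().splitlines():
            word, sep, count = line.rpartition("\t")
            if sep:  # skip any malformed line without a tab
                total[word] += int(count)
    return total
```

For example, merge_parts("hadoop-output-pg4300") would return a single Counter of word totals for the whole job.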
Documentation
Home Page, User Guides, and Manuals
- Hadoop Home Page:
- Hadoop Documentation:
- Hadoop HDFS Command Overview:
More Job Information
See the following for more detailed job submission information:
Reporting Bugs
Report bugs with Hadoop to the IT Help Desk: rc-help@usf.edu