== Description ==

''From the Hadoop wiki:'' The '''Apache Hadoop''' software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.


== Alpha Cluster Status ==


Currently, the [[Hadoop_Cluster|Hadoop Cluster]] is offline. Please monitor this page for any change in its status.
 
== Version ==
 
*Cloudera Manager 5.5.1
*Hadoop 2.6.0
 
== Authorized Users ==
 
*Members of the [https://cwa.rc.usf.edu/cwa_groups/research-computing/show/hadoop-users Hadoop User’s Group]
** Access is granted by requesting to “Join” the [https://cwa.rc.usf.edu/cwa_groups/research-computing/show/hadoop-users Hadoop User’s Group]
 
== Platforms ==
 
*<code>CIRCE</code> Hadoop Cluster
 
== Running Hadoop on the CIRCE Hadoop Cluster ==
 
=== Example Hadoop Job ===
 
First, you will need a Hadoop-aware application to run within the Hadoop environment. For this example, we will use the WordCount Java program from the Hadoop documentation. Please follow the steps below to obtain and run WordCount against The Project Gutenberg EBook of '''''Ulysses''''', by James Joyce:
 
1) Once granted access to the Hadoop User’s Group, [[Connecting_To_CIRCE|connect to CIRCE]] and then connect via SSH to the Hadoop primary node, hadoop.rc.usf.edu:
 
<pre style="white-space:pre-wrap; width:30%; border:1px solid lightgrey; background:#000000; color:white;">[user@login0 ~]$ ssh hadoop.rc.usf.edu
[user@wh-520-1-2 ~]$
</pre>
 
2) In your home directory, create a sub-directory called “hadoop_test”, and then change into that directory:
 
<pre style="white-space:pre-wrap; width:40%; border:1px solid lightgrey; background:#000000; color:white;">[user@wh-520-1-2 ~]$ mkdir ~/hadoop_test
[user@wh-520-1-2 ~]$ cd ~/hadoop_test</pre>
 
3) Obtain a copy of the input file (pg4300.txt) from http://www.gutenberg.org/cache/epub/4300/pg4300.txt
 
<pre style="white-space:pre-wrap; width:65%; border:1px solid lightgrey; background:#000000; color:white;">[user@wh-520-1-2 ~]$ wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt</pre>
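As a quick, optional sanity check, you can look at the first few lines of the downloaded file with the standard <code>head</code> utility to confirm it arrived intact:

<pre style="white-space:pre-wrap; width:45%; border:1px solid lightgrey; background:#000000; color:white;">[user@wh-520-1-2 ~]$ head -3 pg4300.txt</pre>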
 
4) Copy the input file (pg4300.txt) from your home directory into the HDFS filesystem:
 
<pre style="white-space:pre-wrap; width:50%; border:1px solid lightgrey; background:#000000; color:white;">[user@wh-520-1-2 ~]$ hadoop fs -put ./pg4300.txt pg4300.txt</pre>
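You can verify that the file landed in your HDFS home directory with <code>hadoop fs -ls</code>, which lists your HDFS home directory when given no path:

<pre style="white-space:pre-wrap; width:40%; border:1px solid lightgrey; background:#000000; color:white;">[user@wh-520-1-2 ~]$ hadoop fs -ls</pre>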
 
5) Copy the WordCount .jar file from HDFS to your ~/hadoop_test directory:
 
<pre style="white-space:pre-wrap; width:70%; border:1px solid lightgrey; background:#000000; color:white;">[user@wh-520-1-2 ~]$ hadoop fs -get /testing/wordcount.jar ~/hadoop_test/wordcount.jar</pre>
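If you would like to confirm that the jar contains the <code>org.myorg.WordCount</code> class used in the next step, the JDK’s <code>jar</code> tool can list its contents (this assumes a JDK is on your path, which should be the case on a Hadoop node):

<pre style="white-space:pre-wrap; width:45%; border:1px solid lightgrey; background:#000000; color:white;">[user@wh-520-1-2 ~]$ jar tf wordcount.jar</pre>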
 
6) Run the Hadoop job, specifying the jar file, the main class, the input file in HDFS, and the HDFS output directory:
 
<pre style="white-space:pre-wrap; width:70%; border:1px solid lightgrey; background:#000000; color:white;">[user@wh-520-1-2 ~]$ hadoop jar ./wordcount.jar org.myorg.WordCount pg4300.txt output</pre>
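Note that Hadoop will abort the job if the output directory already exists in HDFS. If you need to re-run the example, remove the old output directory first:

<pre style="white-space:pre-wrap; width:45%; border:1px solid lightgrey; background:#000000; color:white;">[user@wh-520-1-2 ~]$ hadoop fs -rm -r output</pre>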
 
7) Copy the output files from the HDFS filesystem to your ~/hadoop_test directory:
 
<pre style="white-space:pre-wrap; width:55%; border:1px solid lightgrey; background:#000000; color:white;">[user@wh-520-1-2 ~]$ hadoop fs -get output hadoop-output-pg4300</pre>
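If you only want a quick look at the results, you can also read them directly from HDFS without copying them back, for example:

<pre style="white-space:pre-wrap; width:55%; border:1px solid lightgrey; background:#000000; color:white;">[user@wh-520-1-2 ~]$ hadoop fs -cat output/part-* | head</pre>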
 
With the above run, the result data will be stored in the files ~/hadoop_test/hadoop-output-pg4300/part-*. A sample of the output is below:
 
<pre style="white-space:pre-wrap; width:45%; border:1px solid lightgrey; background:#E0E0E0; color:black;">
Andalusian      2
Anderson        1
Anderson's      2
Anderson,      1
Andrew  3
Andrew's        1
Andrew, 1
Andrews,        1
Andrews.        2
Andromeda      1
Andy,  1
Anemic  1
Angels  1
</pre>
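WordCount writes one tab-separated word/count pair per line, so ordinary shell tools can post-process the results. For example, to list the ten most frequent words (this assumes a bash shell, since <code>$'\t'</code> is bash quoting):

<pre style="white-space:pre-wrap; width:75%; border:1px solid lightgrey; background:#000000; color:white;">[user@wh-520-1-2 ~]$ sort -t$'\t' -k2,2nr ~/hadoop_test/hadoop-output-pg4300/part-* | head</pre>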
 
{{Documentation}}
*{{PAGENAME}} Home Page:
**https://hadoop.apache.org
*{{PAGENAME}} Documentation:
**https://hadoop.apache.org/docs/stable/
*{{PAGENAME}} HDFS Command Overview:
**https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html


{{AppStandardFooter}}

