Hadoop Cluster

== Purpose: ==
The Apache ecosystem is growing in popularity among researchers and data scientists, and Apache Hadoop and Spark have become two of the most prominent tools for analyzing data in “big data” settings. Due to the demand for this software, our team has developed a system dedicated to developing and running software specific to the Apache Hadoop ecosystem. This system features nodes running exclusively Hadoop-related software.


== Available Software: ==
{| class=wikitable
|avro-tools
|llama
|sqoop-export
|-
|beeline
|llamaadmin
|sqoop-help
|-
|bigtop-detect-javahome
|load_gen
|sqoop-import
|-
|catalogd
|mahout
|sqoop-import-all-tables
|-
|cli_mt
|mapred
|sqoop-job
|-
|cli_st
|oozie
|sqoop-list-databases
|-
|flume-ng
|oozie-setup
|sqoop-list-tables
|-
|hadoop
|parquet-tools
|sqoop-merge
|-
|hadoop-0.20
|pig
|sqoop-metastore
|-
|hadoop-fuse-dfs
|pyspark
|sqoop-version
|-
|hadoop-fuse-dfs.orig
|sentry
|statestored
|-
|hbase
|solrctl
|whirr
|-
|hbase-indexer
|spark-executor<br>spark-shell
|yarn
|-
|hcat
|spark-submit
|zookeeper-client
|-
|hdfs
|sqoop
|zookeeper-server
|-
|hive
|sqoop2
|zookeeper-server-cleanup
|-
|hiveserver2
|sqoop2-server
|zookeeper-server-initialize
|-
|impalad
|sqoop2-tool
|
|-
|impala-shell
|sqoop-codegen
|
|-
|kite-dataset
|sqoop-create-hive-table
|
|-
|
|sqoop-eval
|
|}
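
Each of these tools is available from the command line on the cluster (see User Interfaces below). As a quick sanity check once you are logged in, you can, for example, print the installed Hadoop version:

<pre style="white-space:pre-wrap; width:40%; border:1px solid lightgrey; background:#000000; color:white;">
# Print the Hadoop client version installed on the node
hadoop version</pre>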


== Alpha Cluster Status: ==


Currently, the Hadoop Cluster is offline. Please monitor this page for any change in its status.
 
== User Interfaces: ==
 
SSH:
 
To run jobs on the Hadoop Cluster, you will first need to log in to the Hadoop MasterNode using the following command:
 
<pre style="white-space:pre-wrap; width:25%; border:1px solid lightgrey; background:#000000; color:white;">
ssh hadoop.rc.usf.edu</pre>
**Please note that only individuals who have been granted access to the Hadoop Cluster will be able to connect.
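
Once connected, jobs are submitted with the standard Hadoop client tools. As a minimal sketch, assuming the bundled MapReduce examples jar sits at the usual CDH location (the path below is an assumption and may differ on this cluster), you could run:

<pre style="white-space:pre-wrap; width:40%; border:1px solid lightgrey; background:#000000; color:white;">
# Estimate pi with the bundled example job (jar path is illustrative)
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100</pre>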
 
== Cluster Specifications: ==
 
<pre style="white-space:pre-wrap; width:40%; border:1px solid lightgrey; background:#000000; color:white;">
Configured Capacity: 15419163328512 (14.02 TB)
Replication Factor: 3
Scheduling: YARN Fair Scheduling</pre>
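
Once logged in, the cluster-wide default replication factor can be confirmed with the standard hdfs client:

<pre style="white-space:pre-wrap; width:40%; border:1px solid lightgrey; background:#000000; color:white;">
# Print the configured default replication factor (should report 3)
hdfs getconf -confKey dfs.replication</pre>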
 
== File System Specifications: ==
 
The Hadoop Cluster uses a distributed file system known as the Hadoop Distributed File System (HDFS). This file system, along with the Hadoop MapReduce framework, defines Hadoop as a data analysis system. HDFS is generally used to house both the input and output files for Hadoop jobs. It can be accessed using the following command:
 
<pre style="white-space:pre-wrap; width:25%; border:1px solid lightgrey; background:#000000; color:white;">
hadoop fs -<command></pre>
Here, <command> is the file system operation you wish to run. The full list of Hadoop filesystem (fs) commands is documented here: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
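
For example, a few of the most commonly used operations (the paths below are illustrative; substitute your own username and files):

<pre style="white-space:pre-wrap; width:40%; border:1px solid lightgrey; background:#000000; color:white;">
hadoop fs -ls /user/<username>             # list the contents of your HDFS directory
hadoop fs -put input.txt /user/<username>/ # copy a local file into HDFS
hadoop fs -get /user/<username>/output ./  # copy results back out of HDFS
hadoop fs -rm /user/<username>/input.txt   # remove a file from HDFS</pre>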
 
'''What the Hadoop File System IS:'''

* A distributed file system

* A means of storing input and output data for your Apache Hadoop jobs

'''What the Hadoop File System IS NOT:'''

* A file system for developing code, compiling code, or otherwise running non-Hadoop applications

* A fully functional, general-purpose file system
 
== Why Use HDFS? ==
 
It may seem odd to use an entire distributed file system simply to store input and output; however, this is necessary for the Hadoop system for a few reasons. Hadoop improves performance by storing data locally on the compute nodes and running tasks in parallel against that local data. This data locality avoids the cost of transmitting large quantities of data across the cluster during computation. HDFS is a layer of abstraction between the user running jobs and the data distributed on the cluster: when you upload input data into HDFS, it is split into small blocks that are distributed across the entire Hadoop cluster. The Hadoop file system lets you see these blocks as one file and run your jobs on them without navigating every block and assigning tasks individually.
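
You can observe this distribution directly. As a sketch (the file path below is illustrative), HDFS will report the blocks, replicas, and hosting DataNodes for any stored file:

<pre style="white-space:pre-wrap; width:40%; border:1px solid lightgrey; background:#000000; color:white;">
# Show how a file is split into blocks and where the replicas live
hdfs fsck /user/<username>/input.txt -files -blocks -locations</pre>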
 
**Please refer to our individual Hadoop documentation for information on running jobs using input located in the HDFS.
 
== Current Machine Roles: ==
 
Each node in the Hadoop Cluster is assigned specific roles for each running Apache program. These roles can be useful for development and debugging.
 
'''Wh-520-1-2:'''

* HBase Master
* HDFS Balancer
* HDFS NameNode
* HDFS SecondaryNameNode
* Hive Gateway
* Hive Metastore Server
* HiveServer2
* Hue Server
* Impala Catalog Server
* Impala StateStore
* Key-Value Store Indexer (Lily HBase Indexer)
* Cloudera Management Service Alert Publisher
* Cloudera Management Service Event Server
* Cloudera Management Service Host Monitor
* Cloudera Management Service Service Monitor
* Oozie Server
* Solr Server
* Spark Gateway
* Spark History Server
* YARN (MR2 Included) JobHistory Server
* YARN (MR2 Included) ResourceManager
* ZooKeeper Server
 
'''Other Nodes:'''

* HBase RegionServer
* HDFS DataNode
* Hive Gateway
* Impala Daemon
* Spark Gateway
* YARN (MR2 Included) NodeManager
