Hadoop Cluster
Purpose:
The Apache ecosystem is growing in popularity among researchers and data scientists. Apache Hadoop and Spark have become two of the most prominent tools for analyzing data in "big data" settings. Due to the demand for this software, our team has developed a system for developing and running software specific to the Apache Hadoop ecosystem. This system features nodes running exclusively Hadoop-related software.
Available Software:
avro-tools | llama | sqoop-export
beeline | llamaadmin | sqoop-help
bigtop-detect-javahome | load_gen | sqoop-import
catalogd | mahout | sqoop-import-all-tables
cli_mt | mapred | sqoop-job
cli_st | oozie | sqoop-list-databases
flume-ng | oozie-setup | sqoop-list-tables
hadoop | parquet-tools | sqoop-merge
hadoop-0.20 | pig | sqoop-metastore
hadoop-fuse-dfs | pyspark | sqoop-version
hadoop-fuse-dfs.orig | sentry | statestored
hbase | solrctl | whirr
hbase-indexer | spark-executor | yarn
hcat | spark-shell | zookeeper-client
hdfs | spark-submit | zookeeper-server
hive | sqoop | zookeeper-server-cleanup
hiveserver2 | sqoop2 | zookeeper-server-initialize
impalad | sqoop2-server |
impala-shell | sqoop2-tool |
kite-dataset | sqoop-codegen |
 | sqoop-create-hive-table |
 | sqoop-eval |
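Once logged in, you can verify that the tools you plan to use are available on your PATH before submitting any work. The tool list below is a subset chosen for illustration; substitute whichever commands from the table above you intend to use:

```shell
# Check which Hadoop-related tools are installed and where they live.
for tool in hadoop hdfs yarn pig hive sqoop spark-submit; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: $(command -v "$tool")"
    else
        echo "$tool: NOT FOUND"
    fi
done
```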
Alpha Cluster Status:
Currently, the Hadoop Cluster is in the alpha-testing phase. As such, Research Computing will only be able to provide limited support for the Hadoop software.
User Interfaces:
SSH:
To run jobs on the Hadoop Cluster, you will first need to log in to the Hadoop MasterNode using the following command:
ssh hadoop.rc.usf.edu
- Please note that only individuals who have been granted access to the Hadoop Cluster will be able to connect.
Cluster Specifications:
Configured Capacity: 15419163328512 bytes (14.02 TB)
Replication Factor: 3
Scheduling: YARN Fair Scheduling
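The configured capacity is reported by HDFS in bytes; Hadoop's "TB" figure uses binary (1024^4) units, which you can verify yourself:

```shell
# Convert HDFS's raw byte count to the human-readable figure shown above.
# Hadoop reports "TB" in binary (1024^4) units.
awk 'BEGIN { printf "%.2f TB\n", 15419163328512 / (1024 ^ 4) }'
# prints "14.02 TB"
```

On the cluster itself, `hdfs dfsadmin -report` prints the live capacity and usage statistics that these figures are drawn from.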
File System Specifications:
The Hadoop Cluster uses a distributed file system known as the Hadoop Distributed File System (HDFS). This file system, along with the Hadoop MapReduce framework, defines Hadoop as a data analysis system. HDFS is generally used to house both input and output files for Hadoop jobs. HDFS can be accessed using the following command:
hadoop fs -<command>
where <command> is the filesystem command that you wish to run. The full list of Hadoop filesystem (fs) commands can be found here: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
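For example, a few of the most common filesystem commands look like this (the file and directory names are placeholders; substitute your own):

```shell
# Common HDFS operations (paths below are examples only).
hadoop fs -mkdir -p /user/$USER/input          # create a directory in HDFS
hadoop fs -put mydata.txt /user/$USER/input    # copy a local file into HDFS
hadoop fs -ls /user/$USER/input                # list the directory's contents
hadoop fs -cat /user/$USER/input/mydata.txt    # print a file stored in HDFS
hadoop fs -get /user/$USER/input/mydata.txt .  # copy a file back to local disk
```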
What the Hadoop File System IS:
- A distributed file system
- A means of storing input and output data for your Apache Hadoop jobs
What the Hadoop File System IS NOT:
- A file system for developing code, compiling code, or otherwise running non-Hadoop applications
- A fully functional, general-purpose file system
Why Use HDFS?
It may seem odd to use an entire distributed file system simply for storing input and output; however, there are a few reasons why this is necessary for the Hadoop system. Hadoop increases performance by storing data locally on nodes and running tasks in parallel on that local data. This data locality eliminates issues that may arise from transmitting large quantities of data across the cluster during computation. HDFS is a layer of abstraction between the user running jobs and the data distributed on the cluster. Essentially, by uploading input data into HDFS, you are distributing that data in small blocks across the entire Hadoop cluster. The Hadoop file system allows you to see these blocks as one file and run your specified jobs on them without navigating through every block and assigning tasks individually.
- Please refer to our individual Hadoop documentation for information on running jobs using input located in the HDFS.
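As a sketch of the workflow described above, a typical job stages its input into HDFS, runs against it, and reads the results back out. The example-jar path below is an assumption (it varies by installation); the pattern itself is standard:

```shell
# Sketch of a typical workflow: stage input, run a job, view the output.
hadoop fs -mkdir -p wordcount/input
hadoop fs -put essay.txt wordcount/input

# NOTE: the path to the bundled examples jar is an assumption and
# depends on how Hadoop was installed on the cluster.
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    wordcount wordcount/input wordcount/output

hadoop fs -cat wordcount/output/part-r-00000   # view the results
```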
Current Machine Roles:
Each node in the Hadoop Cluster is assigned specific roles in accordance with the Apache programs it runs. These roles can be useful for development and debugging.
Wh-520-1-2:
HBase Master
HDFS Balancer
HDFS NameNode
HDFS SecondaryNameNode
Hive Gateway
Hive Metastore Server
HiveServer2
Hue Server
Impala Catalog Server
Impala StateStore
Key-Value Store Indexer (Lily HBase Indexer)
Cloudera Management Service Alert Publisher
Cloudera Management Service Event Server
Cloudera Management Service Host Monitor
Cloudera Management Service Service Monitor
Oozie Server
Solr Server
Spark Gateway
Spark History Server
YARN (MR2 Included) JobHistory Server
YARN (MR2 Included) ResourceManager
ZooKeeper Server
Other Nodes:
HBase RegionServer
HDFS DataNode
Hive Gateway
Impala Daemon
Spark Gateway
YARN (MR2 Included) NodeManager
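To see which of the role daemons above are actually running on a given node, the JDK's `jps` tool (available wherever a JDK is installed, which Hadoop requires) lists the Java processes by class name:

```shell
# List the JVM daemons running on the current node.
# On the master node you would expect entries such as NameNode and
# ResourceManager; on worker nodes, DataNode and NodeManager.
jps
```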