Hadoop Cluster

Purpose:

The Apache ecosystem is growing in popularity among researchers and data scientists, and Apache Hadoop and Spark have become two of the most prominent tools for analyzing data in “big data” settings. Due to the demand for this software, our team has decided to develop a system which will serve as a means of developing and running software specific to the Apache Hadoop ecosystem. This system features nodes running exclusively Hadoop-related software.

Available Software:

avro-tools, beeline, bigtop-detect-javahome, catalogd, cli_mt, cli_st, flume-ng,
hadoop, hadoop-0.20, hadoop-fuse-dfs, hadoop-fuse-dfs.orig, hbase, hbase-indexer,
hcat, hdfs, hive, hiveserver2, impalad, impala-shell, kite-dataset, llama,
llamaadmin, load_gen, mahout, mapred, oozie, oozie-setup, parquet-tools, pig,
pyspark, sentry, solrctl, spark-executor, spark-shell, spark-submit, sqoop,
sqoop2, sqoop2-server, sqoop2-tool, sqoop-codegen, sqoop-create-hive-table,
sqoop-eval, sqoop-export, sqoop-help, sqoop-import, sqoop-import-all-tables,
sqoop-job, sqoop-list-databases, sqoop-list-tables, sqoop-merge, sqoop-metastore,
sqoop-version, statestored, whirr, yarn, zookeeper-client, zookeeper-server,
zookeeper-server-cleanup, zookeeper-server-initialize

Alpha Cluster Status:

Currently, the Hadoop Cluster is in the alpha-testing phase. As such, Research Computing is only able to provide limited support for the Hadoop software.

User Interfaces:

SSH:

To run jobs on the Hadoop Cluster, you will first need to log in to the Hadoop MasterNode using the following command:

ssh hadoop.rc.usf.edu
    • Please note that only individuals who have been granted access to the Hadoop Cluster will be able to connect.

Cluster Specifications:

Configured Capacity: 15419163328512 bytes (14.02 TB)
Replication Factor: 3
Scheduling: YARN Fair Scheduling
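
Once logged in, you can check these figures for yourself. The commands below are a minimal sketch, assuming the standard HDFS client tools are available on the MasterNode:

# Report configured capacity, DFS usage, and live DataNodes
hdfs dfsadmin -report

# Print the cluster-wide default replication factor
hdfs getconf -confKey dfs.replication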

File System Specifications:

The Hadoop Cluster uses a distributed file system known as the Hadoop Distributed File System (HDFS). This file system, along with the Hadoop MapReduce framework, defines Hadoop as a data analysis system. HDFS is generally used to house both the input and output files for Hadoop jobs. The HDFS can be accessed using the following command:

hadoop fs -<command>

Where <command> is the file system command that you wish to run. The full list of Hadoop filesystem (fs) commands can be found here: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
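
As an example, the commands below illustrate a few common HDFS operations. This is only a sketch: the paths and file names shown (e.g. /user/<your-username>/input and mydata.txt) are placeholders, not actual locations on our cluster.

# List the contents of your HDFS home directory
hadoop fs -ls /user/<your-username>

# Create a directory in HDFS for job input
hadoop fs -mkdir /user/<your-username>/input

# Copy a local file into HDFS
hadoop fs -put mydata.txt /user/<your-username>/input

# Copy job output from HDFS back to the local file system
hadoop fs -get /user/<your-username>/output .

# Display a file stored in HDFS
hadoop fs -cat /user/<your-username>/output/part-r-00000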

What the Hadoop File System IS:

  • A distributed file system
  • A means of storing input and output data for your Apache Hadoop jobs

What the Hadoop File System IS NOT:

  • A file system for developing code, compiling code, or otherwise running non-Hadoop applications
  • A fully functional, general-purpose file system

Why Use HDFS?

It may seem odd to use an entire distributed file system simply to store input and output; however, there are a few reasons why this is necessary for the Hadoop system. Hadoop increases performance in the analysis of data by storing data locally on nodes and running tasks in parallel on that local data. This data locality eliminates the issues that can arise when transmitting large quantities of data across an entire cluster during computation. The HDFS is a layer of abstraction between the user running jobs and the data which has been distributed across the cluster. Essentially, by uploading input data into the HDFS, you are distributing this data in small chunks (blocks) across the entire Hadoop cluster. The Hadoop file system allows you to see these blocks as one file and run your specified jobs on them without navigating through every block and assigning tasks individually.

    • Please refer to our individual Hadoop documentation for information on running jobs using input located in the HDFS.
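
To see this distribution in action, the hdfs fsck utility can show how a file you have uploaded is split into blocks and where those blocks and their replicas live on the DataNodes. The path below is only an illustrative placeholder:

# Show the blocks, replica locations, and health of a file stored in HDFS
hdfs fsck /user/<your-username>/input/mydata.txt -files -blocks -locations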

Current Machine Roles:

Each node in the Hadoop Cluster is assigned specific roles according to the Apache software it runs. These roles can be useful for development and debugging.

Wh-520-1-2:

  • HBase Master
  • HDFS Balancer
  • HDFS NameNode
  • HDFS SecondaryNameNode
  • Hive Gateway
  • Hive Metastore Server
  • HiveServer2
  • Hue Server
  • Impala Catalog Server
  • Impala StateStore
  • Key-Value Store Indexer (Lily HBase Indexer)
  • Cloudera Management Service Alert Publisher
  • Cloudera Management Service Event Server
  • Cloudera Management Service Host Monitor
  • Cloudera Management Service Service Monitor
  • Oozie Server
  • Solr Server
  • Spark Gateway
  • Spark History Server
  • YARN (MR2 Included) JobHistory Server
  • YARN (MR2 Included) ResourceManager
  • ZooKeeper Server

Other Nodes:

  • HBase RegionServer
  • HDFS DataNode
  • Hive Gateway
  • Impala Daemon
  • Spark Gateway
  • YARN (MR2 Included) NodeManager