Hadoop Cluster

Purpose:

The Apache ecosystem is growing in popularity among researchers and data scientists, and Apache Hadoop and Spark have become two of the most prominent tools for analyzing data in “big data” settings. Due to the demand for this software, our team has decided to develop a system which will serve as a means of developing and running software specific to the Apache Hadoop ecosystem. This system features nodes running exclusively Hadoop-related software.

Available Software:

avro-tools, beeline, bigtop-detect-javahome, catalogd, cli_mt, cli_st, flume-ng,
hadoop, hadoop-0.20, hadoop-fuse-dfs, hadoop-fuse-dfs.orig, hbase, hbase-indexer,
hcat, hdfs, hive, hiveserver2, impalad, impala-shell, kite-dataset, llama,
llamaadmin, load_gen, mahout, mapred, oozie, oozie-setup, parquet-tools, pig,
pyspark, sentry, solrctl, spark-executor, spark-shell, spark-submit, sqoop,
sqoop2, sqoop2-server, sqoop2-tool, sqoop-codegen, sqoop-create-hive-table,
sqoop-eval, sqoop-export, sqoop-help, sqoop-import, sqoop-import-all-tables,
sqoop-job, sqoop-list-databases, sqoop-list-tables, sqoop-merge, sqoop-metastore,
sqoop-version, statestored, whirr, yarn, zookeeper-client, zookeeper-server,
zookeeper-server-cleanup, zookeeper-server-initialize

Alpha Cluster Status:

Currently, the Hadoop Cluster is in the alpha-testing phase. As such, Research Computing is only able to provide limited support for the Hadoop software.

User Interfaces:

SSH:

To run jobs on the Hadoop Cluster, you will first need to log in to the Hadoop MasterNode using the following command:

ssh hadoop.rc.usf.edu
    • Please note that only individuals who have been granted access to the Hadoop Cluster will be able to connect.

Cluster Specifications:

Configured Capacity: 15419163328512 bytes (14.02 TB)
Replication Factor: 3
Scheduling: YARN Fair Scheduling
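
Once logged in, you can check these figures for yourself. The commands below are a minimal sketch, assuming the standard HDFS client tools are available on the MasterNode:

# Report configured capacity, DFS usage, and live DataNodes
hdfs dfsadmin -report

# Print the cluster-wide default replication factor
hdfs getconf -confKey dfs.replication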

File System Specifications:

The Hadoop Cluster uses a distributed file system known as the Hadoop Distributed File System (HDFS). This file system, along with the Hadoop MapReduce framework, defines Hadoop as a data analysis system. HDFS is generally used to house both the input and output files for Hadoop jobs. The HDFS can be accessed using the following command:

hadoop fs -<command>

Where <command> is the file system command that you wish to run. The full list of Hadoop filesystem (fs) commands can be found here: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
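
As an example, the commands below illustrate a few common HDFS operations. This is only a sketch: the paths and file names shown (e.g. /user/<your-username>/input and mydata.txt) are placeholders, not actual locations on our cluster.

# List the contents of your HDFS home directory
hadoop fs -ls /user/<your-username>

# Create a directory in HDFS for job input
hadoop fs -mkdir /user/<your-username>/input

# Copy a local file into HDFS
hadoop fs -put mydata.txt /user/<your-username>/input

# Copy job output from HDFS back to the local file system
hadoop fs -get /user/<your-username>/output .

# Display a file stored in HDFS
hadoop fs -cat /user/<your-username>/output/part-r-00000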

What the Hadoop File System IS:

  • A distributed file system
  • A means of storing input and output data for your Apache Hadoop jobs

What the Hadoop File System IS NOT:

  • A file system for developing code, compiling code, or otherwise running non-Hadoop applications
  • A fully functional, general-purpose file system

Why Use HDFS?

It may seem odd to use an entire distributed file system simply to store input and output; however, there are a few reasons why this is necessary for the Hadoop system. Hadoop increases performance in the analysis of data by storing data locally on nodes and running tasks in parallel on that local data. This data locality eliminates the issues that can arise when transmitting large quantities of data across an entire cluster during computation. The HDFS is a layer of abstraction between the user running jobs and the data which has been distributed across the cluster. Essentially, by uploading input data into the HDFS, you are distributing this data in small chunks (blocks) across the entire Hadoop cluster. The Hadoop file system allows you to see these blocks as one file and run your specified jobs on them without navigating through every block and assigning tasks individually.

    • Please refer to our individual Hadoop documentation for information on running jobs using input located in the HDFS.
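
To see this distribution in action, the hdfs fsck utility can show how a file you have uploaded is split into blocks and where those blocks and their replicas live on the DataNodes. The path below is only an illustrative placeholder:

# Show the blocks, replica locations, and health of a file stored in HDFS
hdfs fsck /user/<your-username>/input/mydata.txt -files -blocks -locations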

Current Machine Roles:

Each node in the Hadoop Cluster is assigned specific roles according to the Apache software it runs. These roles can be useful for development and debugging.

Wh-520-1-2:

  • HBase Master
  • HDFS Balancer
  • HDFS NameNode
  • HDFS SecondaryNameNode
  • Hive Gateway
  • Hive Metastore Server
  • HiveServer2
  • Hue Server
  • Impala Catalog Server
  • Impala StateStore
  • Key-Value Store Indexer (Lily HBase Indexer)
  • Cloudera Management Service Alert Publisher
  • Cloudera Management Service Event Server
  • Cloudera Management Service Host Monitor
  • Cloudera Management Service Service Monitor
  • Oozie Server
  • Solr Server
  • Spark Gateway
  • Spark History Server
  • YARN (MR2 Included) JobHistory Server
  • YARN (MR2 Included) ResourceManager
  • ZooKeeper Server

Other Nodes:

  • HBase RegionServer
  • HDFS DataNode
  • Hive Gateway
  • Impala Daemon
  • Spark Gateway
  • YARN (MR2 Included) NodeManager