Hadoop Cluster

== Purpose: ==
The Apache ecosystem is growing in popularity among researchers and data scientists, and Apache Hadoop and Spark have become two of the most prominent tools for analyzing data in “big data” settings. Due to the demand for this software, our team has developed a system dedicated to developing and running software specific to the Apache Hadoop ecosystem. This system features nodes running exclusively Hadoop-related software.


== Available Software: ==
{| class=wikitable
|avro-tools
|llama
|sqoop-export
|-
|beeline
|llamaadmin
|sqoop-help
|-
|bigtop-detect-javahome
|load_gen
|sqoop-import
|-
|catalogd
|mahout
|sqoop-import-all-tables
|-
|cli_mt
|mapred
|sqoop-job
|-
|cli_st
|oozie
|sqoop-list-databases
|-
|flume-ng
|oozie-setup
|sqoop-list-tables
|-
|hadoop
|parquet-tools
|sqoop-merge
|-
|hadoop-0.20
|pig
|sqoop-metastore
|-
|hadoop-fuse-dfs
|pyspark
|sqoop-version
|-
|hadoop-fuse-dfs.orig
|sentry
|statestored
|-
|hbase
|solrctl
|whirr
|-
|hbase-indexer
|spark-executor<br>spark-shell
|yarn
|-
|hcat
|spark-submit
|zookeeper-client
|-
|hdfs
|sqoop
|zookeeper-server
|-
|hive
|sqoop2
|zookeeper-server-cleanup
|-
|hiveserver2
|sqoop2-server
|zookeeper-server-initialize
|-
|impalad
|sqoop2-tool
|
|-
|impala-shell
|sqoop-codegen
|
|-
|kite-dataset
|sqoop-create-hive-table
|
|-
|
|sqoop-eval
|
|}
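
Each of these tools is available from the command line on the cluster (see User Interfaces below). As a quick sanity check once you are logged in, you can, for example, print the installed Hadoop version:

<pre style="white-space:pre-wrap; width:40%; border:1px solid lightgrey; background:#000000; color:white;">
# Print the Hadoop client version installed on the node
hadoop version</pre>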


== Alpha Cluster Status: ==


Currently, the Hadoop Cluster is offline. Please monitor this page for any change in its status.
 
== User Interfaces: ==
 
SSH:
 
To run jobs on the Hadoop Cluster, you will first need to log in to the Hadoop MasterNode using the following command:
 
<pre style="white-space:pre-wrap; width:25%; border:1px solid lightgrey; background:#000000; color:white;">
ssh hadoop.rc.usf.edu</pre>
**Please note that only individuals who have been granted access to the Hadoop Cluster will be able to connect.
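
Once connected, jobs are submitted with the standard Hadoop client tools. As a minimal sketch, assuming the bundled MapReduce examples jar sits at the usual CDH location (the path below is an assumption and may differ on this cluster), you could run:

<pre style="white-space:pre-wrap; width:40%; border:1px solid lightgrey; background:#000000; color:white;">
# Estimate pi with the bundled example job (jar path is illustrative)
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 10 100</pre>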
 
== Cluster Specifications: ==
 
<pre style="white-space:pre-wrap; width:40%; border:1px solid lightgrey; background:#000000; color:white;">
Configured Capacity: 15419163328512 (14.02 TB)
Replication Factor: 3
Scheduling: YARN Fair Scheduling</pre>
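
Once logged in, the cluster-wide default replication factor can be confirmed with the standard hdfs client:

<pre style="white-space:pre-wrap; width:40%; border:1px solid lightgrey; background:#000000; color:white;">
# Print the configured default replication factor (should report 3)
hdfs getconf -confKey dfs.replication</pre>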
 
== File System Specifications: ==
 
The Hadoop Cluster uses a distributed file system known as the Hadoop Distributed File System (HDFS). This file system, along with the Hadoop MapReduce framework, defines Hadoop as a data analysis system. HDFS is generally used to house both the input and output files for Hadoop jobs. It can be accessed using the following command:
 
<pre style="white-space:pre-wrap; width:25%; border:1px solid lightgrey; background:#000000; color:white;">
hadoop fs -<command></pre>
Here, <command> is the file system operation you wish to run. The full list of Hadoop filesystem (fs) commands is documented here: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
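
For example, a few of the most commonly used operations (the paths below are illustrative; substitute your own username and files):

<pre style="white-space:pre-wrap; width:40%; border:1px solid lightgrey; background:#000000; color:white;">
hadoop fs -ls /user/<username>             # list the contents of your HDFS directory
hadoop fs -put input.txt /user/<username>/ # copy a local file into HDFS
hadoop fs -get /user/<username>/output ./  # copy results back out of HDFS
hadoop fs -rm /user/<username>/input.txt   # remove a file from HDFS</pre>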
 
'''What the Hadoop File System IS:'''

* A distributed file system

* A means of storing input and output data for your Apache Hadoop jobs

'''What the Hadoop File System IS NOT:'''

* A file system for developing code, compiling code, or otherwise running non-Hadoop applications

* A fully functional, general-purpose file system
 
== Why Use HDFS? ==
 
It may seem odd to use an entire distributed file system simply to store input and output; however, this is necessary for the Hadoop system for a few reasons. Hadoop improves performance by storing data locally on the compute nodes and running tasks in parallel against that local data. This data locality avoids the cost of transmitting large quantities of data across the cluster during computation. HDFS is a layer of abstraction between the user running jobs and the data distributed on the cluster: when you upload input data into HDFS, it is split into small blocks that are distributed across the entire Hadoop cluster. The Hadoop file system lets you see these blocks as one file and run your jobs on them without navigating every block and assigning tasks individually.
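
You can observe this distribution directly. As a sketch (the file path below is illustrative), HDFS will report the blocks, replicas, and hosting DataNodes for any stored file:

<pre style="white-space:pre-wrap; width:40%; border:1px solid lightgrey; background:#000000; color:white;">
# Show how a file is split into blocks and where the replicas live
hdfs fsck /user/<username>/input.txt -files -blocks -locations</pre>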
 
**Please refer to our individual Hadoop documentation for information on running jobs using input located in the HDFS.
 
== Current Machine Roles: ==
 
Each node in the Hadoop Cluster is assigned specific roles for each running Apache program. These roles can be useful for development and debugging.
 
'''Wh-520-1-2:'''

* HBase Master
* HDFS Balancer
* HDFS NameNode
* HDFS SecondaryNameNode
* Hive Gateway
* Hive Metastore Server
* HiveServer2
* Hue Server
* Impala Catalog Server
* Impala StateStore
* Key-Value Store Indexer (Lily HBase Indexer)
* Cloudera Management Service Alert Publisher
* Cloudera Management Service Event Server
* Cloudera Management Service Host Monitor
* Cloudera Management Service Service Monitor
* Oozie Server
* Solr Server
* Spark Gateway
* Spark History Server
* YARN (MR2 Included) JobHistory Server
* YARN (MR2 Included) ResourceManager
* ZooKeeper Server
 
'''Other Nodes:'''

* HBase RegionServer
* HDFS DataNode
* Hive Gateway
* Impala Daemon
* Spark Gateway
* YARN (MR2 Included) NodeManager
