System administrators have to enhance their skills to handle Hadoop clusters.

Here are the main tasks you would need to study to become a Hadoop administrator:

  • Configure the JobTracker and the TaskTrackers.
  • Install and use HBase.
  • Configure the Fair Scheduler to control who runs which jobs, and when.
  • Monitor the health of the cluster.
  • Plan Hadoop cluster hardware.
  • Learn the basics of Pig, Hive, and Impala; a Hadoop administrator is generally expected to know these query tools.
  • Configure HDFS for high availability.
  • Load data into HDFS.
  • Set up Hadoop security.
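The Fair Scheduler item above is driven by an allocation file. Below is a minimal sketch of an MRv1-style allocations file; the pool names and limits are hypothetical examples, and in a real cluster the file lives wherever the `mapred.fairscheduler.allocation.file` property in `mapred-site.xml` points.

```shell
# Sketch: write a minimal Fair Scheduler allocation file. The pool
# names "etl" and "adhoc" are hypothetical examples, not defaults.
cat > allocations.xml <<'EOF'
<?xml version="1.0"?>
<allocations>
  <!-- Give the ETL pool a guaranteed share and double weight -->
  <pool name="etl">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <weight>2.0</weight>
  </pool>
  <!-- Cap ad-hoc users so they cannot monopolise the cluster -->
  <pool name="adhoc">
    <maxRunningJobs>3</maxRunningJobs>
  </pool>
</allocations>
EOF
```

After editing the file, the MRv1 Fair Scheduler picks up changes without a JobTracker restart, which is why administrators usually manage it this way rather than through a redeploy.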

System administrators face a constant stream of change requests, system events, hardware failures, and other demands on their time and expertise. Fortunately, many of these tasks can now be automated. Monitoring programs, patch management systems, and asset management applications have streamlined much of the routine administration of Windows and Linux servers. Unfortunately, the same cannot be said for administering Hadoop clusters. Differences in the way Hadoop clusters operate lead to a few common and time-consuming problems for Hadoop administrators.
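Because off-the-shelf automation rarely covers Hadoop, administrators often script their own checks. The sketch below assumes the `hdfs dfsadmin -report` output contains a header line like "Dead datanodes (2):", which is the format in recent Hadoop releases; verify the format against your version before relying on it.

```shell
#!/bin/sh
# Sketch of a cron-style health check: count dead DataNodes from a
# `hdfs dfsadmin -report` dump read on stdin. The "Dead datanodes (N):"
# line format is an assumption about your Hadoop version's output.
count_dead_nodes() {
  sed -n 's/^Dead datanodes (\([0-9]*\)).*/\1/p' | head -n 1
}

# In production you would pipe the live report in, e.g.:
#   dead=$(hdfs dfsadmin -report | count_dead_nodes)
#   [ "${dead:-0}" -gt 0 ] && mail -s "HDFS dead nodes: $dead" ops@example.com
```

The mail address and alert mechanism are placeholders; the point is that even a basic alert like this has to be hand-built for Hadoop, where a Windows or Linux fleet would get it from standard monitoring tooling.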

Hadoop administrators must have sufficient knowledge of Linux shell scripting, even if they come from a Windows background. Managing data and clusters at scale presents different challenges from those of running test data through a couple of machines. Again and again, organizational deployments of Hadoop fail because they simplistically try to replicate processes and procedures tested on one or two machines across more complex clusters.
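The shell scripting involved is usually of the fan-out kind: run the same check on every worker node. A minimal sketch, in which the node inventory file, the `/data` mount point, and the 90% threshold are all hypothetical:

```shell
#!/bin/sh
# Sketch: flag worker nodes whose data disk is nearly full. The
# threshold and hostnames are hypothetical; adapt to your inventory.
THRESHOLD=90

disk_alert() {
  # Takes "host used-percent" and prints an alert line if usage is high.
  host=$1; pct=$2
  [ "$pct" -ge "$THRESHOLD" ] && echo "ALERT: $host at ${pct}%"
}

# In production each percentage would come from ssh, e.g.:
#   for h in $(cat nodes.txt); do
#     pct=$(ssh "$h" "df -P /data | awk 'NR==2 {print int(\$5)}'")
#     disk_alert "$h" "$pct"
#   done
```

On two machines you would just run `df` by hand; the habit of wrapping every check in a loop over an inventory is exactly the skill that does not transfer from a small test setup.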

Deploying data into a Hadoop cluster, whether as part of a meaningful pilot project or to deliver production enterprise workloads, is a challenging undertaking, and managing the complex interactions among parallel nodes remains a complex and often largely manual process. Each node must be actively monitored throughout its commitment to a particular workload.

The network, too, must be designed so that it does not become a bottleneck for cluster performance. Learning about Hadoop by setting up a small test cluster can lull a system administrator into thinking she or he knows how to manage a production cluster. This is rarely the case. Varying load conditions, MapReduce jobs, and errors can be difficult to manage and debug even for Hadoop experts.