This article is about my experience with the Cloudera Administrator Exam and how to study for it. I took the exam before they reworked the platform, but the exam requirements remain the same. You can find more information from Cloudera on the exam here.
If you want to know more about how and what to study for the exam you can skip to the Studying for the Exam section of the article.
If you have any questions or suggestions, feel free to leave a comment.
Advised Previous Experience
This exam expects you to have previous Linux SysAdmin knowledge, and some experience administering the Cloudera Stack. If you’ve never worked with Cloudera before, or don’t have a modicum of experience using a terminal and solving administrative issues, this is not going to be easy for you :D
If you’re comfortable around the Cloudera Manager, and have debugged problems with JDK versions on cluster nodes (as an example), you’re probably going to do just fine on the exam. It’s not terribly complicated, but it is tricky. In the studying sections of this article I talk a bit about my experience preparing for the actual exam, and share some views that might give you a better glimpse of what you’re going to face.
About Taking the Exam
On the Cloudera Certification FAQ you’ll find details of what you need to ensure in order to take the exam.
You really do have to follow the requirements for your desk, the room you’re taking the exam in, and noise to the letter. I had to switch rooms before I started the exam because “there was too much” on my desk and walls. Also, they don’t really tell you this anywhere, but you’re not supposed to wear headphones, and they will ask you to take them off. Make sure your camera and mic work, and if you’re on a laptop, be ready to move it around to show the examiner your room.
The platform at the time was pretty slow, and it will stress you out if you don’t take deep breaths. I think they mostly fixed those delay issues with the new platform they introduced, but if they didn’t... well, be ready.
Also, you can view some Cloudera and Apache documentation during the exam, so knowing your way around the documentation and how to get to the relevant info is important. I didn’t have much luck navigating it, since the platform was slow and it just ate up precious time. Be ready for it to fail on you, and know how to reach the pages you need if you really have to.
The exam duration is 120 minutes, and it’s tight. If you have doubts on a few questions, or you’re struggling with a particular one, you will hit the time limit. If you’re too stuck on a question, move on to the next one and come back when you’ve cleared the rest.
Studying for the Exam
In this case, and in almost all the others, Google is your friend :), so searching for feedback on the exam, answers to your questions and diving into the Cloudera and Apache Documentation is something you should be doing often.
Along those lines, I found that Kannan over at Hadoop and Cloud has cool and detailed information on each exam requirement.
The video that was previously showcased on the Cloudera exam page was pretty interesting and gave neat insights into how the questions were structured, so check it out if you haven’t.
Since this is a hands-on exam, you can’t get away with knowing how to do things in theory. The Cloudera Hadoop Environment has a lot of ins and outs that you’ll only know about by doing. My advice would be for you to make a checklist of items based on:
- The exam requirements;
- What you need to learn how to do by heart;
- What you don’t understand completely;
- Your experience.
And actually get to it.
Getting your Hands Dirty
So, if you’ve never set up a Cloudera cluster before, this should be a great excuse to start.
Set up a few VMs locally (or in the cloud), or give it a go with containers, and get at it!
Install a cluster with HDFS, Hive, Hue, Kafka, Kudu, Impala and all their required dependencies. You can also try and set up a local CDH repository and take one item out of the list.
If you’ve set one up before, or you have a test cluster lying around, get ready to give it some use, potentially break it, and hopefully fix it.
If you break the cluster in a way that leaves most of the services unusable, say bricking HDFS, you’re pretty much done on the exam, so double-check the state of your cluster after every change.
My Knowledge Checklist
So, during my study I composed this checklist, based on the exam requirements, and other information I could gather:
- HDFS
- Setting up High Availability
- How to make and restore a Checkpoint
- How to make and restore Snapshots
- How to change block size
- What is the Sticky Bit
- How to use and change ACLs
- The fsck and dfsadmin commands
- Load Balancing with a Specific Bandwidth (Balancer role and terminal commands)
- How does HttpFS work
- Changing and Setting up Trash
- Compression
- Cloudera Manager
- Set up Alerting
- OS Level Configurations required
- Make and use Host Templates
- Make a CDH Repo
- Make and use Role Groups
- Set up and use Racks
- Other Services
- Set up a proxy for Hive/Impala
- Install and perform a Sqoop from MySQL to Hive
- Look at Flume, and try to use it.
- Hue/Sentry authorization/authentication
- Look at Yarn Fair Scheduler
- Change Yarn/Impala pools
You should also look into the many settings you can access through the Cloudera Manager; you might be asked to change some of them for a number of different reasons.
In case you hadn’t noticed yet, the CM is going to be a major part of the exam, and knowing how to use it properly will make a major difference.
Based on this checklist and my study, I also wrote the notes available below. For my exam I used CDH 5.14, but the new exam uses CDH 6, so some changes might be needed.
Change HDFS Block Size
In order to change the HDFS block size, you need to make a new copy of the existing file:
hdfs dfs -D dfs.block.size=134217728 -cp /origin /dest
Note that the block size is in bytes.
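If you want to double-check that the copy actually picked up the new block size, hdfs dfs -stat can print it (the file path here is just an example):
hdfs dfs -stat "%o" /dest/file.txt
# %o prints the block size of the file in bytes, e.g. 134217728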
Sqoop
Working with Sqoop is a little out of the scope of this article, so here’s a link to a walkthrough on how to import data into Hive from MySQL:
Sqoop: Import Data From MySQL to Hive - DZone Big Data
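For a rough idea of the shape of such an import, a minimal command looks something like this (host, database, credentials and table names are made up for illustration):
# Import a MySQL table straight into Hive; -P prompts for the password
sqoop import --connect jdbc:mysql://dbhost/shop --username dbuser -P --table customers --hive-import --hive-table customers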
Snapshots
Learning how Snapshots work, and how to recover from them is important. Either using CM or directly on the command-line HDFS tool:
Apache Hadoop 3.0.0 – HDFS Snapshots
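From the command line, the basic flow is roughly as follows (directory and snapshot names are just examples):
# Allow snapshots on a directory (this can also be done through CM)
hdfs dfsadmin -allowSnapshot /data
# Create a named snapshot
hdfs dfs -createSnapshot /data before-change
# Recover a file by copying it back out of the read-only .snapshot dir
hdfs dfs -cp /data/.snapshot/before-change/important.csv /data/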
Decommission DataNodes
Decommissioning DataNodes is a simple task, but its implications are important:
Decommissioning and Recommissioning Hosts | 5.2.x | Cloudera Documentation
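CM wraps this in the host and role decommission actions, but it helps to know the manual HDFS flow underneath; roughly (the excludes file path here depends on your dfs.hosts.exclude setting):
# Add the DataNode's hostname to the excludes file referenced by dfs.hosts.exclude
echo 'datanode3.example.com' >> /etc/hadoop/conf/dfs.exclude
# Tell the NameNode to re-read the hosts files and start decommissioning
hdfs dfsadmin -refreshNodes
# Watch the decommissioning progress
hdfs dfsadmin -report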
Disk Alerts
In order to set up disk alerts, you need to configure the Alert Publisher:
Configuring Alert Email Delivery | 6.3.x | Cloudera Documentation
And enable alerts in the HDFS config, along with a possible change to the Free Space Threshold setting.
HDFS Balancer
In order to balance the HDFS cluster, you can use the Balancer role action through CM. You can also change the balancing threshold, which is 10% by default. The ‘Maximum Concurrent Moves’ property sets the maximum number of threads used by the DataNode Balancer. The balancer can be run at a specific bandwidth:
hdfs dfsadmin -setBalancerBandwidth <bandwidth in bytes per second>
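Put together, a terminal run at a limited bandwidth might look like this (10 MB/s and a 5% threshold are just example values):
# Cap balancing traffic at ~10 MB/s per DataNode
hdfs dfsadmin -setBalancerBandwidth 10485760
# Run the balancer until every DataNode is within 5% of the cluster average
hdfs balancer -threshold 5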
Hive/Impala Proxy
The following sections show steps and information required to set up a Proxy for Hive and Impala:
HS2
- HiveServer2 instances run on port 10000 by default
- The HS2 instances and HAProxy should be on different servers
- yum install haproxy
- vi /etc/haproxy/haproxy.cfg
listen hiveserver2 :10000
    mode tcp
    option tcplog
    balance leastconn
    server server1 master:10000
    server server2 standby:10000
- Start and enable
service haproxy start
chkconfig haproxy on
- In CM’s Hive config, provide the address of the load balancer
- Test:
beeline -u 'jdbc:hive2://server3:10000/default'
or through Hue
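One tip when fiddling with the config: HAProxy can validate the file before you (re)start the service, which beats finding out about a typo from a dead proxy:
# Check the config file for syntax errors
haproxy -c -f /etc/haproxy/haproxy.cfg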
Impala
- The Impala daemon’s default port is 21050
- HAProxy should listen on 21000
- Config:
listen impala :21000
    mode tcp
    option tcplog
    balance leastconn
    server server1 master:21050
    server server2 standby:21050
    server server3 slave1:21050
- In CM’s Impala config, provide the address of the load balancer
- Test:
beeline -u jdbc:impala://server3:21000
or through Hue
OS Level Config
Disable SELinux by editing /etc/sysconfig/selinux and setting:
SELINUX=disabled
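The file change only takes effect after a reboot; to drop into permissive mode immediately and confirm it, you can use:
# Switch SELinux to permissive mode on the running system
setenforce 0
# Verify the current mode
getenforce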
Set vm.swappiness to 10 or less:
sysctl vm.swappiness=10
sysctl -p
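Note that sysctl vm.swappiness=10 only changes the running value, and sysctl -p reloads /etc/sysctl.conf; to survive a reboot the setting also needs to be persisted there, for example:
echo 'vm.swappiness = 10' >> /etc/sysctl.conf
sysctl -p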
Verify hostnames, editing /etc/hosts if there is no suitable DNS resolution.
Additional: set noatime, so the filesystem doesn’t update file access times on every read, by editing the /etc/fstab file and adding the noatime option:
/dev/sda2 /hadoop ext3 noatime,defaults 0 0
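To apply the new mount option without a reboot, you can remount the filesystem (mount point as in the fstab line above):
mount -o remount /hadoop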
Disable Transparent Huge Pages (THP) by editing /etc/rc.local and inserting:
echo 'never' > /sys/kernel/mm/redhat_transparent_hugepage/defrag
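Depending on the OS version, the enabled flag usually needs the same treatment (and on non-RHEL kernels the path is /sys/kernel/mm/transparent_hugepage/ instead):
echo 'never' > /sys/kernel/mm/redhat_transparent_hugepage/enabled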
CDH Repo
Steps to create a local CDH Repository:
RPMs
- Download the repo file onto the machine:
wget https://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo
- Move cloudera-cdh5.repo to /etc/yum.repos.d
- Install a webserver (yum install httpd) and start the service
- Install yum-utils and createrepo
- Fetch the repo:
reposync -r cloudera-cdh5
- Copy the RPMs to /var/www/html/cdh/5/rpms/
- Inside /var/www/html/cdh/5/ run createrepo
- Edit the repo file you downloaded first, replacing baseurl with http://servername/cdh/5/
- Distribute the file
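For reference, the distributed repo file ends up looking something like this (the server name is a placeholder):
[cloudera-cdh5]
name=Local Cloudera CDH5 repository
baseurl=http://servername/cdh/5/
gpgcheck=0
enabled=1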
Parcels
- Install a webserver (yum install httpd) and start the service
- Create the repo directory and download the parcels:
sudo mkdir -p /var/www/html/cloudera-repos
sudo wget --recursive --no-parent --no-host-directories https://archive.cloudera.com/cdh5/parcels/5.14.4/ -P /var/www/html/cloudera-repos
sudo wget --recursive --no-parent --no-host-directories https://archive.cloudera.com/gplextras5/parcels/5.14.4/ -P /var/www/html/cloudera-repos
- Fix the permissions:
sudo chmod -R ugo+rX /var/www/html/cloudera-repos/cdh5
sudo chmod -R ugo+rX /var/www/html/cloudera-repos/gplextras5
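After that, point Cloudera Manager at your mirror by setting the ‘Remote Parcel Repository URLs’ property in the Parcels configuration to something like http://servername/cloudera-repos/cdh5/parcels/5.14.4/.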
Checkpoints
About Checkpoints:
Backing Up and Restoring NameNode Metadata
Forcing a Checkpoint:
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave
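The safemode steps aren’t optional: saveNamespace refuses to run unless the NameNode is in safe mode, since it needs a quiesced filesystem to write a consistent fsimage.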
HDFS High Availability
About enabling NameNode High Availability: