This article is about my experience with the Cloudera Administrator Exam and how to study for it. I took the exam before they reworked the platform, but the exam requirements remain the same. You can find more information from Cloudera on the exam here.
If you want to know more about how and what to study for the exam you can skip to the Studying for the Exam section of the article.
If you have any questions or suggestions, feel free to leave a comment.
Advised Previous Experience
This exam expects you to have previous Linux SysAdmin knowledge, and some experience administering the Cloudera Stack. If you’ve never worked with Cloudera before, or don’t have a modicum of experience using a terminal and solving administrative issues, this is not going to be easy for you :D
If you’re comfortable around the Cloudera Manager, and have debugged problems with JDK versions on cluster nodes (as an example), you’re probably going to do just fine on the exam. It’s not terribly complicated, but it is tricky. In the studying sections of this article I talk a bit about my experience preparing for the actual exam, and share some views that might give you a better glimpse of what you’re going to face.
About Taking the Exam
On the Cloudera Certification FAQ you’ll find details of what you need to ensure in order to take the exam.
You really do have to follow the requirements for your desk, the room you’re taking the exam in, and noise to the letter. I had to switch rooms before I started the exam because “there was too much” on my desk and walls. Also, they don’t really tell you this anywhere, but you’re not supposed to wear headphones, and they will ask you to take them off. Make sure your camera and mic work, and if you’re on a laptop, be ready to move it around to show the examiner your room.
The platform at the time was pretty slow, and it will stress you out if you don’t take deep breaths. I think they mostly fixed those delay issues with the new platform they introduced, but if they didn’t... well, be ready.
Also, you can view some Cloudera and Apache documentation during the exam, so knowing your way around the documentation and how to get to the relevant info is important. I didn’t have much luck navigating it, since the platform was slow and it just ate up precious time. Be ready for it to fail on you, and know how to reach the pages you need if you really have to.
The exam duration is 120 minutes, and it’s tight. If you have doubts on a few questions, or you’re struggling with a particular one, you will hit the time limit. If you’re too stuck on a question, move on to the next one and come back when you’ve cleared the rest.
Studying for the Exam
In this case, and in almost all the others, Google is your friend :), so searching for feedback on the exam, answers to your questions and diving into the Cloudera and Apache Documentation is something you should be doing often.
Along those lines, I found that Kannan over at Hadoop and Cloud has cool and detailed information on each exam requirement.
The video that was previously showcased on the Cloudera exam page was pretty interesting and gave neat insights into how the questions were structured, so check it out if you haven’t.
Since this is a hands-on exam, you can’t get away with knowing how to do things in theory. The Cloudera Hadoop Environment has a lot of ins and outs that you’ll only know about by doing. My advice would be for you to make a checklist of items based on:
- The exam requirements;
- What you need to learn how to do by heart;
- What you don’t understand completely;
- Your experience.
And actually get to it.
Getting your Hands Dirty
So, if you’ve never set up a Cloudera cluster before, this should be a great excuse to start.
Set up a few VMs locally (or in the cloud), or give it a go with containers, and get at it!
Install a cluster with HDFS, Hive, Hue, Kafka, Kudu, Impala and all their required dependencies. You can also try and set up a local CDH repository and take one item out of the list.
If you’ve set one up before, or you have a test cluster lying around, get ready to give it some use, potentially break it, and hopefully fix it.
If you break the cluster in a way that leaves most of the services unusable, say bricking HDFS, you’re pretty much done on the exam, so double-check the state of your cluster after every change.
My Knowledge Checklist
So, during my study I composed this checklist, based on the exam requirements, and other information I could gather:
- HDFS
- Setting up High Availability
- How to make and restore a Checkpoint
- How to make and restore Snapshots
- How to change block size
- What is the Sticky Bit
- How to use and change ACLs
- The fsck and dfsadmin commands
- Load Balancing with a Specific Bandwidth (Balancer role and terminal commands)
- How does HttpFS work
- Changing and Setting up Trash
- Compression
- Cloudera Manager
- Set up Alerting
- OS Level Configurations required
- Make and use Host Templates
- Make a CDH Repo
- Make and use Role Groups
- Set up and use Racks
- Other Services
- Set up a proxy for Hive/Impala
- Install and perform a Sqoop from MySQL to Hive
- Look at Flume, and try to use it.
- Hue/Sentry authorization/authentication
- Look at Yarn Fair Scheduler
- Change Yarn/Impala pools
You should also look into the many settings you can access through the Cloudera Manager; you might be asked to change some of them for a number of different reasons.
In case you hadn’t noticed yet, the CM is going to be a major part of the exam, and knowing how to use it properly will make a major difference.
Based on this checklist and my study, I also wrote the notes available below. For my exam I used CDH 5.14, but the new exam uses CDH 6, so some changes might be needed.
Change HDFS Block Size
In order to change the HDFS block size, you need to make a new copy of the existing file:
hdfs dfs -D dfs.block.size=134217728 -cp /origin /dest
Note that the block size is in bytes.
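If you want to double-check that the copy actually picked up the new block size, hdfs dfs -stat can print it (the file path here is just an example):
hdfs dfs -stat "%o" /dest/file.txt
# %o prints the block size of the file in bytes, e.g. 134217728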
Sqoop
Working with Sqoop is a little out of the scope of this article, so here’s a link to a walkthrough on how to import data into Hive from MySQL:
Sqoop: Import Data From MySQL to Hive - DZone Big Data
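For a rough idea of the shape of such an import, a minimal command looks something like this (host, database, credentials and table names are made up for illustration):
# Import a MySQL table straight into Hive; -P prompts for the password
sqoop import --connect jdbc:mysql://dbhost/shop --username dbuser -P --table customers --hive-import --hive-table customers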
Snapshots
Learning how Snapshots work, and how to recover from them is important. Either using CM or directly on the command-line HDFS tool:
Apache Hadoop 3.0.0 – HDFS Snapshots
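From the command line, the basic flow is roughly as follows (directory and snapshot names are just examples):
# Allow snapshots on a directory (this can also be done through CM)
hdfs dfsadmin -allowSnapshot /data
# Create a named snapshot
hdfs dfs -createSnapshot /data before-change
# Recover a file by copying it back out of the read-only .snapshot dir
hdfs dfs -cp /data/.snapshot/before-change/important.csv /data/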
Decommission DataNodes
Decommissioning DataNodes is a simple task, but its implications are important:
Decommissioning and Recommissioning Hosts | 5.2.x | Cloudera Documentation
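CM wraps this in the host and role decommission actions, but it helps to know the manual HDFS flow underneath; roughly (the excludes file path here depends on your dfs.hosts.exclude setting):
# Add the DataNode's hostname to the excludes file referenced by dfs.hosts.exclude
echo 'datanode3.example.com' >> /etc/hadoop/conf/dfs.exclude
# Tell the NameNode to re-read the hosts files and start decommissioning
hdfs dfsadmin -refreshNodes
# Watch the decommissioning progress
hdfs dfsadmin -report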
Disk Alerts
In order to set up disk alerts, you need to configure the Alert Publisher:
Configuring Alert Email Delivery | 6.3.x | Cloudera Documentation
And enable alerts in the HDFS config, along with a possible change to the Free Space Threshold setting.
HDFS Balancer
In order to balance the HDFS cluster, you can use the Balancer role action through CM. You can also change the balancing threshold, which is 10% by default. The ‘Maximum Concurrent Moves’ property sets the maximum number of threads used by the DataNode Balancer. The balancer can be run at a specific bandwidth:
hdfs dfsadmin -setBalancerBandwidth <bandwidth in bytes per second>
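Put together, a terminal run at a limited bandwidth might look like this (10 MB/s and a 5% threshold are just example values):
# Cap balancing traffic at ~10 MB/s per DataNode
hdfs dfsadmin -setBalancerBandwidth 10485760
# Run the balancer until every DataNode is within 5% of the cluster average
hdfs balancer -threshold 5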
Hive/Impala Proxy
The following sections show steps and information required to set up a Proxy for Hive and Impala:
HS2
- HiveServer2 instances run on port 10000 by default
- The HS2 instances and HAProxy should be on different servers
- yum install haproxy
- vi /etc/haproxy/haproxy.cfg
listen hiveserver2 :10000
    mode tcp
    option tcplog
    balance leastconn
    server server1 master:10000
    server server2 standby:10000
- Start and enable
service haproxy start
chkconfig haproxy on
- In CM’s Hive config, provide the address of the load balancer
- Test:
beeline -u 'jdbc:hive2://server3:10000/default'
or through Hue
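One tip when fiddling with the config: HAProxy can validate the file before you (re)start the service, which beats finding out about a typo from a dead proxy:
# Check the config file for syntax errors
haproxy -c -f /etc/haproxy/haproxy.cfg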
Impala
- The Impala daemon’s default port is 21050
- HAProxy should listen on 21000
- Config:
listen impala :21000
    mode tcp
    option tcplog
    balance leastconn
    server server1 master:21050
    server server2 standby:21050
    server server3 slave1:21050
- In CM’s Impala config, provide the address of the load balancer
- Test:
beeline -u jdbc:impala://server3:21000
or through Hue
OS Level Config
Disable SELinux by editing /etc/sysconfig/selinux and setting:
SELINUX=disabled
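The file change only takes effect after a reboot; to drop into permissive mode immediately and confirm it, you can use:
# Switch SELinux to permissive mode on the running system
setenforce 0
# Verify the current mode
getenforce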
Set vm.swappiness to 10 or less:
sysctl vm.swappiness=10
sysctl -p
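Note that sysctl vm.swappiness=10 only changes the running value, and sysctl -p reloads /etc/sysctl.conf; to survive a reboot the setting also needs to be persisted there, for example:
echo 'vm.swappiness = 10' >> /etc/sysctl.conf
sysctl -p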
Verify hostnames, editing /etc/hosts if there is no suitable DNS resolution.
Additional: set noatime, so the filesystem doesn’t update file access times on every read, by editing the /etc/fstab file and adding the noatime option:
/dev/sda2 /hadoop ext3 noatime,defaults 0 0
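To apply the new mount option without a reboot, you can remount the filesystem (mount point as in the fstab line above):
mount -o remount /hadoop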
Disable Transparent Huge Pages (THP) by editing /etc/rc.local and inserting:
echo 'never' > /sys/kernel/mm/redhat_transparent_hugepage/defrag
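Depending on the OS version, the enabled flag usually needs the same treatment (and on non-RHEL kernels the path is /sys/kernel/mm/transparent_hugepage/ instead):
echo 'never' > /sys/kernel/mm/redhat_transparent_hugepage/enabled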
CDH Repo
Steps to create a local CDH Repository:
RPMs
- Download the repo file onto the machine:
wget https://archive.cloudera.com/cdh5/redhat/6/x86_64/cdh/cloudera-cdh5.repo
- Move cloudera-cdh5.repo to /etc/yum.repos.d
- Install a webserver (yum install httpd) and start the service
- Install yum-utils and createrepo
- Fetch the repo:
reposync -r cloudera-cdh5
- Copy the RPMs to /var/www/html/cdh/5/rpms/
- Inside /var/www/html/cdh/5/ run createrepo
- Edit the repo file you downloaded first, replacing baseurl with http://servername/cdh/5/
- Distribute the file
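For reference, the distributed repo file ends up looking something like this (the server name is a placeholder):
[cloudera-cdh5]
name=Local Cloudera CDH5 repository
baseurl=http://servername/cdh/5/
gpgcheck=0
enabled=1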
Parcels
- Install a webserver (yum install httpd) and start the service
- Create the repo directory and download the parcels:
sudo mkdir -p /var/www/html/cloudera-repos
sudo wget --recursive --no-parent --no-host-directories https://archive.cloudera.com/cdh5/parcels/5.14.4/ -P /var/www/html/cloudera-repos
sudo wget --recursive --no-parent --no-host-directories https://archive.cloudera.com/gplextras5/parcels/5.14.4/ -P /var/www/html/cloudera-repos
- Fix the permissions:
sudo chmod -R ugo+rX /var/www/html/cloudera-repos/cdh5
sudo chmod -R ugo+rX /var/www/html/cloudera-repos/gplextras5
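After that, point Cloudera Manager at your mirror by setting the ‘Remote Parcel Repository URLs’ property in the Parcels configuration to something like http://servername/cloudera-repos/cdh5/parcels/5.14.4/.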
Checkpoints
About Checkpoints:
Backing Up and Restoring NameNode Metadata
Forcing a Checkpoint:
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave
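The safemode steps aren’t optional: saveNamespace refuses to run unless the NameNode is in safe mode, since it needs a quiesced filesystem to write a consistent fsimage.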
HDFS High Availability
About enabling NameNode High Availability: