Latest CCA175 Free Dumps - Cloudera CCA Spark and Hadoop Developer

Problem Scenario 1:
You have been given a MySQL DB with the following details.
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish the following activities.
1. Connect to the MySQL DB and check the content of the tables.
2. Copy the "retail_db.categories" table to HDFS without specifying a directory name.
3. Copy the "retail_db.categories" table to HDFS in a directory named "categories_target".
4. Copy the "retail_db.categories" table to HDFS in a warehouse directory (the solution uses "categories_warehouse").
See the explanation for Step by Step Solution and configuration.
Solution :
Step 1 : Connect to the existing MySQL database.
mysql --user=retail_dba --password=cloudera retail_db
Step 2 : Show all the available tables.
show tables;
Step 3 : View/count data from a table in MySQL.
select count(1) from categories;
Step 4 : Check the currently available data in the HDFS directory.
hdfs dfs -ls
Step 5 : Import a single table (without specifying a directory).
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories
Note : Make sure there is no space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from the RDBMS to HDFS.
Step 6 : Read the data from one of the partition files created by the above command.
hdfs dfs -cat categories/part-m-00000
Step 7 : Specify the target directory in the import command (we use one mapper here; change it as needed).
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --target-dir=categories_target -m 1
Step 8 : Check the content of one of the partition files.
hdfs dfs -cat categories_target/part-m-00000
Step 9 : Specify a parent (warehouse) directory so that more than one table can be copied under the same directory.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_warehouse -m 1
Problem Scenario 37 : ABCTECH.com has run a survey on their Exam Products feedback using a web-based form, with the following free-text fields as input in the web UI.
Name: String
Subscription Date: String
Rating : String
The survey data has been saved in a file called spark9/feedback.txt
Christopher|Jan 11, 2015|5
Kapil|11 Jan, 2015|5
Write a Spark program using regular expressions that checks each record for a valid date and saves the results in two separate files (good records and bad records).
See the explanation for Step by Step Solution and configuration.
Solution :
Step 1 : Create a file first using Hue in hdfs.
Step 2 : Write the regular expressions that check whether a record has a valid date.
val reg1 = """(\d+)\s(\w{3})(,)\s(\d{4})""".r // 11 Jan, 2015
val reg2 = """(\d+)(\/)(\d+)(\/)(\d{4})""".r // 6/17/2014
val reg3 = """(\d+)(-)(\d+)(-)(\d{4})""".r // 22-08-2013
val reg4 = """(\w{3})\s(\d+)(,)\s(\d{4})""".r // Jan 11, 2015
Step 3 : Load the file as an RDD.
val feedbackRDD = sc.textFile("spark9/feedback.txt")
Step 4 : The data is pipe separated, so split each line on '|'.
val feedbackSplit = feedbackRDD.map(line => line.split('|'))
Step 5 : Now get the valid records as well as the bad records.
val validRecords = feedbackSplit.filter(x =>
(reg1.pattern.matcher(x(1).trim).matches || reg2.pattern.matcher(x(1).trim).matches || reg3.pattern.matcher(x(1).trim).matches || reg4.pattern.matcher(x(1).trim).matches))
val badRecords = feedbackSplit.filter(x =>
!(reg1.pattern.matcher(x(1).trim).matches || reg2.pattern.matcher(x(1).trim).matches || reg3.pattern.matcher(x(1).trim).matches || reg4.pattern.matcher(x(1).trim).matches))
Step 6 : Now convert each Array to a tuple, which is written out in its string form.
val valid = validRecords.map(e => (e(0),e(1),e(2)))
val bad = badRecords.map(e => (e(0),e(1),e(2)))
Step 7 : Save the output as a text file; the output must be written in a single file.
valid.repartition(1).saveAsTextFile("spark9/good.txt")
bad.repartition(1).saveAsTextFile("spark9/bad.txt")
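The matching logic of Steps 2 to 5 can be tried without a cluster. The following is a minimal sketch using plain Scala collections instead of an RDD; the names datePatterns, isValidDate and sampleLines are illustrative, and the third sample line is a made-up record added only to show a rejection.

```scala
// Plain-Scala sketch of Steps 2-5: no SparkContext needed.
// The four patterns mirror reg1..reg4 from the solution.
val datePatterns = List(
  """(\d+)\s(\w{3})(,)\s(\d{4})""".r, // 11 Jan, 2015
  """(\d+)(\/)(\d+)(\/)(\d{4})""".r,  // 6/17/2014
  """(\d+)(-)(\d+)(-)(\d{4})""".r,    // 22-08-2013
  """(\w{3})\s(\d+)(,)\s(\d{4})""".r  // Jan 11, 2015
)

// True when the whole (trimmed) field matches any accepted format.
def isValidDate(field: String): Boolean =
  datePatterns.exists(_.pattern.matcher(field.trim).matches)

// Hypothetical sample lines; the third one is invented to show a rejected record.
val sampleLines = List(
  "Christopher|Jan 11, 2015|5",
  "Kapil|11 Jan, 2015|5",
  "Sample|2015 Jan 11|4"
)

// partition splits the records in one pass, like the two filter calls in Step 5.
val (good, bad) = sampleLines.map(_.split('|')).partition(fields => isValidDate(fields(1)))

good.foreach(f => println("good: " + f.mkString("|")))
bad.foreach(f => println("bad: " + f.mkString("|")))
```

Note that \w{3} accepts any three word characters, so these patterns check the shape of the date, not whether it is a real calendar date.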
Problem Scenario 42 : You have been given a file (spark10/sales.txt), with the content as given below.
You want to produce the output as a CSV grouped by Department, Designation, State, with additional columns for sum(costToCompany) and TotalEmployeeCount.
Should get result like
See the explanation for Step by Step Solution and configuration.
Solution :
Step 1 : Create a file first using Hue in hdfs.
Step 2 : Load the file as an RDD.
val rawlines = sc.textFile("spark10/sales.txt")
Step 3 : Create a case class which can represent its column fields.
case class Employee(dep: String, des: String, cost: Double, state: String)
Step 4 : Split the data and create an RDD of Employee objects.
val employees = rawlines.map(_.split(",")).map(row => Employee(row(0), row(1), row(2).toDouble, row(3)))
Step 5 : Create a row as needed: all group-by fields as the key, and as the value a count of 1 for each employee together with its cost.
val keyVals = employees.map(em => ((em.dep, em.des, em.state), (1, em.cost)))
Step 6 : Group all the records using the reduceByKey method, since we want the sum of both the number of employees and their total cost.
val results = keyVals.reduceByKey { (a, b) => (a._1 + b._1, a._2 + b._2) } // (a.count + b.count, a.cost + b.cost)
Step 7 : Save the results in a text file as below.
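The original save command is not shown here. As an illustration of the grouping in Steps 3 to 6 only, here is a minimal sketch with plain Scala collections, where reduceByKey is emulated by groupBy followed by reduce. The input rows and the csvLines formatting are assumptions made for the example, not the content of sales.txt.

```scala
// Plain-Scala sketch of Steps 3-6; reduceByKey is emulated with groupBy + reduce.
case class Employee(dep: String, des: String, cost: Double, state: String)

// Hypothetical rows in the Department,Designation,costToCompany,State layout.
val rawLines = List(
  "Sales,Trainee,12000,UP",
  "Sales,Lead,32000,AP",
  "Sales,Lead,32000,LA",
  "Sales,Lead,32000,AP",
  "Marketing,Associate,18000,TN"
)

val employees = rawLines.map(_.split(",")).map(r => Employee(r(0), r(1), r(2).toDouble, r(3)))

// Key = all group-by fields, value = (employee count, cost).
val keyVals = employees.map(em => ((em.dep, em.des, em.state), (1, em.cost)))

// Same merge function as in Step 6: add counts and add costs per key.
val results = keyVals
  .groupBy(_._1)
  .map { case (key, pairs) =>
    key -> pairs.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
  }

// One CSV line per group: Department,Designation,State,sum(costToCompany),TotalEmployeeCount
val csvLines = results.toList
  .map { case ((dep, des, state), (count, cost)) => s"$dep,$des,$state,$cost,$count" }
  .sorted

csvLines.foreach(println)
```

On a real RDD the same key/value shape lets reduceByKey combine values on each partition before shuffling, which is why the solution prefers it over grouping first.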
Problem Scenario 52 : You have been given below code snippet.
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
Write a correct code snippet for Operation_xyz which will produce below output.
scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)
See the explanation for Step by Step Solution and configuration.
Solution :
b.countByValue
countByValue returns a map that contains all unique values of the RDD and their respective occurrence counts. (Warning: this operation finally aggregates the information in a single reducer.)
Listing Variants
def countByValue(): Map[T, Long]
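To see where the counts come from, the same result can be reproduced on a local list. This is a minimal sketch with plain Scala collections, not the Spark implementation; the names values and counts are illustrative.

```scala
// Plain-Scala equivalent of countByValue on the list from the problem.
val values = List(1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1)

// Group equal elements, then count each group as a Long (the value type countByValue returns).
val counts: Map[Int, Long] =
  values.groupBy(identity).map { case (value, occurrences) => value -> occurrences.size.toLong }

// Sort by key for a stable printout; map iteration order is otherwise unspecified.
counts.toList.sortBy(_._1).foreach { case (value, n) => println(s"$value -> $n") }
```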
Problem Scenario 5 : You have been given following mysql database details.
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish following activities.
1. List all the tables using sqoop command from retail_db
2. Write simple sqoop eval command to check whether you have permission to read database tables or not.
3. Import all the tables as avro files in /user/hive/warehouse/retail_cca174.db
4 . Import departments table as a text file in /user/cloudera/departments.
See the explanation for Step by Step Solution and configuration.
Step 1 : List tables using sqoop.
sqoop list-tables --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera
Step 2 : Eval command; just run a count query on one of the tables.
sqoop eval \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username retail_dba \
--password cloudera \
--query "select count(1) from order_items"
Step 3 : Import all the tables as avro files.
sqoop import-all-tables \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--as-avrodatafile \
--warehouse-dir=/user/hive/warehouse/retail_stage.db
Step 4 : Import the departments table as a text file in /user/cloudera/departments.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table departments \
--as-textfile \
--target-dir=/user/cloudera/departments
Step 5 : Verify the imported data.
hdfs dfs -ls /user/cloudera/departments
hdfs dfs -ls /user/hive/warehouse/retail_stage.db
hdfs dfs -ls /user/hive/warehouse/retail_stage.db/products
Problem Scenario 29 : Please accomplish the following exercises using HDFS command line options.
1. Create a directory in hdfs named hdfs_commands.
2. Create a file in hdfs named data.txt in hdfs_commands.
3. Now copy this data.txt file on local filesystem, however while copying file please make sure file properties are not changed e.g. file permissions.
4. Now create a file in local directory named data_local.txt and move this file to hdfs in hdfs_commands directory.
5. Create a file data_hdfs.txt in hdfs_commands directory and copy it to local file system.
6. Create a file in local filesystem named file1.txt and put it to hdfs
See the explanation for Step by Step Solution and configuration.
Solution :
Step 1 : Create the directory.
hdfs dfs -mkdir hdfs_commands
Step 2 : Create a file in hdfs named data.txt in hdfs_commands.
hdfs dfs -touchz hdfs_commands/data.txt
Step 3 : Now copy this data.txt file to the local filesystem; while copying, make sure file properties such as file permissions are not changed.
hdfs dfs -copyToLocal -p hdfs_commands/data.txt /home/cloudera/Desktop/HadoopExam
Step 4 : Now create a file in the local directory named data_local.txt and move this file to hdfs in the hdfs_commands directory.
touch data_local.txt
hdfs dfs -moveFromLocal /home/cloudera/Desktop/HadoopExam/data_local.txt hdfs_commands/
Step 5 : Create a file data_hdfs.txt in the hdfs_commands directory and copy it to the local file system.
hdfs dfs -touchz hdfs_commands/data_hdfs.txt
hdfs dfs -get hdfs_commands/data_hdfs.txt /home/cloudera/Desktop/HadoopExam/
Step 6 : Create a file in the local filesystem named file1.txt and put it to hdfs.
touch file1.txt
hdfs dfs -put /home/cloudera/Desktop/HadoopExam/file1.txt hdfs_commands/
Problem Scenario 93 : You have to run your Spark application locally with 8 threads, that is, locally on 8 cores. Replace XXX with the correct values.
spark-submit --class com.hadoopexam.MyTask XXX \
--deploy-mode cluster \
$SPARK_HOME/lib/hadoopexam.jar 10
See the explanation for Step by Step Solution and configuration.
XXX : --master local[8]
Notes : The master URL passed to Spark can be in one of the following formats:
Master URL - Meaning
local - Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K] - Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*] - Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT - Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
mesos://HOST:PORT - Connect to the given Mesos cluster. The port must be whichever one your Mesos master is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.
yarn - Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
Problem Scenario 47 : You have been given the code snippet below, with intermediate output.
val z = sc.parallelize(List(1,2,3,4,5,6), 2)
// let's first print out the contents of the RDD with partition labels
def myfunc(index: Int, iter: Iterator[(Int)]): Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
z.mapPartitionsWithIndex(myfunc).collect
// The output could differ in each run; while solving the problem, assume only the output below.
res28: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3], [partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])
Now apply the aggregate method on RDD z with two reduce functions: the first selects the max value in each partition and the second adds up the maximum values from all partitions.
Initialize the aggregate with the value 5; hence the expected output will be 16.
z.aggregate(5)(math.max(_, _), _ + _)
The initial value 5 takes part in every partition and once more in the final combine: partition 0 gives max(5,1,2,3) = 5, partition 1 gives max(5,4,5,6) = 6, and 5 + 5 + 6 = 16.
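The way the zero value enters the result can be checked locally. The following is a minimal sketch that simulates aggregate with plain Scala folds, assuming the two partitions hold (1,2,3) and (4,5,6) as in the printed output; the names partitions, zeroValue, partitionMaxima and total are illustrative.

```scala
// Plain-Scala simulation of z.aggregate(5)(math.max(_, _), _ + _).
// Assumption: the two partitions hold (1,2,3) and (4,5,6), as in the printed output.
val partitions = List(List(1, 2, 3), List(4, 5, 6))
val zeroValue = 5

// seqOp: within each partition, fold from the zero value using max.
val partitionMaxima = partitions.map(_.foldLeft(zeroValue)((acc, x) => math.max(acc, x)))

// combOp: merge the per-partition results, again starting from the zero value.
val total = partitionMaxima.foldLeft(zeroValue)(_ + _)

println(partitionMaxima) // List(5, 6)
println(total)           // 16
```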
Problem Scenario 17 : You have been given following mysql database details as well as other info.
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish below assignment.
1. Create a table in hive as below.
create table departments_hive01(department_id int, department_name string, avg_salary int);
2. Create another table in mysql using the statement below.
CREATE TABLE IF NOT EXISTS departments_hive01(id int, department_name varchar(45), avg_salary int);
3. Copy all the data from the departments table to departments_hive01 using
insert into departments_hive01 select a.*, null from departments a;
Also insert the following records.
insert into departments_hive01 values(777, "Not known",1000);
insert into departments_hive01 values(8888, null,1000);
insert into departments_hive01 values(666, null,1100);
4. Now import data from the mysql table departments_hive01 into this hive table. Make sure the data is visible using the hive command below. Also, while importing, if a null value is found for the department_name column replace it with "" (empty string), and for the id column with -999.
select * from departments_hive01;
See the explanation for Step by Step Solution and configuration.
Solution :
Step 1 : Create the hive table as below.
show tables;
create table departments_hive01(department_id int, department_name string, avg_salary int);
Step 2 : Create the table in the mysql db as well.
mysql --user=retail_dba --password=cloudera
use retail_db
CREATE TABLE IF NOT EXISTS departments_hive01(id int, department_name varchar(45), avg_salary int);
show tables;
Step 3 : Insert data in the mysql table.
insert into departments_hive01 select a.*, null from departments a;
Check the inserted data.
select * from departments_hive01;
Now insert the null records as given in the problem.
insert into departments_hive01 values(777, "Not known",1000);
insert into departments_hive01 values(8888, null,1000);
insert into departments_hive01 values(666, null,1100);
Step 4 : Now import the data into hive as per the requirement.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table departments_hive01 \
--hive-home /user/hive/warehouse \
--hive-import \
--hive-overwrite \
--hive-table departments_hive01 \
--fields-terminated-by '\001' \
--null-string "" \
--null-non-string -999 \
--split-by id \
-m 1
Step 5 : Check the data in the directory.
hdfs dfs -ls /user/hive/warehouse/departments_hive01
hdfs dfs -cat /user/hive/warehouse/departments_hive01/part*
Check the data in the hive table.
select * from departments_hive01;
Problem Scenario 58 : You have been given the code snippet below.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)
operation1
Write a correct code snippet for operation1 which will produce the desired output, shown below.
Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, eagle)))
See the explanation for Step by Step Solution and configuration.
Solution :
b.groupByKey.collect
groupByKey [Pair]
Very similar to groupBy, but instead of supplying a function, the key component of each pair is automatically presented to the partitioner.
Listing Variants
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
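The effect of keyBy followed by groupByKey can be reproduced on a local list. This is a minimal sketch with plain Scala collections rather than an RDD, so there is no partitioner involved; the names animals, keyed and grouped are illustrative.

```scala
// Plain-Scala sketch of keyBy(_.length) followed by groupByKey.
val animals = List("dog", "tiger", "lion", "cat", "spider", "eagle")

// keyBy(_.length): pair every word with its length.
val keyed = animals.map(word => (word.length, word))

// groupByKey: collect the values that share a key.
val grouped: Map[Int, List[String]] =
  keyed.groupBy(_._1).map { case (len, pairs) => len -> pairs.map(_._2) }

// Sort by key for a stable printout; map iteration order is otherwise unspecified.
grouped.toList.sortBy(_._1).foreach(println)
```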
