Latest CCA175 Free Dumps - Cloudera CCA Spark and Hadoop Developer

CORRECT TEXT
Problem Scenario 88 : You have been given the three files below.
product.csv (Create this file in hdfs)
productID,productCode,name,quantity,price,supplierid
1001,PEN,Pen Red,5000,1.23,501
1002,PEN,Pen Blue,8000,1.25,501
1003,PEN,Pen Black,2000,1.25,501
1004,PEC,Pencil 2B,10000,0.48,502
1005,PEC,Pencil 2H,8000,0.49,502
1006,PEC,Pencil HB,0,9999.99,502
2001,PEC,Pencil 3B,500,0.52,501
2002,PEC,Pencil 4B,200,0.62,501
2003,PEC,Pencil 5B,100,0.73,501
2004,PEC,Pencil 6B,500,0.47,502
supplier.csv
supplierid,name,phone
501,ABC Traders,88881111
502,XYZ Company,88882222
503,QQ Corp,88883333
products_suppliers.csv
productID,supplierID
2001,501
2002,501
2003,501
2004,502
2001,503
Now accomplish all the queries given in the solution.
1. It is possible that the same product is supplied by multiple suppliers. Find each product and its price according to each supplier.
2. Find all the supplier names who are supplying 'Pencil 3B'.
3. Find all the products which are supplied by ABC Traders.
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
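The printed solution starts directly at the SQL queries and assumes the three CSV files have already been loaded and registered as temp tables named products, suppliers and products_suppliers. A minimal setup sketch for the spark shell follows; the case class names, the header-filtering approach and the HDFS file paths here are assumptions, not part of the original answer.
// define case classes matching the CSV columns (names assumed)
case class Product(productID: Int, productCode: String, name: String, quantity: Int, price: Float, supplierid: Int)
case class Supplier(supplierid: Int, name: String, phone: String)
case class ProductSupplier(productID: Int, supplierID: Int)
import sqlContext.implicits._
// load each file, drop its header line, and convert to a DataFrame
val productsDF = sc.textFile("product.csv").filter(!_.startsWith("productID")).map(_.split(",")).map(p => Product(p(0).toInt, p(1), p(2), p(3).toInt, p(4).toFloat, p(5).toInt)).toDF()
val suppliersDF = sc.textFile("supplier.csv").filter(!_.startsWith("supplierid")).map(_.split(",")).map(s => Supplier(s(0).toInt, s(1), s(2))).toDF()
val productsSuppliersDF = sc.textFile("products_suppliers.csv").filter(!_.startsWith("productID")).map(_.split(",")).map(ps => ProductSupplier(ps(0).toInt, ps(1).toInt)).toDF()
// register the temp tables referenced by the queries below
productsDF.registerTempTable("products")
suppliersDF.registerTempTable("suppliers")
productsSuppliersDF.registerTempTable("products_suppliers")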
Step 1 : Find each product and its price according to each supplier (the same product can be supplied by multiple suppliers).
val results = sqlContext.sql("""SELECT products.name AS `Product Name`, price, suppliers.name AS `Supplier Name`
FROM products_suppliers
JOIN products ON products_suppliers.productID = products.productID
JOIN suppliers ON products_suppliers.supplierID = suppliers.supplierID""")
results.show()
Step 2 : Find all the supplier names who are supplying 'Pencil 3B'.
val results = sqlContext.sql("""SELECT p.name AS `Product Name`, s.name AS `Supplier Name`
FROM products_suppliers AS ps
JOIN products AS p ON ps.productID = p.productID
JOIN suppliers AS s ON ps.supplierID = s.supplierID
WHERE p.name = 'Pencil 3B'""")
results.show()
Step 3 : Find all the products which are supplied by ABC Traders.
val results = sqlContext.sql("""SELECT p.name AS `Product Name`, s.name AS `Supplier Name`
FROM products AS p, products_suppliers AS ps, suppliers AS s
WHERE p.productID = ps.productID AND ps.supplierID = s.supplierID
AND s.name = 'ABC Traders'""")
results.show()
CORRECT TEXT
Problem Scenario 47 : You have been given the below code snippet, with intermediate output.
val z = sc.parallelize(List(1,2,3,4,5,6), 2)
// lets first print out the contents of the RDD with partition labels
def myfunc(index: Int, iter: Iterator[Int]): Iterator[String] = {
iter.toList.map(x => "[partID:" + index + ", val: " + x + "]").iterator
}
// In each run the output could be different; while solving the problem assume only the output below.
z.mapPartitionsWithIndex(myfunc).collect
res28: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3], [partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])
Now apply the aggregate method on RDD z with two reduce functions: the first selects the max value in each partition and the second adds up the maximum values from all partitions.
Initialize the aggregate with value 5; hence the expected output will be 16.
Answer:
z.aggregate(5)(math.max(_, _), _ + _)
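Why the result is 16, traced against the two-partition layout shown above (a short walkthrough, not part of the original answer):
// seqOp (math.max) runs per partition, starting from the zero value 5:
//   partition 0: max(max(max(5, 1), 2), 3) = 5
//   partition 1: max(max(max(5, 4), 5), 6) = 6
// combOp (_ + _) then folds the partition results into the zero value again:
//   5 + 5 + 6 = 16
z.aggregate(5)(math.max(_, _), _ + _)   // res: Int = 16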
CORRECT TEXT
Problem Scenario 73 : You have been given data in json format as below.
{"first_name":"Ankit", "last_name":"Jain"}
{"first_name":"Amir", "last_name":"Khan"}
{"first_name":"Rajesh", "last_name":"Khanna"}
{"first_name":"Priynka", "last_name":"Chopra"}
{"first_name":"Kareena", "last_name":"Kapoor"}
{"first_name":"Lokesh", "last_name":"Yadav"}
Do the following activities:
1. Create the employee.json file locally.
2. Load this file into hdfs.
3. Register this data as a temp table in Spark using Python.
4. Write a select query and print this data.
5. Now save this selected data back in json format.
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create the employee.json file locally.
vi employee.json (press insert) and paste the content.
Step 2 : Upload this file to hdfs (default location): hadoop fs -put employee.json
Step 3 : Write the spark script.
# Import SQLContext
from pyspark.sql import SQLContext
# Create instance of SQLContext
sqlContext = SQLContext(sc)
# Load json file
employee = sqlContext.jsonFile("employee.json")
# Register the DataFrame as a temp table
employee.registerTempTable("EmployeeTab")
# Select data from Employee table
employeeInfo = sqlContext.sql("select * from EmployeeTab")
# Iterate over the data and print it
for row in employeeInfo.collect():
    print(row)
Step 4 : Write the data as a text file in json format.
employeeInfo.toJSON().saveAsTextFile("employeeJson1")
Step 5 : Check whether the data has been created or not.
hadoop fs -cat employeeJson1/part*
CORRECT TEXT
Problem Scenario 36 : You have been given a file named spark8/data.csv (type,name).
data.csv
1,Lokesh
2,Bhupesh
2,Amit
2,Ratan
2,Dinesh
1,Pavan
1,Tejas
2,Sheela
1,Kumar
1,Venkat
1. Load this file from hdfs and save it back as (id, (all names of same type)) in the results directory. However, make sure that while saving it is written in a single file.
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create file in hdfs (We will do using Hue). However, you can first create in local filesystem and then upload it to hdfs.
Step 2 : Load data.csv file from hdfs and create PairRDDs
val name = sc.textFile("spark8/data.csv")
val namePairRDD = name.map(x=> (x.split(",")(0),x.split(",")(1)))
Step 3 : Now swap the namePairRDD RDD.
val swapped = namePairRDD.map(item => item.swap)
(Note: this swapped RDD is not used in the following steps; the grouping in Step 4 is applied directly to namePairRDD.)
Step 4 : Now combine the rdd by key.
val combinedOutput = namePairRDD.combineByKey(List(_), (x:List[String], y:String) => y ::
x, (x:List[String], y:List[String]) => x ::: y)
Step 5 : Save the output as a Text file and output must be written in a single file.
combinedOutput.repartition(1).saveAsTextFile("spark8/result.txt")
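An equivalent grouping can be done with groupByKey (a sketch, not part of the original answer; the output directory name is an arbitrary assumption and element order within each list is not guaranteed):
// group names by their type key and materialize each group as a List
val grouped = namePairRDD.groupByKey().mapValues(_.toList)
// write the result into a single output file
grouped.repartition(1).saveAsTextFile("spark8/result_groupByKey")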
CORRECT TEXT
Problem Scenario 95 : You have to run your Spark application on YARN, with each executor's maximum heap size set to 512MB, the number of processor cores to allocate on each executor set to 1, and your main application requiring three values as input arguments: V1 V2 V3.
Please replace XXX, YYY, ZZZ
./bin/spark-submit --class com.hadoopexam.MyTask --master yarn-cluster --num-executors 3
--driver-memory 512m XXX YYY lib/hadoopexam.jar ZZZ
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
XXX : --executor-memory 512m
YYY : --executor-cores 1
ZZZ : V1 V2 V3
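Substituting the answers into the template gives the complete command:
./bin/spark-submit --class com.hadoopexam.MyTask --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/hadoopexam.jar V1 V2 V3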
Notes : spark-submit on YARN options (option - description):
--archives : Comma-separated list of archives to be extracted into the working directory of each executor. The path must be globally visible inside your cluster; see Advanced Dependency Management.
--executor-cores : Number of processor cores to allocate on each executor. Alternatively, you can use the spark.executor.cores property.
--executor-memory : Maximum heap size to allocate to each executor. Alternatively, you can use the spark.executor.memory property.
--num-executors : Total number of YARN containers to allocate for this application. Alternatively, you can use the spark.executor.instances property.
--queue : YARN queue to submit to. For more information, see Assigning Applications and Queries to Resource Pools. Default: default.
CORRECT TEXT
Problem Scenario 27 : You need to implement a near-real-time solution for collecting information as soon as it is submitted in files, with the information below.
Data
echo "IBM,100,20160104" >> /tmp/spooldir/bb/.bb.txt
echo "IBM,103,20160105" >> /tmp/spooldir/bb/.bb.txt
mv /tmp/spooldir/bb/.bb.txt /tmp/spooldir/bb/bb.txt
After a few minutes
echo "IBM,100.2,20160104" >> /tmp/spooldir/dr/.dr.txt
echo "IBM,103.1,20160105" >> /tmp/spooldir/dr/.dr.txt
mv /tmp/spooldir/dr/.dr.txt /tmp/spooldir/dr/dr.txt
Requirements:
You have been given the below directory location (if not available then create it): /tmp/spooldir.
You have a financial subscription for getting stock prices from Bloomberg as well as
Reuters, and using ftp you download new files every hour from their respective ftp sites into the directories /tmp/spooldir/bb and /tmp/spooldir/dr respectively.
As soon as a file is committed in these directories, it needs to be available in hdfs in the
/tmp/flume/finance location, in a single directory.
Write a flume configuration file named flume7.conf and use it to load data into hdfs with the following additional properties:
1. Spool /tmp/spooldir/bb and /tmp/spooldir/dr
2. File prefix in hdfs should be events
3. File suffix should be .log
4. If a file is not committed and is in use, then it should have _ as prefix.
5. Data should be written as text to hdfs
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create the directories.
mkdir /tmp/spooldir/bb
mkdir /tmp/spooldir/dr
Step 2 : Create the flume configuration file flume7.conf, with the below configuration for agent1.
agent1.sources = source1 source2
agent1.sinks = sink1
agent1.channels = channel1
agent1.sources.source1.channels = channel1
agent1.sources.source2.channels = channel1
agent1.sinks.sink1.channel = channel1
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /tmp/spooldir/bb
agent1.sources.source2.type = spooldir
agent1.sources.source2.spoolDir = /tmp/spooldir/dr
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /tmp/flume/finance
agent1.sinks.sink1.hdfs.filePrefix = events
agent1.sinks.sink1.hdfs.fileSuffix = .log
agent1.sinks.sink1.hdfs.inUsePrefix = _
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.channels.channel1.type = file
Step 3 : Run the below command, which will use this configuration file and append data into hdfs.
Start the flume service:
flume-ng agent --conf /home/cloudera/flumeconf --conf-file /home/cloudera/flumeconf/flume7.conf --name agent1
Step 4 : Open another terminal and create the files in /tmp/spooldir/.
echo "IBM,100,20160104" >> /tmp/spooldir/bb/.bb.txt
echo "IBM,103,20160105" >> /tmp/spooldir/bb/.bb.txt
mv /tmp/spooldir/bb/.bb.txt /tmp/spooldir/bb/bb.txt
After a few minutes:
echo "IBM,100.2,20160104" >> /tmp/spooldir/dr/.dr.txt
echo "IBM,103.1,20160105" >> /tmp/spooldir/dr/.dr.txt
mv /tmp/spooldir/dr/.dr.txt /tmp/spooldir/dr/dr.txt
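To confirm the events reached HDFS, a quick check (an added verification step, not part of the original answer):
hdfs dfs -ls /tmp/flume/finance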
CORRECT TEXT
Problem Scenario 1:
You have been given a MySQL DB with the following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.categories
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish the following activities.
1. Connect to the MySQL DB and check the content of the tables.
2. Copy the "retail_db.categories" table to hdfs, without specifying a directory name.
3. Copy the "retail_db.categories" table to hdfs, in a directory named "categories_target".
4. Copy the "retail_db.categories" table to hdfs, in a warehouse directory named
"categories_warehouse".
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Connect to the existing MySQL database.
mysql --user=retail_dba --password=cloudera retail_db
Step 2 : Show all the available tables.
show tables;
Step 3 : View/count data from a table in MySQL.
select count(1) from categories;
Step 4 : Check the currently available data in the HDFS directory.
hdfs dfs -ls
Step 5 : Import a single table (without specifying the directory).
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories
Note : Please check that you don't have a space before or after the '=' sign. Sqoop uses the
MapReduce framework to copy data from the RDBMS to hdfs.
Step 6 : Read the data from one of the partitions created using the above command.
hdfs dfs -cat categories/part-m-00000
Step 7 : Specify the target directory in the import command (we are using number of mappers = 1; you can change it accordingly).
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --target-dir=categories_target -m 1
Step 8 : Check the content in one of the partition files.
hdfs dfs -cat categories_target/part-m-00000
Step 9 : Specify a parent directory so that you can copy more than one table into a specified target directory. Command to specify the warehouse directory:
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_warehouse -m 1
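Since --warehouse-dir creates a subdirectory named after the table, the imported data can be checked with (an added verification step, assuming the default layout):
hdfs dfs -cat categories_warehouse/categories/part-m-00000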
CORRECT TEXT
Problem Scenario 61 : You have been given the below code snippet.
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
operation1
Write a correct code snippet for operation1 which will produce the desired output, shown below.
Array[(Int, (String, Option[String]))] = Array((6,(salmon,Some(salmon))),
(6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), (6,(salmon,Some(salmon))),
(6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), (3,(dog,Some(dog))),
(3,(dog,Some(cat))), (3,(dog,Some(gnu))), (3,(dog,Some(bee))), (3,(rat,Some(dog))),
(3,(rat,Some(cat))), (3,(rat,Some(gnu))), (3,(rat,Some(bee))), (8,(elephant,None)))
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
b.leftOuterJoin(d).collect
leftOuterJoin [Pair] : Performs a left outer join using two key-value RDDs. Please note that the keys must be generally comparable to make this work.
keyBy : Constructs two-component tuples (key-value pairs) by applying a function to each data item. The result of the function becomes the key and the original data item becomes the value of the newly created tuple.
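Putting the given snippet and the answer together, the complete sequence in the spark shell is:
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
b.leftOuterJoin(d).collect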
CORRECT TEXT
Problem Scenario 17 : You have been given the following mysql database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish the below assignment.
1. Create a table in hive as below: create table departments_hive01(department_id int, department_name string, avg_salary int);
2. Create another table in mysql using below statement CREATE TABLE IF NOT EXISTS departments_hive01(id int, department_name varchar(45), avg_salary int);
3. Copy all the data from departments table to departments_hive01 using insert into departments_hive01 select a.*, null from departments a;
Also insert following records as below
insert into departments_hive01 values(777, "Not known",1000);
insert into departments_hive01 values(8888, null,1000);
insert into departments_hive01 values(666, null,1100);
4. Now import data from the mysql table departments_hive01 into this hive table. Please make sure that the data is visible using the below hive command. Also, while importing, if a null value is found for the department_name column replace it with "" (empty string), and for the id column with -999. select * from departments_hive01;
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create hive table as below.
hive
show tables;
create table departments_hive01(department_id int, department_name string, avg_salary int);
Step 2 : Create table in mysql db as well.
mysql --user=retail_dba --password=cloudera
use retail_db
CREATE TABLE IF NOT EXISTS departments_hive01(id int, department_name
varchar(45), avg_salary int);
show tables;
Step 3 : Insert data into the mysql table.
insert into departments_hive01 select a.*, null from departments a;
Check the inserted data:
select * from departments_hive01;
Now insert the null records as given in the problem.
insert into departments_hive01 values(777, "Not known",1000);
insert into departments_hive01 values(8888, null,1000);
insert into departments_hive01 values(666, null,1100);
Step 4 : Now import the data into hive as per the requirement.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table departments_hive01 \
--hive-home /user/hive/warehouse \
--hive-import \
--hive-overwrite \
--hive-table departments_hive01 \
--fields-terminated-by '\001' \
--null-string "" \
--null-non-string -999 \
--split-by id \
-m 1
Step 5 : Check the data in the directory.
hdfs dfs -ls /user/hive/warehouse/departments_hive01
hdfs dfs -cat /user/hive/warehouse/departments_hive01/part*
Check data in hive table.
Select * from departments_hive01;
CORRECT TEXT
Problem Scenario 90 : You have been given the two files below.
course.txt
id,course
1,Hadoop
2,Spark
3,HBase
fee.txt
id,fee
2,3900
3,4200
4,2900
Accomplish the following activities.
1. Select all the courses and their fees, whether the fee is listed or not.
2. Select all the available fees and the respective courses. If a course does not exist, still list the fee.
3. Select all the courses and their fees, whether the fee is listed or not. However, ignore records having fee as null.
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1:
hdfs dfs -mkdir sparksql4
hdfs dfs -put course.txt sparksql4/
hdfs dfs -put fee.txt sparksql4/
Step 2 : Now in spark shell
// load the data into a new RDD
val course = sc.textFile("sparksql4/course.txt")
val fee = sc.textFile("sparksql4/fee.txt")
// Return the first element in this RDD
course.first()
fee.first()
// define the schema using case classes
case class Course(id: Integer, name: String)
case class Fee(id: Integer, fee: Integer)
// create RDDs of Course and Fee objects
val courseRDD = course.map(_.split(",")).map(c => Course(c(0).toInt, c(1)))
val feeRDD = fee.map(_.split(",")).map(c => Fee(c(0).toInt, c(1).toInt))
courseRDD.first()
courseRDD.count()
feeRDD.first()
feeRDD.count()
// change the RDDs of Course and Fee objects to DataFrames
val courseDF = courseRDD.toDF()
val feeDF = feeRDD.toDF()
// register the DataFrames as temp tables
courseDF.registerTempTable("course")
feeDF.registerTempTable("fee")
// Select data from the tables
val results = sqlContext.sql("""SELECT * FROM course""")
results.show()
val results = sqlContext.sql("""SELECT * FROM fee""")
results.show()
val results = sqlContext.sql("""SELECT * FROM course LEFT JOIN fee ON course.id = fee.id""")
results.show()
val results = sqlContext.sql("""SELECT * FROM course RIGHT JOIN fee ON course.id = fee.id""")
results.show()
val results = sqlContext.sql("""SELECT * FROM course LEFT JOIN fee ON course.id = fee.id WHERE fee.id IS NOT NULL""")
results.show()
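For reference, the first of the three queries can also be expressed with the DataFrame API instead of SQL (a sketch using the same courseDF and feeDF defined above; the val name joined is an arbitrary choice):
// LEFT OUTER JOIN of courses with fees via the DataFrame API
val joined = courseDF.join(feeDF, courseDF("id") === feeDF("id"), "left_outer")
joined.show()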
