Latest CCA175 Free Dumps - Cloudera CCA Spark and Hadoop Developer

CORRECT TEXT
Problem Scenario 67 : You have been given the below code snippet.
lines = sc.parallelize(['its fun to have fun,', 'but you have to know how.'])
r1 = lines.map(lambda x: x.replace(',', ' ').replace('.', ' ').replace('-', ' ').lower())
r2 = r1.flatMap(lambda x: x.split())
r3 = r2.map(lambda x: (x, 1))
operation1
r5 = r4.map(lambda x:(x[1],x[0]))
r6 = r5.sortByKey(ascending=False)
r6.take(20)
Write a correct code snippet for operation1 which will produce the desired output, shown below.
[(2, 'fun'), (2, 'to'), (2, 'have'), (1, 'its'), (1, 'know'), (1, 'how'), (1, 'you'), (1, 'but')]
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
r4 = r3.reduceByKey(lambda x,y:x+y)
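For comparison, the same word-count pattern in the Scala shell. This is a minimal sketch only; the variable names are illustrative and not part of the scenario.
val lines = sc.parallelize(List("its fun to have fun,", "but you have to know how."))
val words = lines.map(_.replace(",", " ").replace(".", " ").toLowerCase).flatMap(_.split(" ")).filter(_.nonEmpty)
val counts = words.map(x => (x, 1)).reduceByKey(_ + _)       // plays the same role as operation1 above
counts.map(x => (x._2, x._1)).sortByKey(false).take(20)      // swap to (count, word) and sort descending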
CORRECT TEXT
Problem Scenario 40 : You have been given sample data as below in a file called spark15/file1.txt
3070811,1963,1096,,"US","CA",,1,
3022811,1963,1096,,"US","CA",,1,56
3033811,1963,1096,,"US","CA",,1,23
Below is the code snippet to process this file.
val field = sc.textFile("spark15/file1.txt")
val mapper = field.map(x=> A)
mapper.map(x => x.map(x=> {B})).collect
Please fill in A and B so it can generate the below final output.
Array(Array(3070811,1963,1096, 0, "US", "CA", 0,1, 0)
,Array(3022811,1963,1096, 0, "US", "CA", 0,1, 56)
,Array(3033811,1963,1096, 0, "US", "CA", 0,1, 23)
)
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
A. x.split(",", -1)
B. if (x.isEmpty) 0 else x
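Putting A and B together, a minimal sketch of the complete snippet. The -1 limit keeps trailing empty fields, and the inner result becomes an Array[Any] because empty fields map to the integer 0 while the other fields remain strings.
val field = sc.textFile("spark15/file1.txt")
val mapper = field.map(x => x.split(",", -1))                 // -1 keeps empty and trailing fields
mapper.map(x => x.map(x => if (x.isEmpty) 0 else x)).collect  // replace empty strings with 0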
CORRECT TEXT
Problem Scenario 71 :
Write a Spark script using Python that reads a file "Content.txt" (on HDFS) with the following content.
After that, split each row into (key, value), where the key is the first word in the line and the value is the entire line.
Filter out the empty lines.
Then save these key-value pairs in "problem86" as a sequence file (on HDFS).
Part 2 : Save as a sequence file where the key is null and the value is the entire line. Read back the stored sequence files.
Content.txt
Hello this is ABCTECH.com
This is XYZTECH.com
Apache Spark Training
This is Spark Learning Session
Spark is faster than MapReduce
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 :
# Import SparkContext and SparkConf
from pyspark import SparkContext, SparkConf
Step 2:
#load data from hdfs
contentRDD = sc.textFile("Content.txt")
Step 3:
# filter out the empty lines
nonempty_lines = contentRDD.filter(lambda x: len(x) > 0)
Step 4:
# Split each line on the first space (Remember: it is mandatory to convert the result into a tuple)
words = nonempty_lines.map(lambda x: tuple(x.split(' ', 1)))
words.saveAsSequenceFile("problem86")
Step 5: Check contents in directory problem86 hdfs dfs -cat problem86/part*
Step 6 : Create key, value pair (where key is null)
nonempty_lines.map(lambda line: (None, line)).saveAsSequenceFile("problem86_1")
Step 7 : Read back the sequence file data using Spark.
seqRDD = sc.sequenceFile("problem86_1")
Step 8 : Print the content to validate the same.
for line in seqRDD.collect():
print(line)
CORRECT TEXT
Problem Scenario 58 : You have been given the below code snippet.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)
operation1
Write a correct code snippet for operation1 which will produce the desired output, shown below.
Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, eagle)))
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
b.groupByKey.collect
groupByKey [Pair]
Very similar to groupBy, but instead of supplying a function, the key-component of each pair will automatically be presented to the partitioner.
Listing Variants
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
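A minimal spark-shell sketch assembling the full sequence on the data above (the exact buffer type printed, ArrayBuffer or CompactBuffer, depends on the Spark version):
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)     // key each word by its length
b.groupByKey.collect          // group all words that share the same length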
CORRECT TEXT
Problem Scenario 65 : You have been given the below code snippet.
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = sc.parallelize(1 to a.count.toInt, 2)
val c = a.zip(b)
operation1
Write a correct code snippet for operation1 which will produce the desired output, shown below.
Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution : c.sortByKey(false).collect
sortByKey [Ordered] : This function sorts the input RDD's data and stores it in a new RDD.
"The output RDD is a shuffled RDD because it stores data that is output by a reducer which has been shuffled. The implementation of this function is actually very clever.
First, it uses a range partitioner to partition the data in ranges within the shuffled RDD.
Then it sorts these ranges individually with mapPartitions using standard sort mechanisms.
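Putting it together, a minimal spark-shell sketch on the data above:
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = sc.parallelize(1 to a.count.toInt, 2)
val c = a.zip(b)                   // pair each word with its 1-based position
c.sortByKey(false).collect         // sort by the String key in descending order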
CORRECT TEXT
Problem Scenario 11 : You have been given the following MySQL database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish the following.
1. Import the departments table into a directory called departments.
2. Once the import is done, insert the following 5 records into the departments MySQL table.
Insert into departments(10, physics);
Insert into departments(11, Chemistry);
Insert into departments(12, Maths);
Insert into departments(13, Science);
Insert into departments(14, Engineering);
3. Now import only the newly inserted records and append them to the existing directory, which was created in the first step.
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Clean already imported data. (In the real exam, please make sure you don't delete data generated from a previous exercise.)
hadoop fs -rm -R departments
Step 2 : Import data in departments directory.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username retail_dba \
--password cloudera \
--table departments \
--target-dir /user/cloudera/departments
Step 3 : Insert the five records in departments table.
mysql --user=retail_dba --password=cloudera retail_db
Insert into departments values(10, "physics");
Insert into departments values(11, "Chemistry");
Insert into departments values(12, "Maths");
Insert into departments values(13, "Science");
Insert into departments values(14, "Engineering");
commit;
select * from departments;
Step 4 : Get the maximum value of department_id from the last import.
hdfs dfs -cat /user/cloudera/departments/part*
It should be 7.
Step 5 : Do the incremental import based on last import and append the results.
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:330G/retail_db" \
~ username=retail_dba \
-password=cloudera \
-table departments \
--target-dir /user/cloudera/departments \
-append \
-check-column "department_id" \
-incremental append \
-last-value 7
Step 6 : Now check the result.
hdfs dfs -cat /user/cloudera/departments/part*
CORRECT TEXT
Problem Scenario 87 : You have been given below three files
product.csv (Create this file in hdfs)
productID,productCode,name,quantity,price,supplierid
1001,PEN,Pen Red,5000,1.23,501
1002,PEN,Pen Blue,8000,1.25,501
1003,PEN,Pen Black,2000,1.25,501
1004,PEC,Pencil 2B,10000,0.48,502
1005,PEC,Pencil 2H,8000,0.49,502
1006,PEC,Pencil HB,0,9999.99,502
2001,PEC,Pencil 3B,500,0.52,501
2002,PEC,Pencil 4B,200,0.62,501
2003,PEC,Pencil 5B,100,0.73,501
2004,PEC,Pencil 6B,500,0.47,502
supplier.csv
supplierid,name,phone
501,ABC Traders,88881111
502,XYZ Company,88882222
503,QQ Corp,88883333
products_suppliers.csv
productID,supplierID
2001,501
2002,501
2003,501
2004,502
2001,503
Now accomplish all the queries given in the solution.
Select the product, its price, and its supplier name where the product price is less than 0.6, using SparkSQL.
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1:
hdfs dfs -mkdir sparksql2
hdfs dfs -put product.csv sparksql2/
hdfs dfs -put supplier.csv sparksql2/
hdfs dfs -put products_suppliers.csv sparksql2/
Step 2 : Now in spark shell
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._
// Import Spark SQL data types and Row.
import org.apache.spark.sql._
// load the data into a new RDD
val products = sc.textFile("sparksql2/product.csv")
val supplier = sc.textFile("sparksql2/supplier.csv")
val prdsup = sc.textFile("sparksql2/products_suppliers.csv")
// Return the first element in this RDD
products.first()
supplier.first()
prdsup.first()
//define the schema using a case class
case class Product(productid: Integer, code: String, name: String, quantity: Integer, price: Float, supplierid: Integer)
case class Suplier(supplierid: Integer, name: String, phone: String)
case class PRDSUP(productid: Integer, supplierid: Integer)
// create an RDD of Product objects
val prdRDD = products.map(_.split(",")).map(p => Product(p(0).toInt, p(1), p(2), p(3).toInt, p(4).toFloat, p(5).toInt))
val supRDD = supplier.map(_.split(",")).map(p => Suplier(p(0).toInt, p(1), p(2)))
val prdsupRDD = prdsup.map(_.split(",")).map(p => PRDSUP(p(0).toInt, p(1).toInt))
prdRDD.first()
prdRDD.count()
supRDD.first()
supRDD.count()
prdsupRDD.first()
prdsupRDD.count()
// change RDD of Product objects to a DataFrame
val prdDF = prdRDD.toDF()
val supDF = supRDD.toDF()
val prdsupDF = prdsupRDD.toDF()
// register the DataFrames as temp tables
prdDF.registerTempTable("products")
supDF.registerTempTable("suppliers")
prdsupDF.registerTempTable("productssuppliers")
// Select product, its price, its supplier name where product price is less than 0.6
val results = sqlContext.sql("""SELECT products.name, price, suppliers.name AS sup_name
FROM products JOIN suppliers ON products.supplierid = suppliers.supplierid
WHERE price < 0.6""")
results.show()
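One caveat: product.csv and supplier.csv as shown above begin with a header row, and the .toInt/.toFloat conversions would fail on it. If the files are created with those headers, drop them before the case-class mapping. A minimal hedged sketch, assuming the header text shown above (the *Data names are illustrative):
// filter out the header lines before building prdRDD / supRDD / prdsupRDD
val productsData = products.filter(line => !line.startsWith("productID"))
val supplierData = supplier.filter(line => !line.startsWith("supplierid"))
val prdsupData = prdsup.filter(line => !line.startsWith("productID"))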
CORRECT TEXT
Problem Scenario 66 : You have been given the below code snippet.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant", "falcon", "squid"), 2)
val d = c.keyBy(_.length)
operation1
Write a correct code snippet for operation1 which will produce the desired output, shown below.
Array[(Int, String)] = Array((4,lion))
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
b.subtractByKey(d).collect
subtractByKey [Pair] : Very similar to subtract, but instead of supplying a function, the key component of each pair will automatically be used as the criterion for removing items from the first RDD.
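A minimal spark-shell sketch of the full sequence, with the key sets spelled out:
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)                 // keys in b: 3, 5, 4, 3, 6, 5
val c = sc.parallelize(List("ant", "falcon", "squid"), 2)
val d = c.keyBy(_.length)                 // keys in d: 3, 6, 5
b.subtractByKey(d).collect                // only key 4 has no match in d -> Array((4,lion))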
CORRECT TEXT
Problem Scenario 18 : You have been given the following MySQL database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Now accomplish the following activities.
1. Create mysql table as below.
mysql --user=retail_dba --password=cloudera
use retail_db
CREATE TABLE IF NOT EXISTS departments_hive02(id int, department_name
varchar(45), avg_salary int);
show tables;
2. Now export data from the Hive table departments_hive01 into departments_hive02. While exporting, please note the following: wherever there is an empty string, it should be loaded as a null value in MySQL,
and wherever there is a -999 value for an int field, it should be loaded as a null value.
Answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create table in mysql db as well.
mysql --user=retail_dba --password=cloudera
use retail_db
CREATE TABLE IF NOT EXISTS departments_hive02(id int, department_name
varchar(45), avg_salary int);
show tables;
Step 2 : Now export data from hive table to mysql table as per the requirement.
sqoop export --connect jdbc:mysql://quickstart:3306/retail_db \
--username retail_dba \
--password cloudera \
--table departments_hive02 \
--export-dir /user/hive/warehouse/departments_hive01 \
--input-fields-terminated-by '\001' \
--input-lines-terminated-by '\n' \
--num-mappers 1 \
--batch \
--input-null-string "" \
--input-null-non-string -999
Step 3 : Now validate the data.
select * from departments_hive02;
CORRECT TEXT
Problem Scenario 21 : You have been given a log generating service as below.
start_logs (It will generate continuous logs)
tail_logs (You can check what logs are being generated)
stop_logs (It will stop the log service)
Path where logs are generated using above service : /opt/gen_logs/logs/access.log
Now write a Flume configuration file named flume1.conf and, using that configuration file, dump the logs into the HDFS file system in a directory called flume1. The Flume channel should also have the following properties: it should commit after every 100 messages, it should use a non-durable/faster channel, and it should be able to hold a maximum of 1000 events.
Solution :
Step 1 : Create the Flume configuration file with the below configuration for source, sink and channel.
# Define source, sink, channel and agent.
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
# Describe/configure source1
agent1.sources.source1.type = exec
agent1.sources.source1.command = tail -F /opt/gen_logs/logs/access.log
# Describe sink1
agent1.sinks.sink1.channel = channel1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = flume1
agent1.sinks.sink1.hdfs.fileType = DataStream
# Now we need to define the channel1 properties.
agent1.channels.channel1.type = memory
agent1.channels.channel1.capacity = 1000
agent1.channels.channel1.transactionCapacity = 100
# Bind the source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
Step 2 : Run the below command, which will use this configuration file and append data in HDFS.
Start the log service using: start_logs
Start the Flume service:
flume-ng agent --conf /home/cloudera/flumeconf --conf-file /home/cloudera/flumeconf/flume1.conf --name agent1 -Dflume.root.logger=DEBUG,console
Wait for a few minutes and then stop the log service.
stop_logs
Answer:
See the explanation for Step by Step Solution and configuration.
