Latest Associate-Developer-Apache-Spark-3.5 Free Dumps - Databricks Certified Associate Developer for Apache Spark 3.5 - Python
What is the relationship between jobs, stages, and tasks during execution in Apache Spark?
Options:
Answer: C
Explanation: (Visible only to DumpTOP members)
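For reference, a minimal sketch of how the hierarchy plays out (the example data is illustrative, not from the question): each action submits a job, each shuffle boundary splits the job into stages, and each stage runs one task per partition.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Narrow transformation (filter): no shuffle, stays within one stage
df = spark.range(1_000_000).filter(col("id") % 2 == 0)

# Wide transformation (groupBy): forces a shuffle, creating a new stage
counts = df.groupBy((col("id") % 10).alias("bucket")).count()

# The action submits a job; each stage runs one task per partition
counts.show()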
What is the difference between df.cache() and df.persist() on a Spark DataFrame?
Answer: B
Explanation: (Visible only to DumpTOP members)
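As a quick sketch of the distinction (using two separate DataFrames, since a storage level cannot be changed once assigned): cache() is shorthand for persist() with the default storage level, while persist() accepts an explicit StorageLevel.

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.range(100)
df2 = spark.range(100)

# cache() uses the default storage level (MEMORY_AND_DISK for DataFrames)
df1.cache()

# persist() lets you choose the storage level explicitly
df2.persist(StorageLevel.DISK_ONLY)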
Given:
spark.sparkContext.setLogLevel("<LOG_LEVEL>")
Which set contains the suitable configuration settings for Spark driver LOG_LEVELs?
Answer: D
Explanation: (Visible only to DumpTOP members)
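For context, a short sketch of setLogLevel in use; per the Spark documentation, the accepted values are ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, and WARN.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Accepted levels: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
spark.sparkContext.setLogLevel("WARN")  # quiet down INFO-level noise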
Which command overwrites an existing JSON file when writing a DataFrame?
Answer: B
Explanation: (Visible only to DumpTOP members)
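A minimal sketch, assuming a hypothetical output path: passing mode("overwrite") to the DataFrameWriter replaces any existing output at the target location instead of failing.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# mode("overwrite") replaces existing output instead of raising an error
df.write.mode("overwrite").json("/tmp/output_json")  # hypothetical path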
Given the code:

df = spark.read.csv("large_dataset.csv")
filtered_df = df.filter(col("error_column").contains("error"))
mapped_df = filtered_df.select(split(col("timestamp"), " ").getItem(0).alias("date"), lit(1).alias("count"))
reduced_df = mapped_df.groupBy("date").sum("count")
reduced_df.count()
reduced_df.show()
At which point will Spark actually begin processing the data?
Answer: C
Explanation: (Visible only to DumpTOP members)
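As background for this question: read, filter, select, and groupBy are lazy transformations that only build the query plan; Spark begins processing when the first action runs. A minimal sketch with illustrative data:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Transformations: nothing executes yet, only the plan is built
df = spark.range(1000)
evens = df.filter(col("id") % 2 == 0)
grouped = evens.groupBy((col("id") % 10).alias("bucket")).count()

# The first action triggers execution of the entire plan
grouped.count()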
A data engineer is running a Spark job to process a 1 TB dataset stored in distributed storage. The cluster has 10 nodes, each with 16 CPUs. The Spark UI shows:
- Low number of Active Tasks
- Many tasks complete in milliseconds
- Fewer tasks than available CPUs
Which approach should be used to adjust the partitioning for optimal resource allocation?
Answer: C
Explanation: (Visible only to DumpTOP members)
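The symptoms above point to having fewer partitions than available cores. A hedged sketch of one common adjustment (the 2x multiplier is a rule of thumb, not a fixed prescription):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 10 nodes x 16 CPUs = 160 cores; 2-4 partitions per core is a common heuristic
df = spark.read.csv("large_dataset.csv")
df = df.repartition(320)  # illustrative target; tune for the workload

# Shuffle-heavy operations are governed by this setting as well
spark.conf.set("spark.sql.shuffle.partitions", "320")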
An MLOps engineer is building a Pandas UDF that applies a language model to translate English strings into Spanish. The initial code loads the model on every call to the UDF, which hurts the performance of the data pipeline.
The initial code is:

def in_spanish_inner(df: pd.Series) -> pd.Series:
    model = get_translation_model(target_lang='es')
    return df.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())
How can the MLOps engineer change this code to reduce how many times the language model is loaded?
Answer: D
Explanation: (Visible only to DumpTOP members)
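One documented way to avoid reloading the model on every batch is the Iterator[pd.Series] -> Iterator[pd.Series] Pandas UDF variant, which loads the model once per executor process and reuses it across batches. A sketch reusing the question's hypothetical get_translation_model helper:

from typing import Iterator

import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

@sf.pandas_udf(StringType())
def in_spanish(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Load the model once per executor process, not once per batch
    model = get_translation_model(target_lang='es')  # helper from the question
    for batch in batches:
        yield batch.apply(model)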
How can a Spark developer ensure optimal resource utilization when running Spark jobs in Local Mode for testing?
Options:
Answer: A
Explanation: (Visible only to DumpTOP members)
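A brief sketch of the usual Local Mode setup for testing: master("local[*]") runs one worker thread per available CPU core on the machine.

from pyspark.sql import SparkSession

# local[*] uses as many worker threads as there are cores on the machine
spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-test")  # hypothetical app name
         .getOrCreate())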