Latest Professional-Data-Engineer Free Dumps - Google Certified Professional Data Engineer

Question 1

Your team is responsible for developing and maintaining ETLs in your company. One of your Dataflow jobs is failing because of some errors in the input data, and you need to improve the reliability of the pipeline (including the ability to reprocess all failing data).
What should you do?

A. Add a filtering step to skip these types of errors in the future, extract erroneous rows from logs.

B. Add a try...catch block to your DoFn that transforms the data, extract erroneous rows from logs.

C. Add a try...catch block to your DoFn that transforms the data, use a sideOutput to create a PCollection that can be stored to Pub/Sub later.

D. Add a try...catch block to your DoFn that transforms the data, write erroneous rows to Pub/Sub directly from the DoFn.

Answer: D
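
The try...catch options above share one pattern: catch the failure per element and keep the raw record somewhere it can be replayed. A minimal sketch of that pattern in plain Python rather than the Beam SDK; `parse_row`, `process`, and the `dead_letter` list (standing in for a Pub/Sub topic or a side output) are hypothetical names used only for illustration.

```python
import json

def parse_row(raw):
    """Hypothetical transform: parse a JSON record and require an integer 'id'."""
    row = json.loads(raw)
    return {"id": int(row["id"])}

def process(raw_rows):
    """Return (good, dead_letter); failed inputs are kept verbatim for replay."""
    good, dead_letter = [], []
    for raw in raw_rows:
        try:
            good.append(parse_row(raw))
        except (ValueError, KeyError, TypeError) as err:
            # Keep the original payload plus the error so it can be reprocessed.
            dead_letter.append({"payload": raw, "error": repr(err)})
    return good, dead_letter

good, dead_letter = process(['{"id": "1"}', 'not json', '{"name": "x"}'])
```

Because the failing payloads are stored unchanged, they can be fed back through `process` once the transform is fixed.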

Question 2

After migrating ETL jobs to run on BigQuery, you need to verify that the output of the migrated jobs is the same as the output of the original. You've loaded a table containing the output of the original job and want to compare the contents with output from the migrated job to show that they are identical. The tables do not contain a primary key column that would enable you to join them together for comparison.
What should you do?

A. Select random samples from the tables using the RAND() function and compare the samples.

B. Use a Dataproc cluster and the BigQuery Hadoop connector to read the data from each table and calculate a hash from non-timestamp columns of the table after sorting. Compare the hashes of each table.

C. Create stratified random samples using the OVER() function and compare equivalent samples from each table.

D. Select random samples from the tables using the HASH() function and compare the samples.

Answer: D
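
The hash-after-sorting comparison described in option B can be sketched in plain Python. This is an illustration under assumptions, not a BigQuery or Dataproc implementation: rows are modeled as dictionaries, and `table_fingerprint` and the `loaded_at` timestamp column are hypothetical names.

```python
import hashlib

def table_fingerprint(rows, ignore=()):
    """Hash every row (minus ignored columns), sort the row hashes, then hash the list."""
    row_hashes = []
    for row in rows:
        kept = sorted((k, repr(v)) for k, v in row.items() if k not in ignore)
        row_hashes.append(hashlib.sha256(repr(kept).encode("utf-8")).hexdigest())
    row_hashes.sort()  # sorting makes the result independent of row order
    return hashlib.sha256("".join(row_hashes).encode("utf-8")).hexdigest()

original = [{"a": 1, "b": "x", "loaded_at": "t1"}, {"a": 2, "b": "y", "loaded_at": "t2"}]
migrated = [{"a": 2, "b": "y", "loaded_at": "t9"}, {"a": 1, "b": "x", "loaded_at": "t8"}]

same = (table_fingerprint(original, ignore={"loaded_at"})
        == table_fingerprint(migrated, ignore={"loaded_at"}))
```

Sorting the per-row hashes, rather than collecting them in a set, keeps duplicate rows significant, so two tables that differ only in how often a row appears still produce different fingerprints.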

Question 3

You want to use Google Stackdriver Logging to monitor Google BigQuery usage. You need an instant notification to be sent to your monitoring tool when new data is appended to a certain table using an insert job, but you do not want to receive notifications for other tables. What should you do?

A. In the Stackdriver Logging admin interface, enable a log sink export to BigQuery.

B. Make a call to the Stackdriver API to list all logs, and apply an advanced filter.

C. Using the Stackdriver API, create a project sink with advanced log filter to export to Pub/Sub, and subscribe to the topic from your monitoring tool.

D. In the Stackdriver Logging admin interface, enable a log sink export to Google Cloud Pub/Sub, and subscribe to the topic from your monitoring tool.

Answer: A

Question 4

Which of the following is NOT one of the three main types of triggers that Dataflow supports?

A. Trigger based on element count

B. Trigger based on element size in bytes

C. Trigger that is a combination of other triggers

D. Trigger based on time

Answer: B

Question 5

You work for a farming company. You have one BigQuery table named sensors, which is about 500 MB and contains the list of your 5000 sensors, with columns for id, name, and location. This table is updated every hour. Each sensor generates one metric every 30 seconds along with a timestamp, which you want to store in BigQuery. You want to run an analytical query on the data once a week for monitoring purposes. You also want to minimize costs. What data model should you use?

A. 1. Create a metrics column in the sensors table.
2. Set RECORD type and REPEATED mode for the metrics column.
3. Use an UPDATE statement every 30 seconds to add new metrics.

B. 1. Create a metrics table partitioned by timestamp.
2. Create a sensorId column in the metrics table, that points to the id column in the sensors table.
3. Use an INSERT statement every 30 seconds to append new metrics to the metrics table.
4. Join the two tables, if needed, when running the analytical query.

C. 1. Create a metrics table partitioned by timestamp.
2. Create a sensorId column in the metrics table, that points to the id column in the sensors table.
3. Use an UPDATE statement every 30 seconds to append new metrics to the metrics table.
4. Join the two tables, if needed, when running the analytical query.

D. 1. Create a metrics column in the sensors table.
2. Set RECORD type and REPEATED mode for the metrics column.
3. Use an INSERT statement every 30 seconds to add new metrics.

Answer: B
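
The data volume implied by the question can be worked out directly from its figures (5000 sensors, one metric every 30 seconds). A short arithmetic sketch; the variable names are illustrative only.

```python
SENSORS = 5000
SECONDS_PER_METRIC = 30

metrics_per_sensor_per_day = 24 * 60 * 60 // SECONDS_PER_METRIC  # 2880
rows_per_day = SENSORS * metrics_per_sensor_per_day               # 14,400,000
rows_per_week = rows_per_day * 7                                  # 100,800,000
```

At roughly 100 million new rows between weekly queries, how the metrics are appended and how much data each query scans both matter for cost.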

Question 6

You are integrating one of your internal IT applications and Google BigQuery, so users can query BigQuery from the application's interface. You do not want individual users to authenticate to BigQuery and you do not want to give them access to the dataset. You need to securely access BigQuery from your IT application.
What should you do?

A. Create a service account and grant dataset access to that account. Use the service account's private key to access the dataset.

B. Integrate with a single sign-on (SSO) platform, and pass each user's credentials along with the query request.

C. Create groups for your users and give those groups access to the dataset.

D. Create a dummy user and grant dataset access to that user. Store the username and password for that user in a file on the file system, and use those credentials to access the BigQuery dataset.

Answer: A

Question 7

When you design a Google Cloud Bigtable schema it is recommended that you _________.

A. Avoid schema designs that require atomicity across rows

B. Create schema designs that are based on a relational database design

C. Create schema designs that require atomicity across rows

D. Avoid schema designs that are based on NoSQL concepts

Answer: A

Question 8

You are designing a data mesh on Google Cloud with multiple distinct data engineering teams building data products. The typical data curation design pattern consists of landing files in Cloud Storage, transforming raw data in Cloud Storage and BigQuery datasets, and storing the final curated data product in BigQuery datasets. You need to configure Dataplex to ensure that each team can access only the assets needed to build their data products. You also need to ensure that teams can easily share the curated data product. What should you do?

A. 1. Create a Dataplex virtual lake for each data product, and create multiple zones for landing, raw, and curated data.
2. Provide the data engineering teams with full access to the virtual lake assigned to their data product.

B. 1. Create a single Dataplex virtual lake and create a single zone to contain landing, raw, and curated data.
2. Provide each data engineering team access to the virtual lake.

C. 1. Create a Dataplex virtual lake for each data product, and create a single zone to contain landing, raw, and curated data.
2. Provide the data engineering teams with full access to the virtual lake assigned to their data product.

D. 1. Create a single Dataplex virtual lake and create a single zone to contain landing, raw, and curated data.
2. Build separate assets for each data product within the zone.
3. Assign permissions to the data engineering teams at the zone level.

Answer: A

Question 9

You issue a new batch job to Dataflow. The job starts successfully, processes a few elements, and then suddenly fails and shuts down. You navigate to the Dataflow monitoring interface where you find errors related to a particular DoFn in your pipeline. What is the most likely cause of the errors?

A. Job validation

B. Graph or pipeline construction

C. Insufficient permissions

D. Exceptions in worker code

Answer: D

Question 10

Which of these are examples of a value in a sparse vector? (Select 2 answers.)

A. [0, 0, 0, 1, 0, 0, 1]

B. [1, 0, 0, 0, 0, 0, 0]

C. [0, 5, 0, 0, 0, 0]

D. [0, 1]

Answer: B, D
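
A sparse vector is one whose entries are mostly zero, so it is often stored as only its non-zero positions. A minimal sketch of that idea; `to_sparse` and `sparsity` are hypothetical helper names, not part of any particular library.

```python
def to_sparse(dense):
    """Return (length, {index: value}) keeping only the non-zero entries."""
    return len(dense), {i: v for i, v in enumerate(dense) if v != 0}

def sparsity(dense):
    """Fraction of entries that are zero."""
    return dense.count(0) / len(dense)

length, nonzero = to_sparse([1, 0, 0, 0, 0, 0, 0])
```

Here `nonzero` holds a single index-value pair, which is all that needs to be stored alongside the vector length.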

Question 11

You are using Workflows to call an API that returns a 1 KB JSON response, apply some complex business logic on this response, wait for the logic to complete, and then perform a load from a Cloud Storage file to BigQuery. The Workflows standard library does not have sufficient capabilities to perform your complex logic, and you want to use Python's standard library instead. You want to optimize your workflow for simplicity and speed of execution. What should you do?

A. Invoke a Cloud Function instance that uses Python to apply the logic on your JSON file.

B. Create a Cloud Composer environment and run the logic in Cloud Composer.

C. Create a Dataproc cluster, and use PySpark to apply the logic on your JSON file.

D. Invoke a subworkflow in Workflows to apply the logic on your JSON file.

Answer: A

Question 12

You are implementing workflow pipeline scheduling using open source-based tools and Google Kubernetes Engine (GKE). You want to use a Google managed service to simplify and automate the task. You also want to accommodate Shared VPC networking considerations. What should you do?

A. Use Dataflow for your workflow pipelines. Use Cloud Run triggers for scheduling.

B. Use Cloud Composer in a Shared VPC configuration. Place the Cloud Composer resources in the host project.

C. Use Dataflow for your workflow pipelines. Use shell scripts to schedule workflows.

D. Use Cloud Composer in a Shared VPC configuration. Place the Cloud Composer resources in the service project.

Answer: D

Question 13

Which is the preferred method to use to avoid hotspotting in time series data in Bigtable?

A. Randomization

B. Field promotion

C. Salting

D. Hashing

Answer: B
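
Field promotion means moving an identifying field out of the column data and into the front of the row key, ahead of the timestamp. A minimal sketch using plain strings; the `#` separator, the sensor names, and both helper functions are assumptions made for illustration, not Bigtable API calls.

```python
def timestamp_first_key(sensor_id, ts):
    """Key that starts with the timestamp: the newest writes all sort together."""
    return f"{ts}#{sensor_id}"

def promoted_key(sensor_id, ts):
    """Field promotion: the identifying field comes before the timestamp."""
    return f"{sensor_id}#{ts}"

readings = [("sensor-a", "20240101T000000"), ("sensor-b", "20240101T000000"),
            ("sensor-a", "20240101T000030"), ("sensor-b", "20240101T000030")]

by_timestamp = sorted(timestamp_first_key(s, t) for s, t in readings)
promoted = sorted(promoted_key(s, t) for s, t in readings)
```

With timestamp-first keys the latest readings from every sensor land at the end of the sorted key space, while promoted keys spread concurrent writes across one key range per sensor.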

Question 14

Does Dataflow process batch data pipelines or streaming data pipelines?

A. Only Streaming Data Pipelines

B. Both Batch and Streaming Data Pipelines

C. Only Batch Data Pipelines

D. None of the above

Answer: B

Question 15

You are working on a linear regression model on BigQuery ML to predict a customer's likelihood of purchasing your company's products. Your model uses a city name variable as a key predictive component. In order to train and serve the model, your data must be organized in columns. You want to prepare your data using the least amount of coding while maintaining the predictable variables. What should you do?

A. Use SQL in BigQuery to transform the state column using a one-hot encoding method, and make each city a column with binary values.

B. Use TensorFlow to create a categorical variable with a vocabulary list. Create the vocabulary file and upload that as part of your model to BigQuery ML.

C. Create a new view with BigQuery that does not include a column with city information.

D. Use Cloud Data Fusion to assign each city to a region that is labeled as 1, 2, 3, 4, or 5, and then use that number to represent the city in the model.

Answer: D
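
The one-hot encoding mentioned in option A turns a categorical column into one binary column per distinct value. A minimal plain-Python sketch of the technique, not BigQuery SQL; the `one_hot` helper and the city names are hypothetical.

```python
def one_hot(values):
    """Return (categories, rows) where each row has one binary column per category."""
    categories = sorted(set(values))
    rows = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, rows

categories, rows = one_hot(["Seoul", "Busan", "Seoul", "Incheon"])
```

Each encoded row contains exactly one 1, in the column that matches its city, so no artificial ordering between cities is introduced.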

Question 16

You need to give new website users a globally unique identifier (GUID) using a service that takes in data points and returns a GUID. This data is sourced from both internal and external systems via HTTP calls that you will make via microservices within your pipeline. There will be tens of thousands of messages per second, which can be multithreaded, and you worry about the backpressure on the system. How should you design your pipeline to minimize that backpressure?

A. Batch the job into ten-second increments

B. Call out to the service via HTTP

C. Create the pipeline statically in the class definition

D. Create a new object in the startBundle method of DoFn

Answer: B

Question 17

What are the minimum permissions needed for a service account used with Google Dataproc?

A. Write to Google Cloud Storage; read to Google Cloud Logging

B. Read and write to Google Cloud Storage; write to Google Cloud Logging

C. Execute to Google Cloud Storage; write to Google Cloud Logging

D. Execute to Google Cloud Storage; execute to Google Cloud Logging

Answer: B

Question 18

You have a Standard Tier Memorystore for Redis instance deployed in a production environment. You need to simulate a Redis instance failover in the most accurate disaster recovery situation, and ensure that the failover has no impact on production data. What should you do?

A. Create a Standard Tier Memorystore for Redis instance in a development environment. Initiate a manual failover by using the force-data-loss data protection mode.

B. Initiate a manual failover by using the limited-data-loss data protection mode to the Memorystore for Redis instance in the production environment.

C. Add one replica to the Redis instance in the production environment. Initiate a manual failover by using the force-data-loss data protection mode.

D. Create a Standard Tier Memorystore for Redis instance in the development environment. Initiate a manual failover by using the limited-data-loss data protection mode.

Answer: D

Question 19

You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:
No interaction by the user on the site for 1 hour
Has added more than $30 worth of products to the basket
Has not completed a transaction
You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?

A. Use a session window with a gap time duration of 60 minutes.

B. Use a global window with a time-based trigger with a delay of 60 minutes.

C. Use a fixed-time window with a duration of 60 minutes.

D. Use a sliding time window with a duration of 60 minutes.

Answer: A
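
A session window groups a user's events until a gap of inactivity closes the session. The behavior can be sketched in plain Python rather than the Dataflow SDK; `sessionize` is a hypothetical helper, and timestamps are assumed to be seconds for a single user.

```python
GAP_SECONDS = 60 * 60  # 60-minute inactivity gap from the question

def sessionize(timestamps, gap=GAP_SECONDS):
    """Group event times (seconds) into sessions, splitting where the gap is >= `gap`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])    # inactivity gap reached: start a new session
    return sessions

sessions = sessionize([0, 600, 1200, 6000, 6300])
```

Each closed session corresponds to one hour without interaction, which is the point at which the basket value and the missing transaction can be checked.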
