
Get Instant Access to Professional-Data-Engineer Practice Exam Questions
NEW QUESTION # 132
Suppose you have a dataset of images that are each labeled as to whether or not they contain a human face. To create a neural network that recognizes human faces in images using this labeled dataset, what approach would likely be the most effective?
- A. Build a neural network with an input layer of pixels, a hidden layer, and an output layer with two categories.
- B. Use deep learning by creating a neural network with multiple hidden layers to automatically detect features of faces.
- C. Use feature engineering to add features for eyes, noses, and mouths to the input data.
- D. Use K-means Clustering to detect faces in the pixels.
Answer: B
Explanation:
Traditional machine learning relies on shallow nets, composed of one input and one output layer, and at most one hidden layer in between. More than three layers (including input and output) qualifies as "deep" learning.
So deep is a strictly defined, technical term that means more than one hidden layer.
In deep-learning networks, each layer of nodes trains on a distinct set of features based on the previous layer's output. The further you advance into the neural net, the more complex the features your nodes can recognize, since they aggregate and recombine features from the previous layer.
A neural network with only one hidden layer would be unable to automatically recognize high-level features of faces, such as eyes, because it wouldn't be able to "build" these features using previous hidden layers that detect low-level features, such as lines.
Feature engineering is difficult to perform on raw image data.
K-means Clustering is an unsupervised learning method used to categorize unlabeled data.
Reference: https://deeplearning4j.org/neuralnet-overview
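To make the layer stacking concrete, here is a minimal sketch of a forward pass through a network with two hidden layers, written in plain Python. The layer widths, the 4-pixel input, and the random weights are all illustrative assumptions; a real face detector would learn its weights from the labeled images and would normally be built with a deep-learning library.

```python
import math
import random


def forward(pixels, layer_sizes, seed=0):
    """Run one forward pass through a fully connected network.

    layer_sizes lists the width of every layer after the input, so
    [16, 8, 2] means two hidden layers followed by a 2-way output.
    Weights are random here (assumption); training would learn them.
    """
    rng = random.Random(seed)
    activations = list(pixels)
    for depth, width in enumerate(layer_sizes):
        is_output = depth == len(layer_sizes) - 1
        next_activations = []
        for _ in range(width):
            weights = [rng.uniform(-1, 1) for _ in activations]
            total = sum(w * a for w, a in zip(weights, activations))
            # ReLU on hidden layers; the output layer stays linear for softmax.
            next_activations.append(total if is_output else max(0.0, total))
        # Each layer consumes only the previous layer's output.
        activations = next_activations
    # Softmax turns the two output scores into class probabilities.
    peak = max(activations)
    exps = [math.exp(a - peak) for a in activations]
    return [e / sum(exps) for e in exps]


def hidden_layer_count(layer_sizes):
    """Number of hidden layers: everything except the final output layer."""
    return len(layer_sizes) - 1


image = [0.0, 0.5, 1.0, 0.25]   # made-up 4-pixel "image"
deep = [16, 8, 2]               # two hidden layers, so "deep" by the definition above
probs = forward(image, deep)
print(hidden_layer_count(deep), probs)
```

Option A corresponds to a `layer_sizes` list with a single hidden layer, such as `[16, 2]`.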
NEW QUESTION # 133
Your company is currently setting up data pipelines for their campaign. For all the Google Cloud Pub/Sub streaming data, one of the important business requirements is to be able to periodically identify the inputs and their timings during their campaign. Engineers have decided to use windowing and transformation in Google Cloud Dataflow for this purpose. However, when testing this feature, they find that the Cloud Dataflow job fails for all streaming inserts. What is the most likely cause of this problem?
- A. They have not set the triggers to accommodate the data coming in late, which causes the job to fail
- B. They have not applied a global windowing function, which causes the job to fail when the pipeline is created
- C. They have not assigned the timestamp, which causes the job to fail
- D. They have not applied a non-global windowing function, which causes the job to fail when the pipeline is created
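Answer: D
Explanation:
Pub/Sub data arrives as an unbounded PCollection, which is placed in the single global window by default. Grouping or combining an unbounded PCollection requires either a non-global windowing function or a non-default trigger; without one, the pipeline raises an error when it is constructed.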
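As an illustration of what a non-global windowing function does, the sketch below assigns timestamped events to fixed 60-second windows and counts them per window. It is plain Python, not the Dataflow or Apache Beam API; the function names and sample events are hypothetical.

```python
from collections import defaultdict


def fixed_window_start(timestamp_s, size_s):
    """Return the start of the fixed window that contains timestamp_s."""
    return timestamp_s - (timestamp_s % size_s)


def count_per_window(events, size_s):
    """Count events per fixed window; events are (timestamp_s, payload) pairs."""
    counts = defaultdict(int)
    for timestamp_s, _payload in events:
        counts[fixed_window_start(timestamp_s, size_s)] += 1
    return dict(counts)


# Hypothetical campaign inputs with event-time timestamps in seconds.
events = [(3, "click"), (59, "view"), (60, "click"), (125, "view")]
print(count_per_window(events, size_s=60))  # {0: 2, 60: 1, 120: 1}
```

Because each window is finite, a per-window result can be emitted even though the stream itself never ends.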
NEW QUESTION # 134
The YARN ResourceManager and the HDFS NameNode interfaces are available on a Cloud Dataproc cluster ____.
- A. conditional node
- B. application node
- C. worker node
- D. master node
Answer: D
Explanation:
The YARN ResourceManager and the HDFS NameNode interfaces are available on a Cloud Dataproc cluster master node. The cluster master-host-name is the name of your Cloud Dataproc cluster followed by an -m suffix. For example, if your cluster is named "my-cluster", the master-host-name would be "my-cluster-m".
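The naming rule can be sketched in a few lines of Python. The helper names are hypothetical, and the ports (8088 for the YARN ResourceManager, 9870 for the HDFS NameNode) are the usual Hadoop 3 web UI defaults, assumed here rather than taken from the text above; older images may use a different NameNode port.

```python
def master_host_name(cluster_name):
    """Master node host name: the cluster name plus the -m suffix."""
    return f"{cluster_name}-m"


def web_ui_urls(cluster_name, yarn_port=8088, hdfs_port=9870):
    """Build web UI URLs on the master node (ports are assumed defaults)."""
    host = master_host_name(cluster_name)
    return {
        "yarn_resourcemanager": f"http://{host}:{yarn_port}",
        "hdfs_namenode": f"http://{host}:{hdfs_port}",
    }


print(master_host_name("my-cluster"))  # my-cluster-m
print(web_ui_urls("my-cluster"))
```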
NEW QUESTION # 135
Flowlogistic wants to use Google BigQuery as their primary analysis system, but they still have Apache Hadoop and Spark workloads that they cannot move to BigQuery. Flowlogistic does not know how to store the data that is common to both workloads. What should they do?
- A. Store the common data encoded as Avro in Google Cloud Storage.
- B. Store the common data in BigQuery as partitioned tables.
- C. Store the common data in BigQuery and expose authorized views.
- D. Store the common data in the HDFS storage for a Google Cloud Dataproc cluster.
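Answer: A
Explanation:
Avro files in Google Cloud Storage can be read by Hadoop and Spark jobs and can also be loaded into or queried from BigQuery, so a single copy of the data serves both workloads.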
NEW QUESTION # 136
You are developing a software application using Google's Dataflow SDK, and want to use conditionals, for loops, and other complex programming structures to create a branching pipeline. Which component will be used for the data processing operation?
- A. Transform
- B. PCollection
- C. Sink API
- D. Pipeline
Answer: A
Explanation:
In Google Cloud, the Dataflow SDK provides a transform component, which is responsible for the data processing operation. You can use conditionals, for loops, and other complex programming structures to create a branching pipeline.
NEW QUESTION # 137
You have data pipelines running on BigQuery, Cloud Dataflow, and Cloud Dataproc. You need to perform health checks and monitor their behavior, and then notify the team managing the pipelines if they fail. You also need to be able to work across multiple projects. Your preference is to use managed products or features of the platform. What should you do?
- A. Develop an App Engine application to consume logs using GCP API calls, and send emails if you find a failure in the logs
- B. Export the logs to BigQuery, and set up App Engine to read that information and send emails if you find a failure in the logs
- C. Export the information to Cloud Stackdriver, and set up an Alerting policy
- D. Run a Virtual Machine in Compute Engine with Airflow, and export the information to Stackdriver
Answer: C
Explanation:
Stackdriver Monitoring not only gives you access to Dataflow-related metrics, but also lets you create alerting policies and dashboards, so you can chart time series of metrics and choose to be notified when these metrics reach specified values.
NEW QUESTION # 138
Which role must be assigned to a service account used by the virtual machines in a Dataproc cluster so they can execute jobs?
- A. Dataproc Worker
- B. Dataproc Editor
- C. Dataproc Runner
- D. Dataproc Viewer
Answer: A
Explanation:
Service accounts used with Cloud Dataproc must have the Dataproc Worker role, roles/dataproc.worker (or have all the permissions granted by that role).
Reference: https://cloud.google.com/dataproc/docs/concepts/service-accounts#important_notes
NEW QUESTION # 139
You want to analyze hundreds of thousands of social media posts daily at the lowest cost and with the fewest steps.
You have the following requirements:
* You will batch-load the posts once per day and run them through the Cloud Natural Language API.
* You will extract topics and sentiment from the posts.
* You must store the raw posts for archiving and reprocessing.
* You will create dashboards to be shared with people both inside and outside your organization.
You need to store both the data extracted from the API to perform analysis as well as the raw social media posts for historical archiving. What should you do?
- A. Store the social media posts and the data extracted from the API in BigQuery.
- B. Feed the social media posts into the API directly from the source, and write the extracted data from the API into BigQuery.
- C. Store the raw social media posts in Cloud Storage, and write the data extracted from the API into BigQuery.
- D. Store the social media posts and the data extracted from the API in Cloud SQL.
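Answer: C
Explanation:
The raw posts must be kept for archiving and reprocessing, which rules out feeding them to the API straight from the source without storing them. Cloud Storage is a low-cost home for the raw posts, and BigQuery holds the extracted topics and sentiment for analysis and dashboards.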
NEW QUESTION # 140
Which of these rules apply when you add preemptible workers to a Dataproc cluster (select 2 answers)?
- A. Preemptible workers cannot store data.
- B. Preemptible workers cannot use persistent disk.
- C. If a preemptible worker is reclaimed, then a replacement worker must be added manually.
- D. A Dataproc cluster cannot have only preemptible workers.
Answer: A,D
Explanation:
The following rules apply when you use preemptible workers with a Cloud Dataproc cluster:
* Processing only: Since preemptibles can be reclaimed at any time, preemptible workers do not store data. Preemptibles added to a Cloud Dataproc cluster only function as processing nodes.
* No preemptible-only clusters: To ensure clusters do not lose all workers, Cloud Dataproc cannot create preemptible-only clusters.
* Persistent disk size: By default, all preemptible workers are created with the smaller of 100GB or the primary worker boot disk size. This disk space is used for local caching of data and is not available through HDFS.
The managed group automatically re-adds workers lost due to reclamation as capacity permits.
Reference: https://cloud.google.com/dataproc/docs/concepts/preemptible-vms
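The two numeric rules in this explanation can be written as a small check. This is a plain-Python sketch with hypothetical function names; it does not call any Dataproc API and only encodes the rules as stated above.

```python
def check_worker_counts(primary_workers, preemptible_workers):
    """Return a list of rule violations for a planned cluster shape."""
    problems = []
    if primary_workers < 0 or preemptible_workers < 0:
        problems.append("worker counts cannot be negative")
    if preemptible_workers > 0 and primary_workers <= 0:
        # Rule: no preemptible-only clusters.
        problems.append("a cluster cannot consist only of preemptible workers")
    return problems


def preemptible_boot_disk_gb(primary_boot_disk_gb):
    """Default preemptible disk: the smaller of 100 GB or the primary boot disk."""
    return min(100, primary_boot_disk_gb)


print(check_worker_counts(primary_workers=2, preemptible_workers=4))  # []
print(check_worker_counts(primary_workers=0, preemptible_workers=4))
print(preemptible_boot_disk_gb(500))  # 100
print(preemptible_boot_disk_gb(50))   # 50
```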
NEW QUESTION # 141
Which row keys are likely to cause a disproportionate number of reads and/or writes on a particular node in a Bigtable cluster (select 2 answers)?
- A. A sequential numeric ID
- B. A non-sequential numeric ID
- C. A timestamp followed by a stock symbol
- D. A stock symbol followed by a timestamp
Answer: A,C
Explanation:
Using a timestamp as the first element of a row key can cause a variety of problems.
In brief, when a row key for a time series starts with a timestamp, all of your writes target a single node, fill that node, and then move on to the next node in the cluster, resulting in hotspotting.
Suppose your system assigns a numeric ID to each of your application's users. You might be tempted to use the user's numeric ID as the row key for your table. However, since new users are more likely to be active users, this approach is likely to push most of your traffic to a small number of nodes.
Reference:
https://cloud.google.com/bigtable/docs/schema-design
https://cloud.google.com/bigtable/docs/schema-design-time-series#ensure_that_your_row_key_avoids_hotspotti
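A small simulation can show why the key order matters. Bigtable keeps rows sorted by row key, so the sketch below sorts a set of keys and reports where the newest writes land. The key formats, symbols, and timestamps are made up for illustration, and the code does not use the Bigtable client library.

```python
def timestamp_first(symbol, ts):
    """Row key with the timestamp as the leading element (hotspot-prone)."""
    return f"{ts:010d}#{symbol}"


def symbol_first(symbol, ts):
    """Row key with the stock symbol as the leading element."""
    return f"{symbol}#{ts:010d}"


def positions_of_latest(key_fn, writes):
    """Indexes, within the sorted key space, of the most recent writes."""
    keys = sorted(key_fn(sym, ts) for sym, ts in writes)
    latest_ts = max(ts for _sym, ts in writes)
    latest = {key_fn(sym, ts) for sym, ts in writes if ts == latest_ts}
    return [i for i, key in enumerate(keys) if key in latest]


# Hypothetical writes: three symbols at four consecutive timestamps.
writes = [(sym, ts) for ts in range(1000, 1004) for sym in ("GOOG", "AAPL", "MSFT")]

print(positions_of_latest(timestamp_first, writes))  # [9, 10, 11]
print(positions_of_latest(symbol_first, writes))     # [3, 7, 11]
```

With the timestamp first, the newest writes sit next to each other at the end of the key space, which is the contiguous range a single node would serve. With the symbol first, they are spread across the key space.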
NEW QUESTION # 142
You are the head of BI at a large enterprise company with multiple business units that each have different priorities and budgets. You use on-demand pricing for BigQuery with a quota of 2K concurrent on-demand slots per project. Users at your organization sometimes don't get slots to execute their queries and you need to correct this. You'd like to avoid introducing new projects to your account.
What should you do?
- A. Switch to flat-rate pricing and establish a hierarchical priority model for your projects.
- B. Increase the number of concurrent slots per project on the Quotas page in the Cloud Console.
- C. Convert your batch BQ queries into interactive BQ queries.
- D. Create an additional project to overcome the 2K on-demand per-project quota.
Answer: A
Explanation:
Reference: https://cloud.google.com/blog/products/gcp/busting-12-myths-about-bigquery
NEW QUESTION # 143
You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?
- A. Train on the new data while using the existing data as your test set.
- B. Continuously retrain the model on a combination of existing data and the new data.
- C. Continuously retrain the model on just the new data.
- D. Train on the existing data while using the new data as your test set.
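Answer: B
Explanation:
Retraining on a combination of existing and new data lets the model follow shifting preferences without discarding what it learned from earlier data.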
NEW QUESTION # 144
Your team is responsible for developing and maintaining ETLs in your company. One of your Dataflow jobs is failing because of some errors in the input data, and you need to improve reliability of the pipeline (including being able to reprocess all failing data).
What should you do?
- A. Add a filtering step to skip these types of errors in the future, extract erroneous rows from logs.
- B. Add a try... catch block to your DoFn that transforms the data, write erroneous rows to PubSub directly from the DoFn.
- C. Add a try... catch block to your DoFn that transforms the data, extract erroneous rows from logs.
- D. Add a try... catch block to your DoFn that transforms the data, use a sideOutput to create a PCollection that can be stored to PubSub later.
Answer: D
Explanation:
https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow
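The pattern in option D can be sketched without the Dataflow SDK: wrap the transform in a try/except and route failures to a second output that keeps the raw input for reprocessing. The record layout and function names below are hypothetical; in a real pipeline the second list would be a side-output PCollection written to Pub/Sub.

```python
import json


def parse_record(line):
    """Transform step: parse one JSON line into a cleaned record."""
    record = json.loads(line)
    return {"user": record["user"], "amount": float(record["amount"])}


def process_with_dead_letter(lines):
    """Return (good_records, dead_letter) instead of failing on bad input."""
    good, dead_letter = [], []
    for line in lines:
        try:
            good.append(parse_record(line))
        except (ValueError, KeyError, TypeError) as err:
            # Keep the raw input so the row can be fixed and reprocessed later.
            dead_letter.append({"raw": line, "error": repr(err)})
    return good, dead_letter


lines = ['{"user": "a", "amount": "3.5"}', "not json", '{"user": "b"}']
good, dead = process_with_dead_letter(lines)
print(len(good), len(dead))  # 1 2
```

Keeping the failed rows in their own output, rather than only in logs, is what makes full reprocessing practical.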
NEW QUESTION # 145
Which of these statements about exporting data from BigQuery is false?
- A. The only supported export destination is Google Cloud Storage.
- B. To export more than 1 GB of data, you need to put a wildcard in the destination filename.
- C. Data can only be exported in JSON or Avro format.
- D. The only compression option available is GZIP.
Answer: C
Explanation:
Data can be exported in CSV, JSON, or Avro format. If you are exporting nested or repeated data, then CSV format is not supported.
Reference: https://cloud.google.com/bigquery/docs/exporting-data
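The export rules from this question can be collected into one pre-flight check. This is a plain-Python sketch, not a BigQuery client call; the function name, the format labels, and the use of 1024**3 bytes for 1 GB are assumptions made for illustration.

```python
ONE_GB = 1024 ** 3  # assumed byte count for the 1 GB limit
SUPPORTED_FORMATS = {"CSV", "JSON", "AVRO"}  # illustrative labels


def check_export(destination_uri, fmt, table_bytes, has_nested_fields):
    """Return a list of problems with a planned export, empty if none."""
    problems = []
    if not destination_uri.startswith("gs://"):
        problems.append("destination must be a Cloud Storage URI")
    if fmt not in SUPPORTED_FORMATS:
        problems.append("format must be CSV, JSON, or Avro")
    if fmt == "CSV" and has_nested_fields:
        problems.append("CSV does not support nested or repeated data")
    if table_bytes > ONE_GB and "*" not in destination_uri:
        problems.append("exports over 1 GB need a wildcard in the file name")
    return problems


print(check_export("gs://my-bucket/out-*.avro", "AVRO", 5 * ONE_GB, True))  # []
print(check_export("gs://my-bucket/out.csv", "CSV", 5 * ONE_GB, True))
```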
NEW QUESTION # 146
Your company is running their first dynamic campaign, serving different offers by analyzing real-time data during the holiday season. The data scientists are collecting terabytes of data that rapidly grows every hour during their 30-day campaign. They are using Google Cloud Dataflow to preprocess the data and collect the feature (signals) data that is needed for the machine learning model in Google Cloud Bigtable. The team is observing suboptimal performance with reads and writes of their initial load of 10 TB of data. They want to improve this performance while minimizing cost. What should they do?
- A. Redesign the schema to use a single row key to identify values that need to be updated frequently in the cluster.
- B. Redesign the schema to use row keys based on numeric IDs that increase sequentially per user viewing the offers.
- C. The performance issue should be resolved over time as the size of the Bigtable cluster is increased.
- D. Redefine the schema by evenly distributing reads and writes across the row space of the table.
Answer: D
NEW QUESTION # 147
Does Dataflow process batch data pipelines or streaming data pipelines?
- A. Only Streaming Data Pipelines
- B. None of the above
- C. Both Batch and Streaming Data Pipelines
- D. Only Batch Data Pipelines
Answer: C
Explanation:
Dataflow is a unified processing model and can execute both streaming and batch data pipelines.
NEW QUESTION # 148
Which of the following statements about Legacy SQL and Standard SQL is not true?
- A. Standard SQL is the preferred query language for BigQuery.
- B. You need to set a query language for each dataset and the default is Standard SQL.
- C. One difference between the two query languages is how you specify fully-qualified table names (i.e. table names that include their associated project name).
- D. If you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
Answer: B
Explanation:
You do not set a query language for each dataset. It is set each time you run a query and the default query language is Legacy SQL.
Standard SQL has been the preferred query language since BigQuery 2.0 was released. In legacy SQL, to query a table with a project-qualified name, you use a colon, :, as a separator. In standard SQL, you use a period, ., instead.
Due to the differences in syntax between the two query languages (such as with project-qualified table names), if you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
Reference:
https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql
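The table-name difference is easy to see in code. The sketch below rewrites a legacy-style reference of the form `[project:dataset.table]` into the backtick-quoted standard SQL form. The regular expression and function name are hypothetical, and the rewrite covers table references only, not the other syntax differences between the two dialects.

```python
import re

# Matches legacy references of the form [project:dataset.table].
LEGACY_REF = re.compile(r"\[([^\[\]:]+):([^\[\].]+)\.([^\[\]]+)\]")


def legacy_to_standard(query):
    """Rewrite [project:dataset.table] as `project.dataset.table`."""
    return LEGACY_REF.sub(r"`\1.\2.\3`", query)


legacy = "SELECT word FROM [bigquery-public-data:samples.shakespeare] LIMIT 10"
print(legacy_to_standard(legacy))
# SELECT word FROM `bigquery-public-data.samples.shakespeare` LIMIT 10
```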