Latest Professional-Data-Engineer Exam Dumps Google Exam from Training Expert ITExamSimulator [Q132-Q153]


Pass Google Google Certified Professional Data Engineer Exam PDF Dumps | Recently Updated 253 Questions


Training Courses Recommended for the Exam Preparation

Training courses help candidates learn the Google exam syllabus and prepare well. They include hands-on labs and expert support that help you gain in-depth knowledge of each domain covered in the test. Google offers several training courses for the Professional Data Engineer certification exam.

 

NEW QUESTION 132
You are designing a cloud-native historical data processing system to meet the following conditions:
* The data being analyzed is in CSV, Avro, and PDF formats and will be accessed by multiple analysis tools including Cloud Dataproc, BigQuery, and Compute Engine.
* A streaming data pipeline stores new data daily.
* Performance is not a factor in the solution.
* The solution design should maximize availability.
How should you design data storage for this solution?

  • A. Store the data in BigQuery. Access the data using the BigQuery Connector on Cloud Dataproc and Compute Engine.
  • B. Create a Cloud Dataproc cluster with high availability. Store the data in HDFS, and perform analysis as needed.
  • C. Store the data in a regional Cloud Storage bucket. Access the bucket directly using Cloud Dataproc, BigQuery, and Compute Engine.
  • D. Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Cloud Dataproc, BigQuery, and Compute Engine.

Answer: D

 

NEW QUESTION 133
You launched a new gaming app almost three years ago. You have been uploading log files from the previous day to a separate Google BigQuery table with the table name format LOGS_yyyymmdd. You have been using table wildcard functions to generate daily and monthly reports for all time ranges. Recently, you discovered that some queries that cover long date ranges are exceeding the limit of 1,000 tables and failing. How can you resolve this issue?

  • A. Enable query caching so you can cache data from previous months
  • B. Create separate views to cover each month, and query from these views
  • C. Convert the sharded tables into a single partitioned table
  • D. Convert all daily log tables into date-partitioned tables

Answer: C

 

NEW QUESTION 134
Your weather app queries a database every 15 minutes to get the current temperature. The frontend is powered by Google App Engine and serves millions of users. How should you design the frontend to respond to a database failure?

  • A. Issue a command to restart the database servers.
  • B. Retry the query every second until it comes back online to minimize staleness of data.
  • C. Reduce the query frequency to once every hour until the database comes back online.
  • D. Retry the query with exponential backoff, up to a cap of 15 minutes.

Answer: D
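
The retry strategy in option D can be sketched as follows. This is a minimal illustration, not App Engine code: the names `backoff_delays` and `query_with_backoff`, the 1-second starting delay, the doubling factor, and the use of `ConnectionError` to signal a database failure are all assumptions made for the example. The only value taken from the question is the 15-minute (900-second) cap.

```python
import random
import time


def backoff_delays(base=1.0, cap=900.0, factor=2.0):
    """Yield retry delays in seconds that grow by `factor` and never exceed `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


def query_with_backoff(run_query, max_attempts=12, sleep=time.sleep, jitter=random.random):
    """Call run_query until it succeeds, waiting longer after each failure.

    run_query is a hypothetical callable that raises ConnectionError while
    the database is unavailable.
    """
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            wait = next(delays)
            # Scale the wait by a random factor in [0.5, 1.0) so that many
            # frontend instances do not all retry at the same instant.
            sleep(wait * (0.5 + jitter() / 2))
```

Capping the delay at the normal 15-minute polling interval means the frontend never waits longer between attempts than it would between ordinary queries, while the growing delays avoid flooding a recovering database.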

 

NEW QUESTION 135
You are creating a new pipeline in Google Cloud to stream IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you notice that roughly 2% of the data appears to be corrupt.
You need to modify the Cloud Dataflow pipeline to filter out this corrupt data. What should you do?

  • A. Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data.
  • B. Add a ParDo transform in Cloud Dataflow to discard corrupt elements.
  • C. Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.
  • D. Add a SideInput that returns a Boolean if the element is corrupt.

Answer: B

 

NEW QUESTION 136
You work for a shipping company that has distribution centers where packages move on delivery lines to route them properly. The company wants to add cameras to the delivery lines to detect and track any visual damage to the packages in transit. You need to create a way to automate the detection of damaged packages and flag them for human review in real time while the packages are in transit. Which solution should you choose?

  • A. Use BigQuery machine learning to be able to train the model at scale, so you can analyze the packages in batches.
  • B. Use the Cloud Vision API to detect for damage, and raise an alert through Cloud Functions. Integrate the package tracking applications with this function.
  • C. Use TensorFlow to create a model that is trained on your corpus of images. Create a Python notebook in Cloud Datalab that uses this model so you can analyze for damaged packages.
  • D. Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications.

Answer: D

 

NEW QUESTION 137
Your startup has never implemented a formal security policy. Currently, everyone in the company has access to the datasets stored in Google BigQuery. Teams have freedom to use the service as they see fit, and they have not documented their use cases. You have been asked to secure the data warehouse. You need to discover what everyone is doing. What should you do first?

  • A. Get the identity and access management (IAM) policy of each table
  • B. Use Google Stackdriver Audit Logs to review data access.
  • C. Use Stackdriver Monitoring to see the usage of BigQuery query slots.
  • D. Use the Google Cloud Billing API to see what account the warehouse is being billed to.

Answer: B

 

NEW QUESTION 138
MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world.
The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
* Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
* Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments - development/test, staging, and production - to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
* Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
* Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
* Provide reliable and timely access to data for analysis from distributed research workers
* Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
* Ensure secure and efficient transport and storage of telemetry data.
* Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
* Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day.
* Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
You need to compose a visualization for operations teams with the following requirements:
* Telemetry must include data from all 50,000 installations for the most recent 6 weeks (sampling once every minute)
* The report must not be more than 3 hours delayed from live data.
* The actionable report should only show suboptimal links.
* Most suboptimal links should be sorted to the top.
* Suboptimal links can be grouped and filtered by regional geography.
* User response time to load the report must be <5 seconds.
You create a data source to store the last 6 weeks of data, and create visualizations that allow viewers to see multiple date ranges, distinct geographic regions, and unique installation types. You always show the latest data without any changes to your visualizations. You want to avoid creating and updating new visualizations each month. What should you do?

  • A. Look through the current data and compose a series of charts and tables, one for each possible combination of criteria.
  • B. Look through the current data and compose a small set of generalized charts and tables bound to criteria filters that allow value selection.
  • C. Export the data to a spreadsheet, compose a series of charts and tables, one for each possible combination of criteria, and spread them across multiple tabs.
  • D. Load the data into relational database tables, write a Google App Engine application that queries all rows, summarizes the data across each criteria, and then renders results using the Google Charts and visualization API.

Answer: B

 

NEW QUESTION 139
Suppose you have a table that includes a nested column called "city" inside a column called "person", but when you try to submit the following query in BigQuery, it gives you an error.
SELECT person FROM `project1.example.table1` WHERE city = "London"
How would you correct the error?

  • A. Change "person" to "city.person".
  • B. Add ", UNNEST(person)" before the WHERE clause.
  • C. Change "person" to "person.city".
  • D. Add ", UNNEST(city)" before the WHERE clause.

Answer: B

Explanation:
To access the person.city column, you need to "UNNEST(person)" and JOIN it to table1 using a comma.
Reference:
https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql#nested_repeated_results
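
To illustrate what the comma join with UNNEST does, here is a small pure-Python sketch of the same semantics. It is not BigQuery code. It assumes that "person" is a repeated record (an array of structs), which is the case where UNNEST is needed; the sample rows, the "id" and "name" fields, and the helper `unnest_person` are invented for the example.

```python
def unnest_person(rows):
    """Mimic `FROM table1, UNNEST(person) AS p`.

    Each table row is paired with every element of its own `person` array,
    so nested fields such as `city` become ordinary columns to filter on.
    """
    for row in rows:
        for p in row["person"]:
            yield row, p


# Hypothetical stand-in for `project1.example.table1`.
table1 = [
    {"id": 1, "person": [{"name": "Ada", "city": "London"},
                         {"name": "Bo", "city": "Paris"}]},
    {"id": 2, "person": [{"name": "Cy", "city": "Oslo"}]},
]

# Equivalent in spirit to: ... FROM table1, UNNEST(person) AS p WHERE p.city = "London"
londoners = [p["name"] for _row, p in unnest_person(table1) if p["city"] == "London"]
```

After flattening, the filter on city applies to one array element at a time, which is why the original query fails without the UNNEST.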

 

NEW QUESTION 140
You plan to deploy Cloud SQL using MySQL. You need to ensure high availability in the event of a zone failure.
What should you do?

  • A. Create a Cloud SQL instance in one zone, and configure an external read replica in a zone in a different region.
  • B. Create a Cloud SQL instance in one zone, and create a failover replica in another zone within the same region.
  • C. Create a Cloud SQL instance in a region, and configure automatic backup to a Cloud Storage bucket in the same region.
  • D. Create a Cloud SQL instance in one zone, and create a read replica in another zone within the same region.

Answer: B

 

NEW QUESTION 141
You are deploying a new storage system for your mobile application, which is a media streaming service.
You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity 'Movie' the property 'actors' and the property 'tags' have multiple values but the property 'date released' does not. A typical query would ask for all movies with actor=<actorname> ordered by date_released or all movies with tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the number of indexes?

  • A. Manually configure the index in your index config as follows:
  • B. Manually configure the index in your index config as follows:
  • C. Set the following in your entity options: exclude_from_indexes = 'date_published'
  • D. Set the following in your entity options: exclude_from_indexes = 'actors, tags'

Answer: B

 

NEW QUESTION 142
Which of these are examples of a value in a sparse vector? (Select 2 answers.)

  • A. [0, 1]
  • B. [1, 0, 0, 0, 0, 0, 0]
  • C. [0, 0, 0, 1, 0, 0, 1]
  • D. [0, 5, 0, 0, 0, 0]

Answer: A,B

Explanation:
Categorical features in linear models are typically translated into a sparse vector in which each possible value has a corresponding index or id. For example, if there are only three possible eye colors you can represent 'eye_color' as a length 3 vector: 'brown' would become [1, 0, 0], 'blue' would become [0, 1, 0] and 'green' would become [0, 0, 1]. These vectors are called "sparse" because they may be very long, with many zeros, when the set of possible values is very large (such as all English words).
[0, 0, 0, 1, 0, 0, 1] is not a sparse vector because it has two 1s in it. A sparse vector contains only a single 1.
[0, 5, 0, 0, 0, 0] is not a sparse vector because it has a 5 in it. Sparse vectors only contain 0s and 1s.
Reference: https://www.tensorflow.org/tutorials/linear#feature_columns_and_transformations
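
The eye-color encoding described in the explanation can be reproduced with a few lines of Python. This is only a sketch of the idea; the function name `one_hot` and the vocabulary ordering are assumptions chosen to match the example above, not part of any TensorFlow API.

```python
def one_hot(value, vocabulary):
    """Return a vector with a single 1 at the index of `value` in `vocabulary`."""
    index = vocabulary.index(value)  # raises ValueError for unknown values
    return [1 if i == index else 0 for i in range(len(vocabulary))]


# Ordering assumed so that 'brown' maps to index 0, as in the explanation.
EYE_COLORS = ["brown", "blue", "green"]

brown = one_hot("brown", EYE_COLORS)
blue = one_hot("blue", EYE_COLORS)
green = one_hot("green", EYE_COLORS)
```

With a vocabulary of thousands of values, each vector would still hold exactly one 1, which is the sense of "sparse" used in this explanation.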

 

NEW QUESTION 143
Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and minimize the management of the cluster as much as possible. They also want to be able to persist data beyond the life of the cluster. What should you do?

  • A. Create a Hadoop cluster on Google Compute Engine that uses Local SSD disks.
  • B. Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.
  • C. Create a Google Cloud Dataflow job to process the data.
  • D. Create a Google Cloud Dataproc cluster that uses persistent disks for HDFS.
  • E. Create a Hadoop cluster on Google Compute Engine that uses persistent disks.

Answer: B

 

NEW QUESTION 144
You work for a car manufacturer and have set up a data pipeline using Google Cloud Pub/Sub to capture anomalous sensor events. You are using a push subscription in Cloud Pub/Sub that calls a custom HTTPS endpoint that you have created to take action on these anomalous events as they occur. Your custom HTTPS endpoint keeps getting an inordinate amount of duplicate messages. What is the most likely cause of these duplicate messages?

  • A. Your custom endpoint has an out-of-date SSL certificate.
  • B. The message body for the sensor event is too large.
  • C. The Cloud Pub/Sub topic has too many messages published to it.
  • D. Your custom endpoint is not acknowledging messages within the acknowledgement deadline.

Answer: D

Explanation:
If a message is not acknowledged within its acknowledgement deadline, Cloud Pub/Sub sends it again, so the endpoint receives duplicates (the number of delivery retries can be configured).
https://cloud.google.com/pubsub/docs/troubleshooting#dupes

 

NEW QUESTION 145
Which of these is not a supported method of putting data into a partitioned table?

  • A. Use ORDER BY to put a table's rows into chronological order and then change the table's type to "Partitioned".
  • B. If you have existing data in a separate file for each day, then create a partitioned table and upload each file into the appropriate partition.
  • C. Run a query to get the records for a specific day from an existing table and for the destination table, specify a partitioned table ending with the day in the format "$YYYYMMDD".
  • D. Create a partitioned table and stream new records to it every day.

Answer: A

Explanation:
You cannot change an existing table into a partitioned table. You must create a partitioned table from scratch. Then you can either stream data into it every day and the data will automatically be put in the right partition, or you can load data into a specific partition by using "$YYYYMMDD" at the end of the table name.
Reference: https://cloud.google.com/bigquery/docs/partitioned-tables
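
As a small illustration of the "$YYYYMMDD" suffix mentioned above, the following sketch builds the destination name for one day's partition. The helper `partition_decorator` and the table name `mydataset.logs` are hypothetical; only the suffix format comes from the explanation.

```python
from datetime import date


def partition_decorator(table, day):
    """Append the "$YYYYMMDD" partition suffix for `day` to a table name."""
    return f"{table}${day:%Y%m%d}"


# Hypothetical table name; the result would be used as the load or query destination.
target = partition_decorator("mydataset.logs", date(2017, 3, 5))
```

Formatting the date with zero-padded month and day keeps the suffix at a fixed eight digits, which is what the decorator format requires.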

 

NEW QUESTION 146
Google Cloud Bigtable indexes a single value in each row. This value is called the _______.

  • A. master key
  • B. unique key
  • C. primary key
  • D. row key

Answer: D

Explanation:
Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, allowing you to store terabytes or even petabytes of data. A single value in each row is indexed; this value is known as the row key.
Reference: https://cloud.google.com/bigtable/docs/overview

 

NEW QUESTION 147
Your company is using wildcard tables to query data across multiple tables with similar names. The SQL statement is currently failing with the following error:
# Syntax error : Expected end of statement but got "-" at [4:11]
SELECT age FROM bigquery-public-data.noaa_gsod.gsod WHERE age != 99 AND _TABLE_SUFFIX = '1929' ORDER BY age DESC
Which table name will make the SQL statement work correctly?

  • A. `bigquery-public-data.noaa_gsod.gsod`
  • B. `bigquery-public-data.noaa_gsod.gsod'*
  • C. `bigquery-public-data.noaa_gsod.gsod*`
  • D. bigquery-public-data.noaa_gsod.gsod*

Answer: C

 

NEW QUESTION 148
You want to migrate an on-premises Hadoop system to Cloud Dataproc. Hive is the primary tool in use, and the data format is Optimized Row Columnar (ORC). All ORC files have been successfully copied to a Cloud Storage bucket. You need to replicate some data to the cluster's local Hadoop Distributed File System (HDFS) to maximize performance. What are two ways to start using Hive in Cloud Dataproc? (Choose two.)

  • A. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to HDFS. Mount the Hive tables locally.
  • B. Load the ORC files into BigQuery. Leverage BigQuery connector for Hadoop to mount the BigQuery tables as external Hive tables. Replicate external Hive tables to the native ones.
  • C. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to the master node of the Dataproc cluster. Then run the Hadoop utility to copy them to HDFS. Mount the Hive tables from HDFS.
  • D. Leverage Cloud Storage connector for Hadoop to mount the ORC files as external Hive tables. Replicate external Hive tables to the native ones.
  • E. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to any node of the Dataproc cluster. Mount the Hive tables locally.

Answer: C,E

Explanation:
HDFS data is stored on the data nodes, so files placed on the master node must be copied into HDFS.

 

NEW QUESTION 149
Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow. Numerous data logs are being generated during this step, and the team wants to analyze them. Due to the dynamic nature of the campaign, the data is growing exponentially every hour. The data scientists have written the following code to read the data for new key features in the logs.
BigQueryIO.Read
.named("ReadLogData")
.from("clouddataflow-readonly:samples.log_data")
You want to improve the performance of this data read. What should you do?

  • A. Use .fromQuery operation to read specific fields from the table.
  • B. Call a transform that returns TableRow objects, where each element in the PCollection represents a single row in the table.
  • C. Use of both the Google BigQuery TableSchema and TableFieldSchema classes.
  • D. Specify the TableReference object in the code.

Answer: A

 

NEW QUESTION 150
Which of the following IAM roles does your Compute Engine account require to be able to run pipeline jobs?

  • A. dataflow.worker
  • B. dataflow.viewer
  • C. dataflow.compute
  • D. dataflow.developer

Answer: A

Explanation:
The dataflow.worker role provides the permissions necessary for a Compute Engine service account to execute work units for a Dataflow pipeline.
Reference: https://cloud.google.com/dataflow/access-control

 

NEW QUESTION 151
Which of the following statements about the Wide & Deep Learning model are true? (Select 2 answers.)

  • A. A good use for the wide and deep model is a small-scale linear regression problem.
  • B. The wide model is used for memorization, while the deep model is used for generalization.
  • C. The wide model is used for generalization, while the deep model is used for memorization.
  • D. A good use for the wide and deep model is a recommender system.

Answer: B,D

Explanation:
Can we teach computers to learn like humans do, by combining the power of memorization and generalization? It's not an easy question to answer, but by jointly training a wide linear model (for memorization) alongside a deep neural network (for generalization), one can combine the strengths of both to bring us one step closer. At Google, we call it Wide & Deep Learning. It's useful for generic large-scale regression and classification problems with sparse inputs (categorical features with a large number of possible feature values), such as recommender systems, search, and ranking problems.
Reference: https://research.googleblog.com/2016/06/wide-deep-learning-better-together-with.html

 

NEW QUESTION 152
When running a pipeline that has a BigQuery source, on your local machine, you continue to get permission denied errors. What could be the reason for that?

  • A. Your gcloud does not have access to the BigQuery resources
  • B. You are missing gcloud on your machine
  • C. Pipelines cannot be run locally
  • D. BigQuery cannot be accessed from local machines

Answer: A

Explanation:
When reading from a Dataflow source or writing to a Dataflow sink using DirectPipelineRunner, the Cloud Platform account that you configured with the gcloud executable will need access to the corresponding source/sink.
Reference: https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/runners/DirectPipelineRunner

 

NEW QUESTION 153
......
