Associate-Developer-Apache-Spark PDF Exam Material 2023 Realistic Associate-Developer-Apache-Spark Dumps Questions [Q103-Q125]

Updated Databricks Associate-Developer-Apache-Spark Dumps – PDF & Online Engine

NEW QUESTION # 103
Which of the following describes properties of a shuffle?

  • A. Operations involving shuffles are never evaluated lazily.
  • B. Shuffles involve only single partitions.
  • C. A shuffle is one of many actions in Spark.
  • D. Shuffles belong to a class known as "full transformations".
  • E. In a shuffle, Spark writes data to disk.

Answer: E

Explanation:
In a shuffle, Spark writes data to disk.
Correct! During a shuffle, Spark writes intermediate results to disk at the stage boundary.
A shuffle is one of many actions in Spark.
Incorrect. A shuffle is not an action. It is the redistribution of data that wide transformations trigger.
Shuffles involve only single partitions.
No, shuffles involve multiple partitions. During a shuffle, Spark generates output partitions from multiple input partitions.
Operations involving shuffles are never evaluated lazily.
Wrong. A shuffle is a costly operation, and Spark evaluates it as lazily as other transformations, that is, only once a subsequent action triggers its evaluation.
Shuffles belong to a class known as "full transformations".
Not quite. Operations that cause shuffles belong to a class known as "wide transformations". "Full transformation" is not a relevant term in Spark.
More info: Spark - The Definitive Guide, Chapter 2 and Spark: disk I/O on stage boundaries explanation - Stack Overflow


NEW QUESTION # 104
Which of the following code blocks stores DataFrame itemsDf in executor memory and, if insufficient memory is available, serializes it and saves it to disk?

  • A. itemsDf.write.option('destination', 'memory').save()
  • B. itemsDf.cache(StorageLevel.MEMORY_AND_DISK)
  • C. itemsDf.store()
  • D. itemsDf.persist(StorageLevel.MEMORY_ONLY)
  • E. itemsDf.cache()

Answer: E

Explanation:
The key to solving this question is knowing (or reading in the documentation) that, by default, cache() stores values in memory and writes any partitions for which there is insufficient memory to disk. persist() can achieve the exact same behavior, but not with the StorageLevel.MEMORY_ONLY option listed here. It is also worth noting that cache() does not take any arguments.
If you have trouble finding the storage level information in the documentation, please also see this student Q&A thread, which sheds some light on it.
Static notebook | Dynamic notebook: See test 2


NEW QUESTION # 105
Which of the following code blocks creates a new DataFrame with two columns season and wind_speed_ms where column season is of data type string and column wind_speed_ms is of data type double?

  • A. spark.DataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})
  • B. spark.createDataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})
  • C. 1. from pyspark.sql import types as T
    2. spark.createDataFrame((("summer", 4.5), ("winter", 7.5)), T.StructType([T.StructField("season", T.CharType()), T.StructField("season", T.DoubleType())]))
  • D. spark.newDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])
  • E. spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])

Answer: E

Explanation:
spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])
Correct. This command uses the SparkSession's createDataFrame method to create a new DataFrame. Notice how rows, columns, and column names are passed in here: The rows are specified as a Python list, and every entry in the list is a new row. Each row is a Python tuple (for example ("summer", 4.5)), and every entry in the tuple is the value of one column.
The column names are specified as the second argument to createDataFrame(). The documentation (link below) shows that "when schema is a list of column names, the type of each column will be inferred from data" (the first argument). Since the values 4.5 and 7.5 are both Python floats, Spark will correctly infer the double type for column wind_speed_ms. Given that all values in column "season" are strings, Spark will infer the string type for that column.
Find out more about SparkSession.createDataFrame() via the link below.
spark.newDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])
No, the SparkSession does not have a newDataFrame method.
from pyspark.sql import types as T
spark.createDataFrame((("summer", 4.5), ("winter", 7.5)), T.StructType([T.StructField("season", T.CharType()), T.StructField("season", T.DoubleType())]))
No. In the PySpark 3.1 documentation linked below, pyspark.sql.types does not list a CharType type. See the link below for the data types available in Spark.
spark.createDataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})
No, this is not correct Spark syntax. If you considered this option to be correct, you may have some experience with Python's pandas package, in which this would be correct syntax. To create a Spark DataFrame from a pandas DataFrame, you can simply use spark.createDataFrame(pandasDf), where pandasDf is the pandas DataFrame.
Find out more about Spark syntax options using the examples in the documentation for SparkSession.createDataFrame linked below.
spark.DataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})
No, the SparkSession (indicated by spark in the code above) does not have a DataFrame method.
More info: pyspark.sql.SparkSession.createDataFrame - PySpark 3.1.1 documentation and Data Types - Spark 3.1.2 Documentation
Static notebook | Dynamic notebook: See test 1


NEW QUESTION # 106
Which of the following are valid execution modes?

  • A. Client, Cluster, Local
  • B. Standalone, Client, Cluster
  • C. Kubernetes, Local, Client
  • D. Server, Standalone, Client
  • E. Cluster, Server, Local

Answer: A

Explanation:
This is a tricky question to get right, since it is easy to confuse execution modes and deployment modes. Even in the literature, both terms are sometimes used interchangeably.
There are only 3 valid execution modes in Spark: client, cluster, and local. Execution modes do not refer to specific frameworks, but to where the pieces of Spark infrastructure are located with respect to each other.
In client mode, the driver sits on a machine outside the cluster. In cluster mode, the driver sits on a machine inside the cluster. Finally, in local mode, all Spark infrastructure is started in a single JVM (Java Virtual Machine) on a single computer, which then also includes the driver.
Deployment modes often refer to ways that Spark can be deployed in cluster mode and how it uses specific frameworks outside Spark. Valid deployment modes are standalone, Apache YARN, Apache Mesos and Kubernetes.
Client, Cluster, Local
Correct, all of these are the valid execution modes in Spark.
Standalone, Client, Cluster
No, standalone is not a valid execution mode. It is a valid deployment mode, though.
Kubernetes, Local, Client
No, Kubernetes is a deployment mode, but not an execution mode.
Cluster, Server, Local
No, Server is not an execution mode.
Server, Standalone, Client
No, standalone and server are not execution modes.
More info: Apache Spark Internals - Learning Journal


NEW QUESTION # 107
Which of the following code blocks shows the structure of a DataFrame in a tree-like way, containing both column names and types?

  • A. itemsDf.print.schema()
  • B. 1.print(itemsDf.columns)
    2.print(itemsDf.dtypes)
  • C. itemsDf.rdd.printSchema()
  • D. itemsDf.printSchema()
  • E. spark.schema(itemsDf)

Answer: D

Explanation:
itemsDf.printSchema()
Correct! Here is an example of what itemsDf.printSchema() shows. You can see the tree-like structure containing both column names and types:
root
 |-- itemId: integer (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- supplier: string (nullable = true)
itemsDf.rdd.printSchema()
No, the DataFrame's underlying RDD does not have a printSchema() method.
spark.schema(itemsDf)
Incorrect, there is no spark.schema command.
print(itemsDf.columns)
print(itemsDf.dtypes)
Wrong. While the output of this code block contains both column names and column types, the information is not arranged in a tree-like way.
itemsDf.print.schema()
No, DataFrame does not have a print method.
Static notebook | Dynamic notebook: See test 3


NEW QUESTION # 108
Which of the following code blocks silently writes DataFrame itemsDf in avro format to location fileLocation if a file does not yet exist at that location?

  • A. itemsDf.write.format("avro").mode("ignore").save(fileLocation)
  • B. itemsDf.write.avro(fileLocation)
  • C. itemsDf.save.format("avro").mode("ignore").write(fileLocation)
  • D. spark.DataFrameWriter(itemsDf).format("avro").write(fileLocation)
  • E. itemsDf.write.format("avro").mode("errorifexists").save(fileLocation)

Answer: A

Explanation:
The trick in this question is knowing the "modes" of the DataFrameWriter. Mode ignore silently skips the write if a file already exists: it neither replaces that file nor throws an error. Mode errorifexists will throw an error, and is the default mode of the DataFrameWriter. The question explicitly calls for the DataFrame to be "silently" written if a file does not yet exist, so you need to specify mode("ignore") here to avoid having Spark raise an error if the file already exists.
The "overwrite" mode would not be right here, since, although it would be silent, it would overwrite the already-existing file. This is not what the question asks for.
It is worth noting that the option starting with spark.DataFrameWriter(itemsDf) cannot work, since spark references the SparkSession object, but that object does not provide the DataFrameWriter.
As you can see in the documentation (below), DataFrameWriter is part of PySpark's SQL API, but not of its SparkSession API.
More info:
DataFrameWriter: pyspark.sql.DataFrameWriter.save - PySpark 3.1.1 documentation
SparkSession API: Spark SQL - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
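The mode semantics described above can be mimicked in a few lines of plain Python. This is only an illustrative sketch, not Spark code: save_with_mode is a hypothetical helper that writes a text file and reproduces how the errorifexists, ignore, and overwrite modes react to a target that already exists.

```python
import os
import tempfile

VALID_MODES = {"errorifexists", "ignore", "overwrite"}

def save_with_mode(path, data, mode="errorifexists"):
    """Hypothetical stand-in for a writer's save(); returns True if data was written."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mode: {mode}")
    if os.path.exists(path):
        if mode == "errorifexists":
            raise FileExistsError(f"{path} already exists")
        if mode == "ignore":
            return False  # silently skip: no write, no error
    with open(path, "w") as f:  # new file, or "overwrite" on an existing one
        f.write(data)
    return True

target = os.path.join(tempfile.mkdtemp(), "items.txt")
first_write = save_with_mode(target, "v1", mode="ignore")   # nothing there yet, so it writes
second_write = save_with_mode(target, "v2", mode="ignore")  # file exists, so it is skipped
with open(target) as f:
    contents = f.read()
```

After both calls the file still holds the first version, which is the "silent" behavior the question asks for.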


NEW QUESTION # 109
Which of the following code blocks returns a new DataFrame with only columns predError and value of every second row of DataFrame transactionsDf?
Entire DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

  • A. transactionsDf.filter("transactionId" % 2 == 0).select("predError", "value")
  • B. transactionsDf.select(col("transactionId").isin([3,4,6]), "predError", "value")
  • C. 1.transactionsDf.createOrReplaceTempView("transactionsDf")
    2.spark.sql("FROM transactionsDf SELECT predError, value WHERE transactionId % 2 = 2")
  • D. transactionsDf.filter(col("transactionId") % 2 == 0).select("predError", "value")
  • E. transactionsDf.filter(col("transactionId").isin([3,4,6])).select([predError, value])
  • F. transactionsDf.filter(col(transactionId).isin([3,4,6]))

Answer: D

Explanation:
Output of correct code block:
+---------+-----+
|predError|value|
+---------+-----+
|        6|    7|
|     null| null|
|        3|    2|
+---------+-----+
This is not an easy question to solve. You need to know that % stands for the modulo operator in Python. col("transactionId") % 2 == 0 is true for every row with an even transactionId, that is, for every second row. The statement using spark.sql gets it almost right (the modulo operator exists in SQL as well), but % 2 = 2 will never yield true, since a value modulo 2 is either 0 or 1.
The other answers are wrong since they are missing quotes around the column names and/or use filter or select incorrectly.
If you have any doubts about SparkSQL and answer options 3 and 4 in this question, check out the notebook I created as a response to a related student question.
Static notebook | Dynamic notebook: See test 1
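The modulo reasoning can be checked without Spark. The following plain-Python sketch uses the transactionId, predError, and value columns from the table in the question (None stands in for null) and applies both conditions. It mirrors the logic only, not the DataFrame API.

```python
# (transactionId, predError, value) taken from the table in the question
rows = [
    (1, 3, 4),
    (2, 6, 7),
    (3, 3, None),
    (4, None, None),
    (5, None, None),
    (6, 3, 2),
]

# Mirrors filter(col("transactionId") % 2 == 0).select("predError", "value")
every_second = [(pred_error, value) for tid, pred_error, value in rows if tid % 2 == 0]

# Mirrors WHERE transactionId % 2 = 2, a condition that can never be true
never_matches = [(pred_error, value) for tid, pred_error, value in rows if tid % 2 == 2]

print(every_second)   # [(6, 7), (None, None), (3, 2)]
print(never_matches)  # []
```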


NEW QUESTION # 110
The code block displayed below contains an error. The code block should read the csv file located at path data/transactions.csv into DataFrame transactionsDf, using the first row as column header and casting the columns to the most appropriate type. Find the error.
First 3 rows of transactions.csv:
transactionId;storeId;productId;name
1;23;12;green grass
2;35;31;yellow sun
3;23;12;green grass
Code block:
transactionsDf = spark.read.load("data/transactions.csv", sep=";", format="csv", header=True)

  • A. Spark is unable to understand the file type.
  • B. The code block is unable to capture all columns.
  • C. The DataFrameReader is not accessed correctly.
  • D. The resulting DataFrame will not have the appropriate schema.
  • E. The transaction is evaluated lazily, so no file will be read.

Answer: D

Explanation:
Correct code block:
transactionsDf = spark.read.load("data/transactions.csv", sep=";", format="csv", header=True, inferSchema=True)
By default, Spark does not infer the schema of the CSV (since this usually takes some time) and reads every column as a string. So, you need to add the inferSchema=True option to the code block.
More info: pyspark.sql.DataFrameReader.csv - PySpark 3.1.2 documentation
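As a loose analogy, Python's own csv module behaves like a reader without schema inference: every field comes back as a string until something converts it. The sketch below parses the sample rows from the question. The infer helper is hypothetical and far simpler than Spark's inference, so treat it as an illustration of the idea only.

```python
import csv
import io

raw = (
    "transactionId;storeId;productId;name\n"
    "1;23;12;green grass\n"
    "2;35;31;yellow sun\n"
    "3;23;12;green grass\n"
)

# Parse with ";" as separator and the first row as header
rows = list(csv.DictReader(io.StringIO(raw), delimiter=";"))
first = rows[0]

# Without any inference step, every value is a string
types_without_inference = {name: type(value).__name__ for name, value in first.items()}

def infer(value):
    """Hypothetical, minimal inference: try int, otherwise keep the string."""
    try:
        return int(value)
    except ValueError:
        return value

inferred = {name: infer(value) for name, value in first.items()}
types_with_inference = {name: type(value).__name__ for name, value in inferred.items()}
```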


NEW QUESTION # 111
Which of the following code blocks creates a new DataFrame with 3 columns, productId, highest, and lowest, that shows the biggest and smallest values of column value per value in column productId from DataFrame transactionsDf?
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

  • A. transactionsDf.groupby(col(productId)).agg(max(col(value)).alias("highest"), min(col(value)).alias("lowest"))
  • B. transactionsDf.max('value').min('value')
  • C. transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))
  • D. transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))
  • E. transactionsDf.groupby("productId").agg({"highest": max("value"), "lowest": min("value")})

Answer: D

Explanation:
transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))
Correct. groupby and aggregate is a common pattern to investigate aggregated values of groups.
transactionsDf.groupby("productId").agg({"highest": max("value"), "lowest": min("value")})
Wrong. While DataFrame.agg() accepts dictionaries, the syntax of the dictionary in this code block is wrong. If you use a dictionary, the syntax should be like {"value": "max"}, so using the column name as the key and the aggregating function as value.
transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))
Incorrect. While this is valid Spark syntax, it does not achieve what the question asks for. The question specifically asks for values to be aggregated per value in column productId - this column is not considered here. Instead, the max() and min() values are calculated as if the entire DataFrame was a group.
transactionsDf.max('value').min('value')
Wrong. There is no DataFrame.max() method in Spark, so this command will fail.
transactionsDf.groupby(col(productId)).agg(max(col(value)).alias("highest"), min(col(value)).alias("lowest"))
No. While this may work if the column names are expressed as strings, this will not work as is. Python will interpret the column names as variables and, as a result, PySpark will not understand which columns you want to aggregate.
More info: pyspark.sql.DataFrame.agg - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
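To see what the grouped aggregation should produce for the sample rows, here is a plain-Python sketch of the same group-then-aggregate logic. It is not PySpark: None stands in for null, and the helper agg_ignoring_nulls is a hypothetical name that skips nulls the way Spark's max and min aggregates do.

```python
from collections import defaultdict

# (productId, value) pairs from the sample of transactionsDf; None stands in for null
pairs = [(1, 4), (2, 7), (3, None), (2, None), (2, None), (2, 2)]

# Step 1: group the values by productId
groups = defaultdict(list)
for product_id, value in pairs:
    groups[product_id].append(value)

def agg_ignoring_nulls(func, values):
    """Apply func to the non-null values; an all-null group yields None."""
    non_null = [v for v in values if v is not None]
    return func(non_null) if non_null else None

# Step 2: one (highest, lowest) pair per productId
result = {
    product_id: (agg_ignoring_nulls(max, values), agg_ignoring_nulls(min, values))
    for product_id, values in sorted(groups.items())
}

print(result)  # {1: (4, 4), 2: (7, 2), 3: (None, None)}
```

Dropping the grouping step and aggregating over all values at once would return a single pair, which is why the option without groupby does not answer the question.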


NEW QUESTION # 112
The code block shown below should return a single-column DataFrame with a column named consonant_ct that, for each row, shows the number of consonants in column itemName of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.
DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
Code block:
itemsDf.select(__1__(__2__(__3__(__4__), "a|e|i|o|u|\s", "")).__5__("consonant_ct"))

  • A. 1. length
    2. regexp_replace
    3. lower
    4. col("itemName")
    5. alias
  • B. 1. size
    2. regexp_replace
    3. lower
    4. "itemName"
    5. alias
  • C. 1. length
    2. regexp_extract
    3. upper
    4. col("itemName")
    5. as
  • D. 1. size
    2. regexp_extract
    3. lower
    4. col("itemName")
    5. alias
  • E. 1. lower
    2. regexp_replace
    3. length
    4. "itemName"
    5. alias

Answer: A

Explanation:
Correct code block:
itemsDf.select(length(regexp_replace(lower(col("itemName")), "a|e|i|o|u|\s", "")).alias("consonant_ct"))
Returned DataFrame:
+------------+
|consonant_ct|
+------------+
|          19|
|          16|
|          10|
+------------+
This question tries to make you think about the string functions Spark provides and in which order they should be applied. Arguably the most difficult part, the regular expression "a|e|i|o|u|\s", is not a numbered blank. However, if you are not familiar with the string functions, it may be a good idea to review those before the exam.
The size operator and the length operator can easily be confused. size works on arrays, while length works on strings. Luckily, this is something you can read up on in the documentation.
The code block works by first converting all uppercase letters in column itemName into lowercase (the lower() part). Then, it replaces all vowels and whitespace characters with "nothing", an empty string "" (the regexp_replace() part). Now, only lowercase consonants, without any spaces, remain. Then, per row, the length operator counts these remaining characters. Note that column itemName in itemsDf does not include any numbers or other characters, so we do not need to make any provisions for these. Finally, by using the alias() operator, we rename the resulting column to consonant_ct.
More info:
- lower: pyspark.sql.functions.lower - PySpark 3.1.2 documentation
- regexp_replace: pyspark.sql.functions.regexp_replace - PySpark 3.1.2 documentation
- length: pyspark.sql.functions.length - PySpark 3.1.2 documentation
- alias: pyspark.sql.Column.alias - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
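The same lower, replace, length chain can be reproduced with Python's re module to confirm the counts in the returned DataFrame. This is a plain-Python analogue, not the PySpark call. The pattern is written as a raw string so that \s reaches the regex engine unchanged.

```python
import re

item_names = [
    "Thick Coat for Walking in the Snow",
    "Elegant Outdoors Summer Dress",
    "Outdoors Backpack",
]

def consonant_count(name):
    # lower() -> strip vowels and whitespace -> count what is left
    return len(re.sub(r"a|e|i|o|u|\s", "", name.lower()))

counts = [consonant_count(name) for name in item_names]
print(counts)  # [19, 16, 10]
```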


NEW QUESTION # 113
The code block displayed below contains an error. The code block should save DataFrame transactionsDf at path path as a parquet file, appending to any existing parquet file. Find the error.
Code block:
transactionsDf.format("parquet").option("mode", "append").save(path)

  • A. Given that the DataFrame should be saved as parquet file, path is being passed to the wrong method.
  • B. The mode option should be omitted so that the command uses the default mode.
  • C. The code block is missing a reference to the DataFrameWriter.
  • D. The code block is missing a bucketBy command that takes care of partitions.
  • E. save() is evaluated lazily and needs to be followed by an action.

Answer: C

Explanation:
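Correct code block:
transactionsDf.write.format("parquet").option("mode", "append").save(path)
A DataFrame has no format() method. The DataFrameWriter is accessed through transactionsDf.write. Note that DataFrameWriter also offers a dedicated mode("append") method for setting the save mode.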


NEW QUESTION # 114
Which of the following code blocks returns a DataFrame with an added column to DataFrame transactionsDf that shows the unix epoch timestamps in column transactionDate as strings in the format month/day/year in column transactionDateFormatted?
Excerpt of DataFrame transactionsDf:

  • A. transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))
  • B. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))
  • C. transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")
  • D. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))
  • E. transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))

Answer: E

Explanation:
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))
Correct. This code block adds a new column with the name transactionDateFormatted to DataFrame transactionsDf, using Spark's from_unixtime method to transform values in column transactionDate into strings, following the format requested in the question.
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))
No. Although almost correct, this uses the wrong format for the timestamp to date conversion: day/month/year instead of month/day/year.
transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))
Incorrect. This answer uses wrong syntax. The command DataFrame.withColumnRenamed() is for renaming an existing column and only takes two string parameters, specifying the old and the new name of the column.
transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")
Wrong. Although this answer looks very tempting, it is actually incorrect Spark syntax. In Spark, there is no method DataFrame.apply(). Spark has an apply() method that can be used on grouped data - but this is irrelevant for this question, since we do not deal with grouped data here.
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))
No. Although this is valid Spark syntax, the strings in column transactionDateFormatted would look like this: 2020-04-26 15:35:32, the default format specified in Spark for from_unixtime and not what is asked for in the question.
More info: pyspark.sql.functions.from_unixtime - PySpark 3.1.1 documentation and pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
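The difference between the three formats is easy to see with Python's datetime module. This sketch is an analogue rather than Spark code: it uses strftime directives in place of Spark's MM/dd/yyyy-style patterns and assumes UTC, whereas Spark would apply the session time zone. The example moment is the one quoted in the explanation above.

```python
from datetime import datetime, timezone

# Build a unix epoch timestamp for the moment quoted above (UTC is an assumption)
moment = datetime(2020, 4, 26, 15, 35, 32, tzinfo=timezone.utc)
epoch_seconds = int(moment.timestamp())

# Convert the epoch seconds back and render them in three formats
parsed = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
month_day_year = parsed.strftime("%m/%d/%Y")         # counterpart of MM/dd/yyyy
day_month_year = parsed.strftime("%d/%m/%Y")         # counterpart of dd/MM/yyyy
default_like = parsed.strftime("%Y-%m-%d %H:%M:%S")  # counterpart of the default pattern

print(month_day_year)  # 04/26/2020
print(day_month_year)  # 26/04/2020
print(default_like)    # 2020-04-26 15:35:32
```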


NEW QUESTION # 115
Which of the following code blocks returns a new DataFrame with the same columns as DataFrame transactionsDf, except for columns predError and value which should be removed?

  • A. transactionsDf.drop(["predError", "value"])
  • B. transactionsDf.drop(predError, value)
  • C. transactionsDf.drop(col("predError"), col("value"))
  • D. transactionsDf.drop("predError & value")
  • E. transactionsDf.drop("predError", "value")

Answer: E

Explanation:
More info: pyspark.sql.DataFrame.drop - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2


NEW QUESTION # 116
Which of the following code blocks performs a join in which the small DataFrame transactionsDf is sent to all executors where it is joined with DataFrame itemsDf on columns storeId and itemId, respectively?

  • A. itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "right_outer")
  • B. itemsDf.merge(transactionsDf, "itemsDf.itemId == transactionsDf.storeId", "broadcast")
  • C. itemsDf.join(transactionsDf, broadcast(itemsDf.itemId == transactionsDf.storeId))
  • D. itemsDf.join(broadcast(transactionsDf), itemsDf.itemId == transactionsDf.storeId)
  • E. itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.storeId, "broadcast")

Answer: D

Explanation:
The issue with all answers that have "broadcast" as very last argument is that "broadcast" is not a valid join type. While the entry with "right_outer" is a valid statement, it is not a broadcast join. The item where broadcast() is wrapped around the equality condition is not valid code in Spark. broadcast() needs to be wrapped around the name of the small DataFrame that should be broadcast.
More info: Learning Spark, 2nd Edition, Chapter 7
Static notebook | Dynamic notebook: See test 1


NEW QUESTION # 117
Which of the following DataFrame operators is never classified as a wide transformation?

  • A. DataFrame.select()
  • B. DataFrame.aggregate()
  • C. DataFrame.sort()
  • D. DataFrame.join()
  • E. DataFrame.repartition()

Answer: A

Explanation:
As a general rule: After having gone through the practice tests you probably have a good feeling for what classifies as a wide and what classifies as a narrow transformation. If you are unsure, feel free to play around in Spark and display the explanation of the Spark execution plan via DataFrame.[operation, for example sort()].explain(). If repartitioning is involved, it would count as a wide transformation.
DataFrame.select()
Correct! A wide transformation includes a shuffle, meaning that the rows of one input partition can end up in multiple output partitions. This is expensive and causes traffic across the cluster. With the select() operation, however, you pass commands to Spark that tell Spark to perform an operation on a specific slice of any partition. For this, Spark does not need to exchange data across partitions, so each partition can be worked on independently. Thus, you do not cause a wide transformation.
DataFrame.repartition()
Incorrect. When you repartition a DataFrame, you redefine partition boundaries. Data will flow across your cluster and end up in different partitions after the repartitioning is completed. This is known as a shuffle and, in turn, is classified as a wide transformation.
DataFrame.aggregate()
No. When you aggregate, you may compare and summarize data across partitions. In the process, data are exchanged across the cluster, and newly formed output partitions depend on multiple input partitions. This is a typical characteristic of a shuffle, meaning that the aggregate operation may classify as a wide transformation.
DataFrame.join()
Wrong. Joining multiple DataFrames usually means that large amounts of data are exchanged across the cluster, as new partitions are formed. This is a shuffle and therefore DataFrame.join() counts as a wide transformation.
DataFrame.sort()
False. When sorting, Spark needs to compare many rows across all partitions to each other. This is an expensive operation, since data is exchanged across the cluster and new partitions are formed as data is reordered. This process classifies as a shuffle and, as a result, DataFrame.sort() counts as wide transformation.
More info: Understanding Apache Spark Shuffle | Philipp Brunenberg
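The distinction can be pictured with ordinary Python lists standing in for partitions. This is a toy model, not how Spark is implemented: the narrow step transforms each partition in isolation, while the wide step routes each row to an output partition chosen by a key (here simply the value modulo the number of output partitions, an assumption made for illustration).

```python
# Two input "partitions"
input_partitions = [[1, 2, 3, 4], [5, 6, 7, 8]]

# Narrow (select-like): each output partition is computed from exactly one input partition
narrow = [[x * 10 for x in partition] for partition in input_partitions]

# Wide (repartition-like): rows are redistributed, so data crosses partition boundaries
num_output_partitions = 2
wide = [[] for _ in range(num_output_partitions)]
sources = [set() for _ in range(num_output_partitions)]  # input partitions feeding each output
for index, partition in enumerate(input_partitions):
    for x in partition:
        target = x % num_output_partitions
        wide[target].append(x)
        sources[target].add(index)

print(narrow)   # [[10, 20, 30, 40], [50, 60, 70, 80]]
print(wide)     # [[2, 4, 6, 8], [1, 3, 5, 7]]
print(sources)  # [{0, 1}, {0, 1}]
```

Every output partition of the wide step draws on both input partitions, which is the dependency pattern that forces a shuffle.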


NEW QUESTION # 118
In which order should the code blocks shown below be run in order to read a JSON file from location jsonPath into a DataFrame and return only the rows that do not have value 3 in column productId?
1. importedDf.createOrReplaceTempView("importedDf")
2. spark.sql("SELECT * FROM importedDf WHERE productId != 3")
3. spark.sql("FILTER * FROM importedDf WHERE productId != 3")
4. importedDf = spark.read.option("format", "json").path(jsonPath)
5. importedDf = spark.read.json(jsonPath)

  • A. 5, 1, 3
  • B. 5, 1, 2
  • C. 5, 2
  • D. 4, 1, 3
  • E. 4, 1, 2

Answer: B

Explanation:
Correct code block:
importedDf = spark.read.json(jsonPath)
importedDf.createOrReplaceTempView("importedDf")
spark.sql("SELECT * FROM importedDf WHERE productId != 3")
Option 5 is the only correct way listed of reading a JSON file in PySpark. The option("format", "json") is not the correct way to tell Spark's DataFrameReader that you want to read a JSON file. You would do this through format("json") instead. Also, you can communicate the specific path of the JSON file to the DataFrameReader using the load() method, not the path() method.
In order to use a SQL command through the SparkSession spark, you first need to create a temporary view through DataFrame.createOrReplaceTempView().
The SQL statement should start with the SELECT operator. The FILTER operator SQL provides is not the correct one to use here.
Static notebook | Dynamic notebook: See test 2
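The SQL part of this question can be tried out with Python's built-in sqlite3 module. This is a stand-in for Spark SQL, used here only because it runs anywhere: the table contents are made up for illustration, and SQLite's error type differs from Spark's, but the SELECT statement is the one from the question and the FILTER statement is rejected as a syntax error in the same spirit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical rows standing in for the imported JSON data
conn.execute("CREATE TABLE importedDf (transactionId INTEGER, productId INTEGER)")
conn.executemany("INSERT INTO importedDf VALUES (?, ?)", [(1, 1), (2, 3), (3, 2), (4, 3)])

# Statement 2 from the question, with an ORDER BY added for a deterministic result
kept = conn.execute(
    "SELECT * FROM importedDf WHERE productId != 3 ORDER BY transactionId"
).fetchall()

# Statement 3 from the question does not start with a valid SQL operator
try:
    conn.execute("FILTER * FROM importedDf WHERE productId != 3")
    filter_is_valid = True
except sqlite3.OperationalError:
    filter_is_valid = False

conn.close()
print(kept)             # [(1, 1), (3, 2)]
print(filter_is_valid)  # False
```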


NEW QUESTION # 119
Which of the following is not a feature of Adaptive Query Execution?

  • A. Replace a sort merge join with a broadcast join, where appropriate.
  • B. Split skewed partitions into smaller partitions to avoid differences in partition processing time.
  • C. Collect runtime statistics during query execution.
  • D. Reroute a query in case of an executor failure.
  • E. Coalesce partitions to accelerate data processing.

Answer: D

Explanation:
Reroute a query in case of an executor failure.
Correct. Although this feature exists in Spark, it is not a feature of Adaptive Query Execution. The cluster manager keeps track of executors and will work together with the driver to launch an executor and assign the workload of the failed executor to it (see also link below).
Replace a sort merge join with a broadcast join, where appropriate.
No, this is a feature of Adaptive Query Execution.
Coalesce partitions to accelerate data processing.
Wrong, Adaptive Query Execution does this.
Collect runtime statistics during query execution.
Incorrect, Adaptive Query Execution (AQE) collects these statistics to adjust query plans. This feedback loop is an essential part of accelerating queries via AQE.
Split skewed partitions into smaller partitions to avoid differences in partition processing time.
No, this is indeed a feature of Adaptive Query Execution. Find more information in the Databricks blog post linked below.
More info: Learning Spark, 2nd Edition, Chapter 12; On which way does RDD of spark finish fault-tolerance? - Stack Overflow; How to Speed up SQL Queries with Adaptive Query Execution


NEW QUESTION # 120
Which of the following code blocks returns all unique values of column storeId in DataFrame transactionsDf?

  • A. transactionsDf.select("storeId").distinct()
  • B. transactionsDf["storeId"].distinct()
  • C. transactionsDf.select(col("storeId").distinct())
  • D. transactionsDf.distinct("storeId")
  • E. transactionsDf.filter("storeId").distinct()

Answer: A

Explanation:
distinct() is a method of a DataFrame. Knowing this, or recognizing this from the documentation, is the key to solving this question.
More info: pyspark.sql.DataFrame.distinct - PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2


NEW QUESTION # 121
The code block displayed below contains an error. The code block should create DataFrame itemsAttributesDf which has columns itemId and attribute and lists every attribute from the attributes column in DataFrame itemsDf next to the itemId of the respective row in itemsDf. Find the error.
A sample of DataFrame itemsDf is below.

Code block:
itemsAttributesDf = itemsDf.explode("attributes").alias("attribute").select("attribute", "itemId")

  • A. Since itemId is the index, it does not need to be an argument to the select() method.
  • B. explode() is not a method of DataFrame. explode() should be used inside the select() method instead.
  • C. The explode() method expects a Column object rather than a string.
  • D. The split() method should be used inside the select() method instead of the explode() method.
  • E. The alias() method needs to be called after the select() method.

Answer: B

Explanation:
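The correct code block looks like this:
itemsAttributesDf = itemsDf.select("itemId", explode("attributes").alias("attribute"))
Then, itemsAttributesDf contains one row per attribute, each next to the itemId of the row it came from.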

explode() is not a method of DataFrame. explode() should be used inside the select() method instead.
This is correct.
The split() method should be used inside the select() method instead of the explode() method.
No, the split() method is used to split strings into parts. However, column attributes is an array of strings. In this case, the explode() method is appropriate.
Since itemId is the index, it does not need to be an argument to the select() method.
No, itemId still needs to be selected, whether it is used as an index or not.
The explode() method expects a Column object rather than a string.
No, a string works just fine here. This being said, passing a Column object such as col("attributes") or itemsDf.attributes is an equally valid alternative to passing in a string.

The alias() method needs to be called after the select() method.
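No. alias() should be chained directly onto explode() inside the select() method, so that the exploded column is named attribute.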
More info: pyspark.sql.functions.explode - PySpark 3.1.1 documentation (https://bit.ly/2QUZI1J) Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/22.html, https://bit.ly/sparkpracticeexams_import_instructions)


NEW QUESTION # 122
Which of the following code blocks returns a DataFrame with approximately 1,000 rows from the 10,000-row DataFrame itemsDf, without any duplicates, returning the same rows even if the code block is run twice?

  • A. itemsDf.sample(fraction=0.1, seed=87238)
  • B. itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)
  • C. itemsDf.sample(fraction=1000, seed=98263)
  • D. itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)
  • E. itemsDf.sample(fraction=0.1)

Answer: A

Explanation:
itemsDf.sample(fraction=0.1, seed=87238)
Correct. If itemsDf has 10,000 rows, this code block returns about 1,000 rows. The count is approximate because DataFrame.sample() is never guaranteed to return an exact number of rows. To ensure you are not returning duplicates, you should leave the withReplacement parameter at False, which is the default. Since the question specifies that the same rows should be returned even if the code block is run twice, you need to specify a seed. The number passed in the seed does not matter as long as it is an integer.
itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)
Incorrect. While this code block fulfills almost all requirements, it may return duplicates. This is because withReplacement is set to True.
Here is how to understand what replacement means: Imagine you have a bucket of 10,000 numbered balls and you need to take 1,000 balls at random from the bucket (similar to the problem in the question). If you took those balls with replacement, you would take a ball, note its number, and put it back into the bucket, so the next time you take a ball there would be a chance of taking the exact same ball again. If you took the balls without replacement, you would leave each drawn ball outside the bucket while you take the remaining balls, so no ball could be drawn twice.
itemsDf.sample(fraction=1000, seed=98263)
Wrong. The fraction parameter needs to have a value between 0 and 1. In this case, it should be 0.1, since 1,000/10,000 = 0.1.
itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)
No, DataFrame.sampleBy() is meant for stratified sampling. This means that based on the values in a column in a DataFrame, you can draw a certain fraction of rows containing those values from the DataFrame (more details linked below). In the scenario at hand, sampleBy is not the right operator to use because you do not have any information about any column that the sampling should depend on.
itemsDf.sample(fraction=0.1)
Incorrect. This code block checks all the boxes except that it does not ensure that when you run it a second time, the exact same rows will be returned. In order to achieve this, you would have to specify a seed.
More info:
- pyspark.sql.DataFrame.sample - PySpark 3.1.2 documentation
- pyspark.sql.DataFrame.sampleBy - PySpark 3.1.2 documentation
- Types of Samplings in PySpark 3. The explanations of the sampling... | by Pinar Ersoy | Towards Data Science
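
The bucket-of-balls picture from this explanation can be sketched in plain Python. This is an analogy only and not Spark's implementation; the helper name bernoulli_sample is made up for illustration. It keeps each row independently with probability fraction, which is why the result size is only approximately 1,000, why no row can appear twice, and why a fixed seed reproduces the same rows.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Hypothetical helper: keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return [row for row in rows if rng.random() < fraction]


rows = list(range(10_000))

# Without replacement: roughly 1,000 rows, never a duplicate, same rows on every run.
first_run = bernoulli_sample(rows, fraction=0.1, seed=87238)
second_run = bernoulli_sample(rows, fraction=0.1, seed=87238)
print(len(first_run), first_run == second_run)

# With replacement: every draw picks from the full bucket, so repeats can occur.
with_replacement = random.Random(23536).choices(rows, k=1_000)
print(len(set(with_replacement)) < len(with_replacement))
```

Leaving out the seed would still give about 1,000 distinct rows, but a second run would usually return a different selection.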


NEW QUESTION # 123
Which of the following code blocks produces the following output, given DataFrame transactionsDf?
Output:
root
 |-- transactionId: integer (nullable = true)
 |-- predError: integer (nullable = true)
 |-- value: integer (nullable = true)
 |-- storeId: integer (nullable = true)
 |-- productId: integer (nullable = true)
 |-- f: integer (nullable = true)
DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+

  • A. print(transactionsDf.schema)
  • B. transactionsDf.schema.print()
  • C. transactionsDf.rdd.formatSchema()
  • D. transactionsDf.rdd.printSchema()
  • E. transactionsDf.printSchema()

Answer: E

Explanation:
The output is the typical output of a DataFrame.printSchema() call. The DataFrame's RDD representation does not have a printSchema or formatSchema method (find available methods in the RDD documentation linked below). The output of print(transactionsDf.schema) is this:
StructType(List(StructField(transactionId,IntegerType,true),StructField(predError,IntegerType,true),StructField(value,IntegerType,true),StructField(storeId,IntegerType,true),StructField(productId,IntegerType,true),StructFiel...
It includes the same information as the nicely formatted original output, but is not nicely formatted itself. Lastly, the DataFrame's schema attribute does not have a print() method.
More info:
- pyspark.RDD: pyspark.RDD - PySpark 3.1.2 documentation
- DataFrame.printSchema(): pyspark.sql.DataFrame.printSchema - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2


NEW QUESTION # 124
The code block shown below should show information about the data type that column storeId of DataFrame transactionsDf contains. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
transactionsDf.__1__(__2__).__3__

  • A. 1. limit
    2. "storeId"
    3. printSchema()
  • B. 1. select
    2. "storeId"
    3. print_schema()
  • C. 1. select
    2. storeId
    3. dtypes
  • D. 1. select
    2. "storeId"
    3. printSchema()
  • E. 1. limit
    2. 1
    3. columns

Answer: D

Explanation:
Correct code block:
transactionsDf.select("storeId").printSchema()
The difficulty of this question is that it is hard to solve with the stepwise first-to-last-gap approach that has worked well for similar questions, since the answer options are so different from one another. Instead, you might want to eliminate answers by looking for patterns of frequently wrong answers.
A first pattern that you may recognize by now is wrong answers in which column names are not enclosed in quotes. For this reason, the answer that passes storeId without quotes should be eliminated.
By now, you may have understood that DataFrame.limit() is useful for returning a specified number of rows. It has nothing to do with specific columns. For this reason, the answer that resolves to limit("storeId") can be eliminated.
Given that we are interested in information about the data type, you should question whether the answer that resolves to limit(1).columns provides you with this information. While DataFrame.columns is a valid call, it will only report back column names, but not column types. So, you can eliminate this option.
The two remaining options use either the printSchema() or the print_schema() command. You may remember that DataFrame.printSchema() is the only valid command of the two. The select("storeId") part returns a DataFrame containing only the storeId column of transactionsDf - this works here, since we are only interested in that column's type anyway.
More info: pyspark.sql.DataFrame.printSchema - PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 3


NEW QUESTION # 125
......
