We are trying to use PySpark to filter out rows that contain an empty array in a field. Here is the schema of the DF:
root
 |-- created_at: timestamp (nullable = true)
 |-- screen_name: string (nullable = true)
 |-- text: string (nullable = true)
 |-- retweet_count: long (nullable = true)
 |-- favorite_count: long (nullable = true)
 |-- in_reply_to_status_id: long (nullable = true)
 |-- in_reply_to_user_id: long (nullable = true)
 |-- in_reply_to_screen_name: string (nullable = true)
 |-- user_mentions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- id_str: string (nullable = true)
 |    |    |-- indices: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- screen_name: string (nullable = true)
 |-- hashtags: array (nullable = true)
 |    |-- element: string (containsNull = true)
We are trying two approaches.

First, define a UDF that can modify the rows, like this:

empty_array_to_null = udf(lambda arr: None if len(arr) == 0 else arr, ArrayType(StructType()))

and use it to exclude those rows in:

df.select(empty_array_to_null(df.user_mentions))

The other approach is to use the following UDF:

is_empty = udf(lambda x: len(x) == 0, BooleanType())

and use it in:
df.filter(is_empty(df.user_mentions))

Both approaches throw errors. The first approach produces the following:
An error occurred while calling o3061.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1603.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1603.0 (TID 41390, 10.0.0.11): java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 0 fields are required while 5 values are provided.
at org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:136)
at org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$fromJava$1.apply(EvaluatePython.scala:122)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
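Presumably the IllegalStateException comes from the declared return type: ArrayType(StructType()) describes structs with zero fields, while each element of user_mentions carries five (id, id_str, indices, name, screen_name). A minimal sketch of the same UDF with the element schema spelled out (field types assumed from the schema above) would be:

from pyspark.sql.functions import udf
from pyspark.sql.types import (ArrayType, StructType, StructField,
                               LongType, StringType)

# Element schema assumed to mirror the user_mentions field printed above
mention_schema = StructType([
    StructField("id", LongType(), True),
    StructField("id_str", StringType(), True),
    StructField("indices", ArrayType(LongType(), True), True),
    StructField("name", StringType(), True),
    StructField("screen_name", StringType(), True),
])

# Return None for empty arrays, otherwise pass the array through unchanged
empty_array_to_null = udf(
    lambda arr: None if arr is not None and len(arr) == 0 else arr,
    ArrayType(mention_schema))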
The second approach throws the following:
Some of types cannot be determined by the first 100 rows, please try again with sampling
Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/pyspark/sql/session.py", line 57, in toDF
return sparkSession.createDataFrame(self, schema, sampleRatio)
File "/usr/hdp/current/spark2-client/python/pyspark/sql/session.py", line 522, in createDataFrame
rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
File "/usr/hdp/current/spark2-client/python/pyspark/sql/session.py", line 360, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
File "/usr/hdp/current/spark2-client/python/pyspark/sql/session.py", line 347, in _inferSchema
raise ValueError("Some of types cannot be determined by the "
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
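The traceback points at schema inference in toDF/createDataFrame rather than at the filter itself, and the message suggests sampling. A guess at a workaround, assuming the DataFrame is built from an RDD named rdd, a SparkSession named spark, and an explicit schema object:

# Let schema inference look beyond the first 100 rows
df = rdd.toDF(sampleRatio=0.2)

# Or skip inference entirely by passing an explicit schema
df = spark.createDataFrame(rdd, schema)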
UPDATE: Added sample data...

...
Answer 0 (score: 5)
One way to do it is to first get the size of the array, and then filter out the rows whose array size is 0. I found the solution here: How to convert empty arrays to nulls?
import pyspark.sql.functions as F
df = df.withColumn("size", F.size(F.col("user_mentions")))
df_filtered = df.filter(F.col("size") >= 1)
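If the extra column isn't needed, the same filter can be written directly (a one-liner sketch assuming the same df and column name):

df_filtered = df.filter(F.size(F.col("user_mentions")) > 0)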