Question

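I am trying to search the "Keywords" column of a Spark DataFrame for specific values, but the result is an empty DataFrame. My guess is that something is wrong with the data. The raw data in my column looks like this: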

[\mashed potatoes\"
[\american\"
[\runway\"

I tried replacing the "\" characters with "" and converting the string to an array. See the code below.

from pyspark.sql.functions import col, split, udf

# split a string and convert a string to an array
test_df = spark_df.withColumn(
    "Keywords",
    split(col("Keywords"), "\[\]")
)
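One thing worth checking: the pattern "\[\]" is a regex that matches the literal two-character sequence "[]", which the sample rows above do not appear to contain, so split would leave each row as a one-element array holding the untouched raw string. Below is a minimal sketch of a cleanup that strips brackets, backslashes, and double quotes before splitting on commas. The names parse_keywords and with_keyword_array are hypothetical, and the sketch assumes a full value looks something like [\"mashed potatoes\", \"american\"], which the truncated sample does not confirm.

```python
import re


def parse_keywords(raw):
    """Plain-Python version: raw string -> list of cleaned keywords."""
    if raw is None:
        return []
    # Remove '[', ']', backslashes, and double quotes, then split on commas.
    cleaned = re.sub(r'[\[\]\\"]', '', raw)
    return [part.strip() for part in cleaned.split(',') if part.strip()]


def with_keyword_array(df, column="Keywords"):
    """Same idea with built-in Spark functions (assumes a string column).

    pyspark is imported lazily so the helper above works without Spark.
    """
    from pyspark.sql import functions as F

    # Strip the same characters, trim, and split on commas into array<string>.
    cleaned = F.trim(F.regexp_replace(F.col(column), r'[\[\]\\"]', ''))
    return df.withColumn(column, F.split(cleaned, r'\s*,\s*'))


print(parse_keywords('[\\mashed potatoes\\"'))  # ['mashed potatoes']
```

With this shape, test_df = with_keyword_array(spark_df) would give an array<string> column, and printSchema() or show(truncate=False) can confirm what the elements actually contain.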

I also tried the following:

# replace the elements
replace = udf(lambda x: x.replace(u'[\]',''))
cleaned_df = spark_df.withColumn('Keywords', replace('Keywords'))

# Converting a string into an array
new_df = cleaned_df.withColumn("Keywords", split(col("Keywords"), ",").cast("array<long>"))
new_df.printSchema()

 |-- Keywords: array (nullable = true)
 |    |-- element: long (containsNull = true)

Neither approach raises an error, but when I actually filter the DataFrame, I get an empty DataFrame back:

from pyspark.sql.types import BooleanType

# Keep only rows that contain at least one valid word
valid_words = {"attraction"}

beauty_df = test_df.filter(udf(lambda kwords: len(valid_words & set(kwords)) > 0,
                               BooleanType())(test_df.Keywords))
beauty_df.show()
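A possible explanation for the empty result, sketched in plain Python: the set intersection only matches when an array element is exactly equal to "attraction". An element that still carries brackets, backslashes, or quotes will not match, and with Spark's default (non-ANSI) cast behavior, casting non-numeric strings to array<long> turns every element into null. The helper names has_valid_word and filter_valid are hypothetical, and filter_valid assumes Spark 2.4 or later and a column that is already a clean array<string>.

```python
valid_words = {"attraction"}


def has_valid_word(kwords, valid=valid_words):
    """True when at least one non-null keyword is exactly in `valid`."""
    if not kwords:
        return False
    return any(k in valid for k in kwords if k is not None)


# Element still wrapped in brackets and quotes: no exact match.
print(has_valid_word(['[\\"attraction\\"]']))      # False
# Elements turned into nulls by a cast to array<long>: no match.
print(has_valid_word([None, None]))                # False
# Cleaned string elements: match.
print(has_valid_word(["attraction", "runway"]))    # True


def filter_valid(df, column="Keywords", valid=valid_words):
    """Spark version without a UDF (pyspark imported lazily)."""
    from pyspark.sql import functions as F

    # Build a literal array of the wanted words and test for overlap.
    wanted = F.array(*[F.lit(w) for w in sorted(valid)])
    return df.filter(F.arrays_overlap(F.col(column), wanted))
```

Running test_df.show(truncate=False) before filtering should show which of these cases applies to the real data.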

Replacing specific elements in a Spark column

0 answers: