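I am trying to remove only the purely numeric words from my array of words, but the function I wrote does not work. When I try to view the data in the DataFrame, I get the error message shown below.
First, I tokenized my string column into words: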
from pyspark.ml.feature import RegexTokenizer
regexTokenizer = RegexTokenizer(
    inputCol="description",
    outputCol="words_withnumber",
    pattern="\\W"
)
data = regexTokenizer.transform(data)
Then I created a function meant to remove only the numbers:
from pyspark.sql.functions import when,udf
from pyspark.sql.types import BooleanType
def is_digit(value):
    if value:
        return value.isdigit()
    else:
        return False
is_digit_udf = udf(is_digit, BooleanType())
Calling the function:
data = data.withColumn(
    'words_withoutnumber',
    when(~is_digit_udf(data['words_withnumber']), data['words_withnumber'])
)
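Error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 14, 10.139.64.4, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
Sample DataFrame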
+-----------+-----------------------------------------------------------+
|categoryid |description |
+-----------+-----------------------------------------------------------+
| 33004|["short","sarja", "40567","detalhe","couro"] |
| 22033|["multipane","6768686868686867868888","220v","branco"] |
+-----------+-----------------------------------------------------------+
Expected result
+-----------+-----------------------------------------------------------+
|categoryid |description |
+-----------+-----------------------------------------------------------+
| 33004|["short","sarja","detalhe","couro"] |
| 22033|["multipane","220v","branco"] |
+-----------+-----------------------------------------------------------+
Answer 0 (score: 0)
With help from @pault, the solution is this:
from pyspark.sql.functions import when,udf
from pyspark.sql.types import BooleanType
def is_digit(value):
    if value:
        return value.isdigit()
    else:
        return False
is_digit_udf = udf(is_digit, BooleanType())
Calling the function:
from pyspark.sql.functions import col
from pyspark.sql.types import ArrayType, StringType
# Apply is_digit to each token inside the array and keep only the non-numeric ones
filter_length_udf = udf(lambda row: [x for x in row if not is_digit(x)], ArrayType(StringType()))
data = data.withColumn('words_clean', filter_length_udf(col('words_withnumber')))
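A likely reason the original attempt failed: the column produced by RegexTokenizer holds an array of strings, so a UDF applied to that column receives a whole Python list per row, and a list has no isdigit() method. The traceback in the question is truncated, so this is an inference rather than a confirmed diagnosis. The sketch below reproduces the difference in plain Python, without Spark. The names `drop_numeric_tokens`, `original_call_fails`, and `add_clean_column` are hypothetical, and the PySpark wrapper is only defined here, not executed.

```python
def is_digit(value):
    """Return True only for a non-empty string made up entirely of digits."""
    if value:
        return value.isdigit()
    return False


def drop_numeric_tokens(tokens):
    """Keep the tokens that are not purely numeric; a None row becomes []."""
    if tokens is None:
        return []
    return [t for t in tokens if not is_digit(t)]


def original_call_fails(tokens):
    """Mimic the question's call, which hands the whole array to is_digit."""
    try:
        is_digit(tokens)
    except AttributeError:
        return True  # a list has no .isdigit() method
    return False


def add_clean_column(df, in_col="words_withnumber", out_col="words_clean"):
    """Hypothetical wrapper (requires pyspark): apply the filter per row."""
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, StringType

    clean_udf = udf(drop_numeric_tokens, ArrayType(StringType()))
    return df.withColumn(out_col, clean_udf(col(in_col)))


row = ["short", "sarja", "40567", "detalhe", "couro"]
print(original_call_fails(row))   # True
print(drop_numeric_tokens(row))   # ['short', 'sarja', 'detalhe', 'couro']
```

The None check is an addition of this sketch; the lambda in the answer would raise on a null array.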
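Answer 1 (score: 0)
If you want to avoid udf() for performance reasons, and if commas never appear in the "description" column, the Scala solution below will work. df.withColumn() in PySpark should be similar.
Note: I also added a third record to show that the solution works when numbers appear at the start or end of the array. Try it out.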
scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]
scala> df.show(false)
+----------+-------------------------------------------------+
|categoryid|description |
+----------+-------------------------------------------------+
|33004 |[short, sarja, 40567, detalhe, couro] |
|22033 |[multipane, 6768686868686867868888, 220v, branco]|
|33033 |[0123, x220, 220v, 889] |
+----------+-------------------------------------------------+
scala> df.withColumn("newc",split(regexp_replace(regexp_replace(regexp_replace(concat_ws(",",'description),"""\b\d+\b""",""),"""^,|,$""",""),",,",","),",")).show(false)
+----------+-------------------------------------------------+------------------------------+
|categoryid|description |newc |
+----------+-------------------------------------------------+------------------------------+
|33004 |[short, sarja, 40567, detalhe, couro] |[short, sarja, detalhe, couro]|
|22033 |[multipane, 6768686868686867868888, 220v, branco]|[multipane, 220v, branco] |
|33033 |[0123, x220, 220v, 889] |[x220, 220v] |
+----------+-------------------------------------------------+------------------------------+
scala>
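To make the nested regexp_replace calls easier to follow, here is a rough step-by-step rendering in Python. `strip_numeric_tokens` mirrors the four steps (join, blank out numeric tokens, trim edge commas, collapse one doubled comma) using Python's re module, which treats these particular patterns the same way. `add_newc` is an untested PySpark translation under the assumption that the column is named `description`. `strip_numeric_tokens_robust` is a variation introduced for this sketch, not part of the answer: it collapses any run of commas, because the single `",,"` replacement leaves an empty string behind when two numeric tokens are adjacent.

```python
import re


def strip_numeric_tokens(tokens):
    """Plain-Python mirror of the Scala expression, one step per line."""
    joined = ",".join(tokens)                       # concat_ws(",", description)
    no_digits = re.sub(r"\b\d+\b", "", joined)      # blank out all-digit tokens
    trimmed = re.sub(r"^,|,$", "", no_digits)       # drop one leading/trailing comma
    collapsed = re.sub(r",,", ",", trimmed)         # merge doubled commas once
    return collapsed.split(",")


def strip_numeric_tokens_robust(tokens):
    """Variation (an assumption of this sketch): collapse comma runs first."""
    joined = ",".join(tokens)
    no_digits = re.sub(r"\b\d+\b", "", joined)
    collapsed = re.sub(r",+", ",", no_digits)       # any run of commas -> one
    trimmed = re.sub(r"^,|,$", "", collapsed)
    return trimmed.split(",") if trimmed else []


def add_newc(df):
    """Untested PySpark translation of the Scala one-liner (requires pyspark)."""
    from pyspark.sql.functions import col, concat_ws, regexp_replace, split

    joined = concat_ws(",", col("description"))
    no_digits = regexp_replace(joined, r"\b\d+\b", "")
    trimmed = regexp_replace(no_digits, r"^,|,$", "")
    collapsed = regexp_replace(trimmed, ",,", ",")
    return df.withColumn("newc", split(collapsed, ","))


print(strip_numeric_tokens(["0123", "x220", "220v", "889"]))  # ['x220', '220v']
print(strip_numeric_tokens(["a", "1", "2", "b"]))             # ['a', '', 'b']
print(strip_numeric_tokens_robust(["a", "1", "2", "b"]))      # ['a', 'b']
```

If adjacent numeric tokens can occur in your data, the comma-run variation (or a filter-based approach) is probably the safer choice.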
Spark 2.4 answer
Starting with version 2.4, Spark SQL provides the filter() higher-order function, which you can use to get the result:
scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]
scala> df.createOrReplaceTempView("tab")
scala> spark.sql(""" select categoryid, filter(description, x -> lower(x)!=upper(x)) fw from tab """).show(false)
+----------+------------------------------+
|categoryid|fw |
+----------+------------------------------+
|33004 |[short, sarja, detalhe, couro]|
|22033 |[multipane, 220v, branco] |
|33033 |[x220, 220v] |
+----------+------------------------------+
scala>
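The predicate `lower(x) != upper(x)` keeps a token only if changing case alters it, that is, if it contains at least one cased letter. The plain-Python sketch below imitates that test so the behaviour can be checked without a cluster. `has_cased_letter`, `filter_tokens`, and `add_fw` are hypothetical names; `add_fw` assumes Spark 2.4 or later and a column called `description`, and is defined but not run here. One difference worth knowing: a token with neither digits nor letters, such as a lone underscore (which a `\W` split can leave behind), is dropped by this predicate even though it is not a number.

```python
def has_cased_letter(token):
    """Imitate lower(x) != upper(x): true when case conversion changes the token."""
    return token.lower() != token.upper()


def filter_tokens(tokens):
    """Plain-Python stand-in for filter(description, x -> lower(x) != upper(x))."""
    return [t for t in tokens if has_cased_letter(t)]


def add_fw(df):
    """Hypothetical PySpark equivalent (requires pyspark with Spark >= 2.4)."""
    from pyspark.sql.functions import expr

    return df.withColumn(
        "fw", expr("filter(description, x -> lower(x) != upper(x))")
    )


print(filter_tokens(["0123", "x220", "220v", "889"]))  # ['x220', '220v']
print(filter_tokens(["foo", "_", "42"]))               # ['foo']
```

If tokens without letters must survive, a predicate that tests for digits only (for example a regular-expression match on the whole token) would be closer to the isdigit() logic in the first answer.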