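I have a column of sentences in a PySpark DataFrame, containing normalized text like this:
{product} {number} {number} was purchased on {date} and returned
Normalized text is marked with {}, for example {number} or {date}.
I need to remove all of the normalized tokens so that the sentence looks like this: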
was purchased on and returned
Any suggestions?
I started writing this, but then got stuck:
data.filter(data.sentence.contains('{'))
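Answer 0 (score: 0)
I think the simplest approach is a regular-expression replace on the column: match each {...} token, braces included, and replace it with an empty string.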
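As a quick illustration of why the pattern should be lazy, here is a minimal sketch in plain Python using the standard `re` module rather than Spark. The sample sentence is the one from the question; the variable names are only for illustration.

```python
import re

sentence = "{product} {number} {number} was purchased on {date} and returned"

# Lazy quantifier: each match stops at the first closing brace,
# so every {...} token is removed individually.
lazy = re.sub(r"\{.*?\}", "", sentence)

# Greedy quantifier: a single match runs from the first "{" to the
# last "}", which also swallows the ordinary words in between.
greedy = re.sub(r"\{.*\}", "", sentence)

print(repr(lazy))    # '   was purchased on  and returned'
print(repr(greedy))  # ' and returned'
```

Spark's `regexp_replace` uses Java regular expressions, and this particular pattern behaves the same way there.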
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [(1, '{product} {number} {number} was purchased on {date} and returned')]
df = spark.createDataFrame(data, ["ix", "string"])
# Create a new column, new_col, in which everything that matches
# the regular expression is replaced with an empty string.
df = df.withColumn('new_col', F.regexp_replace(F.col("string"), r"\{.*?\}", ""))
df.show()
Output:
+---+--------------------+--------------------+
| ix|              string|             new_col|
+---+--------------------+--------------------+
|  1|{product} {number...|   was purchased ...|
+---+--------------------+--------------------+
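Note that removing the tokens leaves the surrounding spaces behind, so new_col starts with blanks and has a double space where {date} used to be. If that matters, one possible follow-up is to collapse whitespace runs and trim the ends. The sketch below is an assumption about what the asker wants, not part of the original answer; `strip_placeholders`, `PLACEHOLDER`, and `SPACE_RUN` are hypothetical names introduced here.

```python
# Both patterns are written for Spark's Java regex engine.
PLACEHOLDER = r"\{.*?\}"  # one {...} token, matched lazily
SPACE_RUN = r"\s+"        # any run of whitespace


def strip_placeholders(col):
    """Return a Column with {...} tokens removed and spacing normalized."""
    # Imported inside the function so the pattern constants above can be
    # reused (for example in unit tests) without Spark installed.
    from pyspark.sql import functions as F

    without_tokens = F.regexp_replace(col, PLACEHOLDER, "")
    single_spaced = F.regexp_replace(without_tokens, SPACE_RUN, " ")
    return F.trim(single_spaced)


# Usage, assuming `df` is the DataFrame from the answer with a "string" column:
# df = df.withColumn("clean", strip_placeholders(df["string"]))
```

Applied to the sample row, this should produce a sentence with single spaces and no leading or trailing blanks.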