In PySpark, you can filter an array using the following code:
lines.filter(lambda line: "some" in line)
But I have read the data from a JSON file and tokenized it. It now has the following form:
df=[Row(text=u"i have some text", words=[u'I', u'have', u'some', u'text'])]
How can I filter out "some" from the words array?
Answer 0 (score: 4)
To keep only the rows whose words array contains "some", you can use array_contains, which has been available since Spark 1.5:
from pyspark.sql import Row
from pyspark.sql import functions as F
df = sqlContext.createDataFrame([Row(text=u"i have some text", words=[u'I', u'have', u'some', u'text'])])
df.withColumn("keep", F.array_contains(df.words,"some")) \
.filter(F.col("keep")==True).show()
# +----------------+--------------------+----+
# |            text|               words|keep|
# +----------------+--------------------+----+
# |i have some text|[I, have, some, t...|true|
# +----------------+--------------------+----+
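Note that this keeps or drops whole rows; it does not change the contents of the words array. The matching is exact, so "Some" or "some'" would not count as "some". A minimal plain-Python sketch of the same row-level logic, with no Spark required (the sample rows and the helper name keep_rows_containing are made up for illustration):

```python
# Hypothetical sample rows, modelled on the DataFrame above.
rows = [
    {"text": "i have some text", "words": ["I", "have", "some", "text"]},
    {"text": "no match here", "words": ["no", "match", "here"]},
    {"text": "Some at the start", "words": ["Some", "at", "the", "start"]},
]


def keep_rows_containing(rows, value):
    """Keep only the rows whose 'words' list has an element equal to value."""
    return [row for row in rows if value in row["words"]]


kept = keep_rows_containing(rows, "some")
print([row["text"] for row in kept])  # prints ['i have some text']
```

The third row is dropped because "Some" is not equal to "some"; the comparison is case-sensitive.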
If you want to remove "some" from the array itself, as I said in the comments, you can use the StopWordsRemover API:
from pyspark.ml.feature import StopWordsRemover
StopWordsRemover(inputCol="words", stopWords=["some"]).transform(df)
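To make the effect concrete, here is a rough plain-Python sketch of the word-removal logic. The function remove_words is a hypothetical helper, not part of Spark; it assumes case-insensitive matching by default, which mirrors StopWordsRemover's default of caseSensitive=False.

```python
def remove_words(words, unwanted, case_sensitive=False):
    """Return a copy of words without any token listed in unwanted."""
    if case_sensitive:
        blocked = set(unwanted)
        return [w for w in words if w not in blocked]
    # Compare lower-cased forms so that "Some" and "some" are treated alike.
    blocked = {w.lower() for w in unwanted}
    return [w for w in words if w.lower() not in blocked]


print(remove_words(["I", "have", "some", "text"], ["some"]))
# prints ['I', 'have', 'text']

# Assumption: with PySpark available, the same helper could be wrapped as a UDF:
#   from pyspark.sql import functions as F, types as T
#   drop_some = F.udf(lambda ws: remove_words(ws, ["some"]),
#                     T.ArrayType(T.StringType()))
#   df.withColumn("filtered", drop_some(df.words))
```

The built-in transformer is usually preferable to a Python UDF, since a UDF has to ship every row through the Python interpreter.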