Splitting text in a dataframe and checking whether it contains a substring

Date: 2019-05-15 17:00:01

Tags: pyspark

So, I want to check whether my text contains the word "baby", as opposed to some other word that merely contains "baby". For example, "maybaby" would not be a match. I already have a piece of code that works, but I wanted to see whether there is a better way to write it so that I don't have to go over the data twice. Here is what I have so far:

import pyspark.sql.functions as F

rows = sc.parallelize([['14-banana'], ['12-cheese'], ['13-olives'], ['11-almonds'], ['23-maybaby'], ['54-baby']])

rows_df = rows.toDF(["ID"])
split = F.split(rows_df.ID, '-')

rows_df = rows_df.withColumn('fruit', split)

+----------+-------------+
|        ID|        fruit|
+----------+-------------+
| 14-banana| [14, banana]|
| 12-cheese| [12, cheese]|
| 13-olives| [13, olives]|
|11-almonds|[11, almonds]|
|23-maybaby|[23, maybaby]|
|   54-baby|   [54, baby]|
+----------+-------------+

from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

def func(col):
  for item in col:
    if item == "baby":
      return "yes"

  return "no"

func_udf = udf(func, StringType())
df_hierachy_concept = rows_df.withColumn('new', func_udf(rows_df['fruit']))

+----------+-------------+---+
|        ID|        fruit|new|
+----------+-------------+---+
| 14-banana| [14, banana]| no|
| 12-cheese| [12, cheese]| no|
| 13-olives| [13, olives]| no|
|11-almonds|[11, almonds]| no|
|23-maybaby|[23, maybaby]| no|
|   54-baby|   [54, baby]|yes|
+----------+-------------+---+

In the end, I only want the "ID" and "new" columns.
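For reference, a minimal sketch of that final projection, assuming the df_hierachy_concept frame built above:

# Keep only the two columns of interest (assumes df_hierachy_concept from above)
result = df_hierachy_concept.select("ID", "new")
# Equivalently, drop the helper column: df_hierachy_concept.drop("fruit")
result.show()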

2 Answers:

Answer 0 (score: 2)

I will show a few ways to solve this. There are probably many other ways to achieve the same result.

See the examples below:

from pyspark.shell import sc
from pyspark.sql.functions import split, when, udf

rows = sc.parallelize(
    [
        ['14-banana'], ['12-cheese'], ['13-olives'], 
        ['11-almonds'], ['23-maybaby'], ['54-baby']
    ]
)

# Resolves with auxiliary column named "fruit"
rows_df = rows.toDF(["ID"])
rows_df = rows_df.withColumn('fruit', split(rows_df.ID, '-')[1])

rows_df = rows_df.withColumn('new', when(rows_df.fruit == 'baby', 'yes').otherwise('no'))
rows_df = rows_df.drop('fruit')
rows_df.show()

# Resolves directly without creating an auxiliary column
rows_df = rows.toDF(["ID"])
rows_df = rows_df.withColumn(
    'new',
     when(split(rows_df.ID, '-')[1] == 'baby', 'yes').otherwise('no')
)
rows_df.show()

# Resolves without a hard-coded `split()[1]` access, avoiding an out-of-index exception
rows_df = rows.toDF(["ID"])
is_new_udf = udf(lambda col: 'yes' if any(value == 'baby' for value in col) else 'no')
rows_df = rows_df.withColumn('new', is_new_udf(split(rows_df.ID, '-')))
rows_df.show()

All of them produce the same output:

+----------+---+
|        ID|new|
+----------+---+
| 14-banana| no|
| 12-cheese| no|
| 13-olives| no|
|11-almonds| no|
|23-maybaby| no|
|   54-baby|yes|
+----------+---+
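As a side note, the UDF in the last snippet could likely be replaced with the built-in array_contains, which tests for an exact element in the split array and keeps the whole expression in native Spark functions. A minimal sketch, assuming the same `rows` RDD as above:

# Sketch: built-in alternative to the UDF (assumes the `rows` RDD defined above)
from pyspark.sql.functions import array_contains, split, when

rows_df = rows.toDF(["ID"])
rows_df = rows_df.withColumn(
    'new',
    # array_contains() matches whole elements, so 'maybaby' is not a hit
    when(array_contains(split(rows_df.ID, '-'), 'baby'), 'yes').otherwise('no')
)
rows_df.show()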

Answer 1 (score: 1)

For this, I would use pyspark.sql.functions.regexp_extract. Make the column new equal to "yes" if you can extract the word "baby" with a word boundary on both sides, and to "no" otherwise.

from pyspark.sql.functions import regexp_extract, when
rows_df.withColumn(
    'new',
    when(
        regexp_extract("ID", "(?<=(\b|\-))baby(?=(\b|$))", 0) == "baby",
        "yes"
    ).otherwise("no")
).show()
#+----------+-------------+---+
#|        ID|        fruit|new|
#+----------+-------------+---+
#| 14-banana| [14, banana]| no|
#| 12-cheese| [12, cheese]| no|
#| 13-olives| [13, olives]| no|
#|11-almonds|[11, almonds]| no|
#|23-maybaby|[23, maybaby]| no|
#|   54-baby|   [54, baby]|yes|
#+----------+-------------+---+

The last argument to regexp_extract is the index of the match to extract. We pick the first one (index 0). If the pattern does not match, an empty string is returned. Finally, when() is used to check whether the extracted string equals the desired value.
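To see that empty-string behavior, a quick sketch, assuming the rows_df frame from the question:

# Sketch: inspect the raw regexp_extract output per row (assumes rows_df from the question)
from pyspark.sql.functions import regexp_extract

rows_df.select(
    "ID",
    # index 0 returns the whole match, or '' when the pattern does not match
    regexp_extract("ID", r"(?<=(\b|\-))baby(?=(\b|$))", 0).alias("extracted")
).show()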

The regex pattern means:

  • (?<=(\b|\-)): positive lookbehind for a word boundary (\b) or a literal hyphen (-).
  • baby: the literal string "baby".
  • (?=(\b|$)): positive lookahead for a word boundary or end of line ($).

This method also does not require you to split the string first, since it was unclear whether you actually need that part.
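Since the goal is just a yes/no flag, a related option (not part of the original answer) is Column.rlike, which applies the same pattern as an unanchored boolean regex test. A sketch under that assumption:

# Sketch: boolean regex test instead of extraction (assumes rows_df from the question)
from pyspark.sql.functions import col, when

rows_df.withColumn(
    'new',
    # rlike() is unanchored, and the lookarounds still reject 'maybaby'
    when(col("ID").rlike(r"(?<=(\b|\-))baby(?=(\b|$))"), "yes").otherwise("no")
).select("ID", "new").show()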