Question

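I have to filter a large file using multiple patterns. The problem is that I'm not sure what an efficient way to apply multiple patterns with rlike would be. For example: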

df = spark.createDataFrame(
    [
        ('www 17 north gate',),
        ('aaa 45 north gate',),
        ('bbb 56 west gate',),
        ('ccc 56 south gate',),
        ('Michigan gate',),
        ('Statue of Liberty',),
        ('57 adam street',),
        ('19 west main street',),
        ('street burger',)
    ],
    [ 'poi']
)

df.show()
+-------------------+
|                poi|
+-------------------+
|  www 17 north gate|
|  aaa 45 north gate|
|   bbb 56 west gate|
|  ccc 56 south gate|
|      Michigan gate|
|  Statue of Liberty|
|     57 adam street|
|19 west main street|
|      street burger|
+-------------------+

If I use the following two patterns on the data, I can do this:

pat1="(aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$"
pat2=r"[0-9]+ [a-z\s]+ street$"  # raw string so that \s is passed through to the regex engine unchanged
df.filter(~df['poi'].rlike(pat2)).filter(~df['poi'].rlike(pat1)).show()
+-----------------+
|              poi|
+-----------------+
|www 17 north gate|
|    Michigan gate|
|Statue of Liberty|
|    street burger|
+-----------------+

What if I have 40 different patterns? I think I could use a loop like this:

for pat in [pat1,pat2,....,patn]:
    df = df.filter(~df['poi'].rlike(pat))

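Is this the right way to do it? The original data is in Chinese, so please ignore whether the patterns themselves are valid. I just want to see how I can handle multiple regex patterns.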

Answer 1

Both of the approaches you suggest produce the same execution plan:

Using the two patterns in succession:

df.filter(~df['poi'].rlike(pat2)).filter(~df['poi'].rlike(pat1)).explain()
#== Physical Plan ==
#*Filter (NOT poi#297 RLIKE [0-9]+ [a-z\s]+ street$ && 
#         NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$)
#+- Scan ExistingRDD[poi#297]

Using a loop:

from functools import reduce  # reduce is not a builtin in Python 3

# this is the same as your loop
df_new = reduce(lambda df, pat: df.filter(~df['poi'].rlike(pat)), [pat1, pat2], df)
df_new.explain()
#== Physical Plan ==
#*Filter (NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$ && 
#         NOT poi#297 RLIKE [0-9]+ [a-z\s]+ street$)
#+- Scan ExistingRDD[poi#297]
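
To see why the reduce call is equivalent to the loop, here is a minimal sketch that runs without Spark. It rests on two assumptions: a plain Python list stands in for the DataFrame, and Python's re.search stands in for rlike (Spark uses Java regular expressions, which agree with Python for these simple patterns). The helper keep_unmatched is a hypothetical name introduced only for this illustration.

```python
import re
from functools import reduce

rows = ['www 17 north gate', 'aaa 45 north gate', '57 adam street', 'street burger']
pat1 = r"(aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$"
pat2 = r"[0-9]+ [a-z\s]+ street$"

def keep_unmatched(rows, pat):
    # Hypothetical stand-in for df.filter(~df['poi'].rlike(pat))
    return [r for r in rows if not re.search(pat, r)]

# Explicit loop, as in the question
looped = rows
for pat in [pat1, pat2]:
    looped = keep_unmatched(looped, pat)

# reduce threads the intermediate result through the same function
reduced = reduce(keep_unmatched, [pat1, pat2], rows)

print(looped == reduced)  # the two approaches keep the same rows
```

In both versions each pattern is applied to the result of the previous filter, which is why Spark ends up with a single Filter node containing one RLIKE per pattern.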

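Another approach is to use "|".join() to chain all the patterns together with the regex OR operator. The main difference is that this results in only one call to rlike (as opposed to one call per pattern in the other approach):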

df.filter(~df['poi'].rlike("|".join([pat1, pat2]))).explain()
#== Physical Plan ==
#*Filter NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$|[0-9]+ [a-z\s]+ street$
#+- Scan ExistingRDD[poi#297]
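
As a quick sanity check that the joined pattern excludes the same rows as the per-pattern filters, here is a small sketch on the sample data from the question. It assumes Python's re module as a stand-in for Spark's Java regex engine, so it only illustrates the logic and says nothing about Spark performance.

```python
import re

pois = [
    'www 17 north gate', 'aaa 45 north gate', 'bbb 56 west gate',
    'ccc 56 south gate', 'Michigan gate', 'Statue of Liberty',
    '57 adam street', '19 west main street', 'street burger',
]
pat1 = r"(aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$"
pat2 = r"[0-9]+ [a-z\s]+ street$"
patterns = [pat1, pat2]

# One regex search per pattern, mirroring the chained filters
sequential = [p for p in pois
              if not any(re.search(pat, p) for pat in patterns)]

# One regex search per row against the joined pattern
combined = re.compile("|".join(patterns))
joined = [p for p in pois if not combined.search(p)]

print(joined)
```

One caveat when joining many patterns this way: capture groups are renumbered in the combined expression, so a pattern that relies on a numbered backreference such as \1 may stop behaving as intended once it is no longer the first pattern in the list.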
