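I need to filter a large file using multiple patterns. The problem is that I'm not sure of an efficient way to apply multiple patterns with rlike. For example: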
df = spark.createDataFrame(
[
('www 17 north gate',),
('aaa 45 north gate',),
('bbb 56 west gate',),
('ccc 56 south gate',),
('Michigan gate',),
('Statue of Liberty',),
('57 adam street',),
('19 west main street',),
('street burger',)
],
[ 'poi']
)
df.show()
+-------------------+
|                poi|
+-------------------+
|  www 17 north gate|
|  aaa 45 north gate|
|   bbb 56 west gate|
|  ccc 56 south gate|
|      Michigan gate|
|  Statue of Liberty|
|     57 adam street|
|19 west main street|
|      street burger|
+-------------------+
With the following two patterns, I can filter the data like this:
pat1="(aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$"
pat2=r"[0-9]+ [a-z\s]+ street$"
df.filter(~df['poi'].rlike(pat2)).filter(~df['poi'].rlike(pat1)).show()
+-----------------+
|              poi|
+-----------------+
|www 17 north gate|
|    Michigan gate|
|Statue of Liberty|
|    street burger|
+-----------------+
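What if I have 40 different patterns? I suppose I could use a loop like this:
for pat in [pat1,pat2,....,patn]:
    df = df.filter(~df['poi'].rlike(pat))
Is this the right approach? The original data is in Chinese, so please ignore whether the patterns themselves are valid. I just want to know how to handle multiple regex patterns.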
Answer 0 (score: 2)
Both of the approaches you suggest produce the same execution plan:
Applying the two patterns one after the other:
df.filter(~df['poi'].rlike(pat2)).filter(~df['poi'].rlike(pat1)).explain()
#== Physical Plan ==
#*Filter (NOT poi#297 RLIKE [0-9]+ [a-z\s]+ street$ &&
# NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$)
#+- Scan ExistingRDD[poi#297]
Using a loop:
from functools import reduce  # reduce is not a builtin in Python 3
# this is the same as your loop
df_new = reduce(lambda df, pat: df.filter(~df['poi'].rlike(pat)), [pat1, pat2], df)
df_new.explain()
#== Physical Plan ==
#*Filter (NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$ &&
# NOT poi#297 RLIKE [0-9]+ [a-z\s]+ street$)
#+- Scan ExistingRDD[poi#297]
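Another option is to use "|".join() to chain all of the patterns together with the regex or operator. The main difference is that this results in a single call to rlike (as opposed to one call per pattern in the other approach):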
df.filter(~df['poi'].rlike("|".join([pat1, pat2]))).explain()
#== Physical Plan ==
#*Filter NOT poi#297 RLIKE (aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$|[0-9]+ [a-z\s]+ street$
#+- Scan ExistingRDD[poi#297]
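With 40 patterns it may be worth checking locally that the joined pattern filters the same rows as the per-pattern version before running it on the full data. The sketch below does that with Python's re module on the sample rows from the question. It makes a few assumptions: rlike is taken to match anywhere in the string, which re.search mirrors; Spark uses Java regular expressions rather than Python's, so this only holds for syntax both engines share; and combine_patterns is a hypothetical helper name, not part of PySpark. Wrapping each pattern in a non-capturing group is a defensive choice, so that joining with "|" cannot change how an individual pattern is parsed.

```python
import re

def combine_patterns(patterns):
    # Hypothetical helper (not a PySpark function): wrap each pattern in a
    # non-capturing group, then join the groups with the regex "or" operator.
    return "|".join("(?:{})".format(p) for p in patterns)

pat1 = r"(aaa|bbb|ccc) [0-9]+ (north|south|west|east) gate$"
pat2 = r"[0-9]+ [a-z\s]+ street$"
patterns = [pat1, pat2]

# Sample rows copied from the question.
rows = [
    'www 17 north gate',
    'aaa 45 north gate',
    'bbb 56 west gate',
    'ccc 56 south gate',
    'Michigan gate',
    'Statue of Liberty',
    '57 adam street',
    '19 west main street',
    'street burger',
]

combined = combine_patterns(patterns)

# One regex search per row: keep rows the combined pattern does not match.
kept_combined = [r for r in rows if not re.search(combined, r)]

# One regex search per pattern per row, like the loop / reduce version.
kept_sequential = [r for r in rows if not any(re.search(p, r) for p in patterns)]

print(kept_combined)
print(kept_combined == kept_sequential)
```

If the two lists agree, the same combined string can be passed to Spark as df.filter(~df['poi'].rlike(combined)).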