How to use word boundaries with RLIKE in PySpark SQL / DataFrames

Asked: 2018-04-15 04:50:55

Tags: apache-spark spark-dataframe pyspark-sql

I am trying to use word boundaries with RLIKE in my Spark SQL / DataFrame queries, but it does not seem to work.


What am I doing wrong? I have also tried:

from pyspark.sql.functions import *

usersDf.select('id', 'display_name', 'location') \
    .where(expr('location RLIKE "\\b(United States|America|USA|US)\\b"')) \
    .limit(20) \
    .show(20, False)

…

1 Answer:

Answer 0 (score: 0)

You are not escaping the backslashes enough. The pattern string is parsed twice, once by Python and once by Spark's SQL parser, so each regex backslash must be written as four:

df = spark.createDataFrame([" US ", "FUSS"], "string")
df.where("value RLIKE '\\\\bUS\\\\b'").show()

# +-----+
# |value|
# +-----+
# |  US |
# +-----+

df.where("value NOT RLIKE '\\\\bUS\\\\b'").show()
# +-----+
# |value|
# +-----+
# | FUSS|
# +-----+

So it should be

'location RLIKE "\\\\b(United States|America|USA|US)\\\\b"'
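The two unescaping passes can be sketched in plain Python (a simulation of the SQL string parsing step, not Spark itself; `unicode_escape` is an approximation of how the SQL parser consumes backslash escapes):

```python
import re

# What you type in the Python source:
python_literal = '\\\\bUS\\\\b'
# After Python parses it, Spark's SQL parser receives: \\bUS\\b
assert python_literal == r'\\bUS\\b'

# Simulate the SQL parser's own unescaping pass:
regex_seen = python_literal.encode('ascii').decode('unicode_escape')
assert regex_seen == r'\bUS\b'

# The regex engine finally sees \b as a word boundary:
assert re.search(regex_seen, ' US ')        # matches
assert not re.search(regex_seen, 'FUSS')    # no match
```

With only two backslashes, the SQL parser's pass turns `\b` into a literal backspace character before the regex engine ever sees it, which is why the word boundary silently disappears.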

You can see this if you check the execution plan. With your version, `\b` is consumed by the SQL parser as a backspace character, so the pattern that reaches the regex engine has no word boundaries at all:

df.where("value NOT RLIKE '\\bUS\\b'").explain()
# == Physical Plan ==
# *(1) Filter (isnotnull(value#33) && NOT value#33 RLIKEU)
# +- Scan ExistingRDD[value#33]

compared to the correct one:

df.where("value NOT RLIKE '\\\\bUS\\\\b'").explain()
# == Physical Plan ==
# *(1) Filter (isnotnull(value#33) && NOT value#33 RLIKE \bUS\b)
# +- Scan ExistingRDD[value#33]
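As an aside not in the original answer, a Python raw string expresses the same pattern with half the visual noise, since the raw literal skips Python's own unescaping pass:

```python
# r'...' disables Python's backslash processing, so two backslashes
# in the source reach Spark's SQL parser as two backslashes:
assert r'\\bUS\\b' == '\\\\bUS\\\\b'

# The corrected filter from the answer, written as a raw string:
pattern_for_sql = r'location RLIKE "\\b(United States|America|USA|US)\\b"'
assert '\\\\b' in pattern_for_sql
```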