下面是我尝试运行的pyspark代码。我无法将值替换为filter。请指教。
>>> coreWordFilter = "crawlResult.url.like('%"+IncoreWords[0]+"%')"
>>> coreWordFilter
"crawlResult.url.like('%furniture%')"
>>> preFilter = crawlResult.filter(coreWordFilter)
20/02/11 09:19:54 INFO execution.SparkSqlParser: Parsing command: crawlResult.url.like('%furniture%')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/dataframe.py", line 1078, in filter
jdf = self._jdf.filter(condition)
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/utils.py", line 73, in deco
raise ParseException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.ParseException: u"\nUnsupported function name 'crawlResult.url.like'(line 1, pos 0)\n\n== SQL ==\ncrawlResult.url.like('%furniture%')\n^^^\n"
>>> preFilter = crawlResult.filter(crawlResult.url.like('%furniture%'))
>>>
我需要一些有关如何添加更多crawlResult.url.like逻辑的帮助: 从今天2/12/2020开始的代码:
>>> coreWordFilter = crawlResult.url.like('%{}%'.format(IncoreWords[0]))
>>> coreWordFilter
Column<url LIKE %furniture%>
>>> InmoreWords
['couch', 'couches']
>>> for a in InmoreWords:
... coreWordFilter=coreWordFilter+" | crawlResult.url.like('%"+a+"%')"
>>> coreWordFilter
Column<((((((url LIKE %furniture% + | crawlResult.url.like('%) + couch) + %')) + | crawlResult.url.like('%) + couches) + %'))>
preFilter = crawlResult.filter(coreWordFilter)不适用于上述coreWordFilter。 我希望可以执行以下操作,但无法执行-出现错误:
>>> coreWordFilter2 = "crawlResult.url.like('%"+IncoreWords[0]+"%')"
>>> coreWordFilter2
"crawlResult.url.like('%furniture%')"
>>> for a in InmoreWords:
... coreWordFilter2=coreWordFilter2+" | crawlResult.url.like('%"+a+"%')"
...
>>> coreWordFilter2
"crawlResult.url.like('%furniture%') | crawlResult.url.like('%couch%') |
crawlResult.url.like('%couches%')"
>>> preFilter = crawlResult.filter(coreWordFilter2)
20/02/12 08:55:26 INFO execution.SparkSqlParser: Parsing command:
crawlResult.url.like('%furniture%') | crawlResult.url.like('%couch%') |
crawlResult.url.like('%couches%')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-
1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/dataframe.py", line
1078, in filter
jdf = self._jdf.filter(condition)
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-
1.cdh5.12.0.p0.232957/lib/spark2/python/lib/py4j-0.10.4-
src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/apps/cloudera/parcels/SPARK2-2.2.0.cloudera2-
1.cdh5.12.0.p0.232957/lib/spark2/python/pyspark/sql/utils.py", line 73, in
deco
raise ParseException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.ParseException: u"\nUnsupported function name
'crawlResult.url.like'(line 1, pos 0)\n\n== SQL
==\ncrawlResult.url.like('%furniture%') |
crawlResult.url.like('%couch%') | crawlResult.url.like('%couches%')\n^^^\n"
我认为正确的语法是:
preFilter = crawlResult.filter(crawlResult.url.like('%furniture%') | crawlResult.url.like('%couch%'))
答案 0 :(得分:0)
由于您需要动态or
条件,因此我认为基于String
运算符(AND
,OR
,NOT
等)的过滤会很容易与基于Column
的逻辑运算符(&
,|
,~
等)进行比较。
虚拟数据框和列表:
crawlResult.show()
+---+--------------+
| id| url|
+---+--------------+
| 1|test-furniture|
| 1| table|
| 1| test-test|
| 1| couch|
+---+--------------+
# IncoreWords
# ['furniture', 'office-table', 'counch', 'blah']
# InmoreWords
# ['couch', 'couches']
现在,我只是按照您的OP顺序来构建动态filter
子句,但这将带给您广泛的想法。
coreWordFilter2 = "url like ('%"+IncoreWords[0]+"%')"
# coreWordFilter2
#"url like ('%furniture%')"
for a in InmoreWords:
coreWordFilter2=coreWordFilter2+" or url like('%"+a+"%')"
# coreWordFilter2
# "url like ('%furniture%') or url like('%couch%') or url like('%couches%')"
crawlResult.filter(coreWordFilter2).show()
+---+--------------+
| id| url|
+---+--------------+
| 1|test-furniture|
| 1| couch|
+---+--------------+