How do I return only the rows of a Spark DataFrame where a column's value is in a specified list?
Here is how I do it in Python with pandas:
df_start = df[df['name'].isin(['App Opened', 'App Launched'])].copy()
I saw this SO Scala implementation and tried several permutations, but couldn't get any of them to work.
Here is one failed attempt using pyspark:
df_start = df_spark.filter(col("name") isin ['App Opened', 'App Launched'])
Output:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6660042787423349557.py", line 253, in <module>
code = compile('\n'.join(final_code), '<stdin>', 'exec', ast.PyCF_ONLY_AST, 1)
File "<stdin>", line 18
df_start = df_spark.filter(col("name") isin ['App Opened', 'App Launched'])
^
SyntaxError: invalid syntax
Another attempt:
df_start = df_spark.filter(col("name").isin(['App Opened', 'App Launched']))
Output:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6660042787423349557.py", line 267, in <module>
raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6660042787423349557.py", line 260, in <module>
exec(code)
File "<stdin>", line 18, in <module>
NameError: name 'col' is not defined
Answer 0 (score: 6)
As dmdmdmdmdmd pointed out in the comments, the second approach didn't work because col needs to be imported:
from pyspark.sql.functions import col
df_start = df_spark.filter(col("name").isin(['App Opened', 'App Launched']))
Here is another way to accomplish the same filter:
df_start = df_spark.filter(df_spark.name.isin(['App Opened', 'App Launched']))