How to filter an array using a column's value with Python and Spark DataFrames

Date: 2017-08-03 18:09:08

Tags: apache-spark dataframe pyspark spark-dataframe

I'm trying to find a way to filter on the value of one column by searching for it in another column. Given a value in one column, I want to verify whether that value is also present in an array stored in a different column.

I tried the following:

df = sc.parallelize([('v1', ['v1','v2','v3']), ('v4', ['v1','v2','v4'])]).toDF()
df.filter(pyspark.sql.functions.array_contains(df._2, df._1)).show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/pyspark/sql/functions.py", line 1648, in array_contains
    return Column(sc._jvm.functions.array_contains(_to_java_column(col), value))
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1124, in __call__
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1088, in _build_args
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1075, in _get_args
  File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 512, in convert
TypeError: 'Column' object is not callable

What I'm looking for is something like this:
df.filter(pyspark.sql.functions.array_contains(df._2, 'v4'))

but I don't want to use a static value; I want to use the value of column _1.
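
To make the intended semantics concrete, here is a hypothetical variant of the data (the 'v5' row is an assumed addition, not in the original post) showing which rows such a filter should keep:

df2 = sc.parallelize([('v1', ['v1', 'v2', 'v3']),
                      ('v4', ['v1', 'v2', 'v4']),
                      ('v5', ['v1', 'v2', 'v3'])]).toDF()
# the desired filter keeps the 'v1' and 'v4' rows (the value appears in the
# _2 array) and drops the 'v5' row (the value does not appear in its array)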

1 Answer:

Answer 0 (score: 0)

You have to use a SQL expression:

df.filter("array_contains(_2, _1)").show()
+---+------------+
| _1|          _2|
+---+------------+
| v1|[v1, v2, v3]|
| v4|[v1, v2, v4]|
+---+------------+
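
The same expression can also be wrapped in pyspark.sql.functions.expr, which returns a Column and therefore composes with other DataFrame-API predicates. A minimal sketch, assuming the same df as above:

from pyspark.sql.functions import expr

# expr() parses the SQL fragment into a Column, so both _1 and _2 are
# resolved as column references rather than Python literals
df.filter(expr("array_contains(_2, _1)")).show()

The original call failed because in Spark 2.1 (the version shown in the traceback) the second argument of array_contains must be a literal value, not a Column; newer Spark releases do accept a Column there, but the expression form above works in both cases.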