I'm trying to find a way to filter on one column's value by searching inside another column. If I have a value in one column, I want to verify that the value is also present in the array held in a different column.
I tried the following:
df = sc.parallelize([('v1', ['v1','v2','v3']),('v4', ['v1','v2','v4'])]).toDF()
df.filter(pyspark.sql.functions.array_contains(df._2, df._1)).show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/pyspark/sql/functions.py", line 1648, in array_contains
return Column(sc._jvm.functions.array_contains(_to_java_column(col), value))
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1124, in __call__
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1088, in _build_args
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1075, in _get_args
File "/usr/local/Cellar/apache-spark/2.1.1/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 512, in convert
TypeError: 'Column' object is not callable
What I'm looking for is something similar to:
df.filter(pyspark.sql.functions.array_contains(df._2, 'v4'))
but instead of a static value I want to use the value of column _1.
Answer 0 (score: 0)
In Spark 2.1, the Python array_contains helper only accepts a literal for its second argument (the traceback comes from Py4J failing to convert the Column), so you have to use a SQL expression instead:
df.filter("array_contains(_2, _1)").show()
+---+------------+
| _1| _2|
+---+------------+
| v1|[v1, v2, v3]|
| v4|[v1, v2, v4]|
+---+------------+
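Equivalently, the same SQL fragment can be wrapped in pyspark.sql.functions.expr, which parses it into a Column and therefore composes with other Column expressions. A minimal self-contained sketch (assuming a SparkSession named spark rather than the sc from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

df = spark.sparkContext.parallelize(
    [('v1', ['v1', 'v2', 'v3']), ('v4', ['v1', 'v2', 'v4'])]
).toDF()

# expr() turns the SQL fragment into a Column, so both columns
# can be referenced inside array_contains.
df.filter(expr("array_contains(_2, _1)")).show()

This prints the same two-row table as above; the string form of filter() and the expr() form compile to the same expression, so the choice between them is mostly a matter of style.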