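How do I check whether an instance is in a DataFrame in PySpark?

Time: 2017-09-04 08:37:32

Tags: apache-spark dataframe filter pyspark instance

I have an instance (a single row) extracted from a DataFrame df1, and I want to check whether that instance is also present in another DataFrame df2 in PySpark. Is there a way to do this?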

For example:

Instance:

+------+------+------+
| Atr1 | Atr2 | Atr3 |
+------+------+------+
|  'A' |   2  |  'B' |
+------+------+------+

DataFrame:

+------+------+------+
| Atr1 | Atr2 | Atr3 |
+------+------+------+
|  'C' |   1  |  'B' |
+------+------+------+
|  'D' |   2  |  'A' |
+------+------+------+
|  'E' |   2  |  'C' |
+------+------+------+
|  'A' |   2  |  'B' |
+------+------+------+

In this case I would expect to get True, because the instance is in the DataFrame (row 4).

Thanks.

2 answers:

Answer 0 (score: 1)

PySpark may not be the right tool for this, but here goes anyway:

First, let's create our DataFrames:

df1 = spark.createDataFrame(sc.parallelize([['A', 2, 'B']]), ['Atr1', 'Atr2', 'Atr3'])
df2 = spark.createDataFrame(sc.parallelize([['C',1,'B'],['D',2,'A'],['E',2,'C'],['A',2,'B']]), ['Atr1', 'Atr2', 'Atr3'])

You can use any of the following:

  • subtract

    df1.subtract(df2).count() == 0
    
  • join

    df2.join(df1, ['Atr1', 'Atr2', 'Atr3']).count() > 0
    
  • filter

    df2.filter((df2.Atr1 == 'A') & (df2.Atr2 == 2) & (df2.Atr3 == 'B')).count() > 0
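The filter above hard-codes the three column values. As a minimal sketch, assuming the instance is available as a pyspark.sql.Row (for example from df1.first()), the same condition can be built from the row itself. The helper name contains_row is hypothetical, not part of PySpark. Note that a plain == comparison never matches null values; on Spark 2.3 or later, Column.eqNullSafe may be an option if nulls need to compare as equal.

```python
from functools import reduce
from operator import and_


def contains_row(df, row):
    """Return True if `df` has at least one row equal to `row` in every column.

    `row` is assumed to be a pyspark.sql.Row (anything with .asDict() works).
    """
    # One equality test per column, e.g. df["Atr1"] == "A".
    conditions = [df[name] == value for name, value in row.asDict().items()]
    # Combine the tests with & so that every column has to match.
    predicate = reduce(and_, conditions)
    # take(1) returns a list with at most one Row, so there is no need
    # to count every matching row.
    return len(df.filter(predicate).take(1)) > 0
```

With the DataFrames created above, contains_row(df2, df1.first()) should return True.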
    

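Hope this helps!

Answer 1 (score: 0)

You can take the intersection of df1 and df2 and check whether the row count of df1 equals the row count of the intersection, like this: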

>>> df1 = spark.createDataFrame(sc.parallelize([['A', 2, 'B']]), ['Atr1', 'Atr2', 'Atr3'])
>>> df2 = spark.createDataFrame(sc.parallelize([['C',1,'B'],['D',2,'A'],['E',2,'C'],['A',2,'B']]), ['Atr1', 'Atr2', 'Atr3'])
>>> df1.show() 
+----+----+----+
|Atr1|Atr2|Atr3|
+----+----+----+
|   A|   2|   B|
+----+----+----+

>>> df2.show() 
+----+----+----+
|Atr1|Atr2|Atr3|
+----+----+----+
|   C|   1|   B|
|   D|   2|   A|
|   E|   2|   C|
|   A|   2|   B|
+----+----+----+

>>> df2.intersect(df1).count() == df1.count() 
True

For more information on pyspark.sql.DataFrame.intersect, see the PySpark documentation.
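One caveat with the count comparison: intersect behaves like SQL INTERSECT and returns distinct rows only, so the check would report False if df1 held the same row twice. The following is a minimal sketch that compares against the distinct count instead. The helper name all_rows_present is hypothetical, and it assumes both arguments are DataFrames with the same schema.

```python
def all_rows_present(candidates, target):
    """Return True if every distinct row of `candidates` also occurs in `target`."""
    # intersect() yields distinct rows only, so compare against the
    # distinct count of `candidates` rather than its raw count.
    common = target.intersect(candidates).count()
    return common == candidates.distinct().count()
```

With the DataFrames from this answer, all_rows_present(df1, df2) should return True.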