How do I check whether an instance is in a PySpark dataframe and get a value from the dataframe?

Time: 2017-09-14 12:01:30

Tags: dataframe pyspark row pyspark-sql find-occurrences

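I have an instance, extracted from a dataframe, with 3 different attributes: Atr1, Atr2 and Atr3.

Separately, I have a dataframe with 4 attributes: Atr1, Atr2, Atr3 and Atr4, where Atr1, Atr2 and Atr3 are the same attributes as in the instance mentioned above.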

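So, given the instance above, I want to check whether the dataframe contains a row with the same values for Atr1, Atr2 and Atr3 and, if it does, get that row's value of Atr4. In this case, 'I'.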

2 Answers:

Answer 0: (score: 0)

Would this be an acceptable answer?

df[(df['Atr1'] == row.Atr1) & (df['Atr2'] == row.Atr2) & (df['Atr3'] == row.Atr3)].Atr4

Here row is the instance (a Row) and df is the dataframe you mentioned.

Answer 1: (score: 0)

Hope this helps!

from pyspark.sql.types import Row
from pyspark.sql.functions import col

#sample data
row_list = [Row(Atr1=u'A', Atr2=u'B', Atr3=24),
            Row(Atr1=u'E', Atr2=u'F', Atr3=20),]
df = sc.parallelize([('C', 'B', 21, 'H'),
                     ('D', 'B', 21, 'J'),
                     ('E', 'B', 21, 'K'),
                     ('A', 'B', 24, 'I')]).\
    toDF(["Atr1", "Atr2", "Atr3", "Atr4"])

search_df = df.join(sqlContext.createDataFrame(row_list), ["Atr1", "Atr2", "Atr3"], "right").\
    withColumn("rowItem_Exist", col('Atr4').isNotNull())
search_df.show()

The output is:

+----+----+----+----+-------------+
|Atr1|Atr2|Atr3|Atr4|rowItem_Exist|
+----+----+----+----+-------------+
|   E|   F|  20|null|        false|
|   A|   B|  24|   I|         true|
+----+----+----+----+-------------+