Finding similar columns in two DataFrames with Spark

Date: 2018-02-02 13:27:04

Tags: java apache-spark apache-spark-mllib

I have two DataFrames with data like this:

+-------+--------+------------------+---------+
|ADDRESS|CUSTOMER|      CUSTOMERTIME|   POL   |
+-------+--------+------------------+---------+
|  There|   cust0|3069.4768999023245|578596829|
|  There|   cust0|3069.4768999023245|43831451 |
|   Here|   cust1| 15.29206776391711|578596829|
|  There|   cust0|3069.4768999023245|43831451 |
|   Here|   cust1| 15.29206776391711|578596829|
|   Here|   cust4| 32.04741866436953|43831451 |
+-------+--------+------------------+---------+

 +---------+------------------+------------------+-----+-----+
 |   POLICY|     POLICYENDTIME|   POLICYSTARTTIME|PVAR0|PVAR1|
 +---------+------------------+------------------+-----+-----+
 |578596829|3599.3427299724353|13.433243831334922|    2|    0|
 |578596829|3599.3427299724353|13.433243831334922|    2|    0|
 | 43831451|3712.2672901111655|1744.9884452423225|    0|    6|
 |578596829|3599.3427299724353|13.433243831334922|    2|    0|
 | 43831451|3712.2672901111655|1744.9884452423225|    0|    6|
 | 43831451|3979.2754016079016|3712.2672901111655|    0|    5|
 +---------+------------------+------------------+-----+-----+

Now I want to compare these two DataFrames and find the matching columns on which I could join them in a next step (in this case that would be POLICY and POL). Is there any algorithm or other way to predict this?

1 answer:

Answer 0: (score: 0)

Given df1 and df2, you can find the common columns with:
df1 = sc.parallelize([('1',),('2',)]).toDF(['a'])
df2 = sc.parallelize([('1','2'),('2','3')]).toDF(['a','b'])

>>> set(df1.columns).intersection(set(df2.columns))
set(['a'])

>>> list(set(df1.columns).intersection(set(df2.columns)))
['a']

And this gives the columns that differ:

>>> list(set(df1.columns).symmetric_difference(set(df2.columns)))
['b']
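Note that intersecting column names only works when the matching columns are named identically; in the question, POL and POLICY are not. One heuristic (my own sketch, not part of the answer above) is to compare the distinct values of each column pair and flag pairs with high overlap as join candidates. The sample value sets and the 0.5 threshold below are illustrative assumptions; in a real Spark job you would collect distinct values per column, e.g. via `df.select(c).distinct()`.

```python
# A minimal pure-Python sketch of the value-overlap heuristic: columns whose
# distinct values largely coincide are join candidates even when their names
# differ, as with POL and POLICY above. Data and threshold are illustrative.

def jaccard(a, b):
    """Jaccard similarity of two sets of distinct column values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical distinct values per column, modeled on the question's tables.
df1_values = {
    'POL': {'578596829', '43831451'},
    'CUSTOMER': {'cust0', 'cust1', 'cust4'},
}
df2_values = {
    'POLICY': {'578596829', '43831451'},
    'PVAR0': {'2', '0'},
}

# Keep every cross-DataFrame column pair whose value overlap is high enough.
candidates = [
    (c1, c2, jaccard(v1, v2))
    for c1, v1 in df1_values.items()
    for c2, v2 in df2_values.items()
    if jaccard(v1, v2) >= 0.5
]
print(candidates)  # [('POL', 'POLICY', 1.0)]
```

Collecting all distinct values can be expensive on large DataFrames, so in practice you might sample rows or compare value sketches instead of exact sets.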