I have two DataFrames containing data like this:
+-------+--------+------------------+---------+
|ADDRESS|CUSTOMER| CUSTOMERTIME| POL |
+-------+--------+------------------+---------+
| There| cust0|3069.4768999023245|578596829|
| There| cust0|3069.4768999023245|43831451 |
| Here| cust1| 15.29206776391711|578596829|
| There| cust0|3069.4768999023245|43831451 |
| Here| cust1| 15.29206776391711|578596829|
| Here| cust4| 32.04741866436953|43831451 |
+-------+--------+------------------+---------+
and
+---------+------------------+------------------+-----+-----+
| POLICY| POLICYENDTIME| POLICYSTARTTIME|PVAR0|PVAR1|
+---------+------------------+------------------+-----+-----+
|578596829|3599.3427299724353|13.433243831334922| 2| 0|
|578596829|3599.3427299724353|13.433243831334922| 2| 0|
| 43831451|3712.2672901111655|1744.9884452423225| 0| 6|
|578596829|3599.3427299724353|13.433243831334922| 2| 0|
| 43831451|3712.2672901111655|1744.9884452423225| 0| 6|
| 43831451|3979.2754016079016|3712.2672901111655| 0| 5|
+---------+------------------+------------------+-----+-----+
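Now I want to compare these two DataFrames and find the matching columns on which I can join them in the next step (in this case, POLICY and POL). Is there any algorithm or other method that can predict this?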
Answer 0 (score: 0)
Given df1 and df2, you can find the column names they have in common with:
>>> df1 = sc.parallelize([('1',),('2',)]).toDF(['a'])
>>> df2 = sc.parallelize([('1','2'),('2','3')]).toDF(['a','b'])
>>> set(df1.columns).intersection(set(df2.columns))
set(['a'])
>>> list(set(df1.columns).intersection(set(df2.columns)))
['a']
And this should give the columns that differ, i.e. those present in only one of the two DataFrames:
>>> list(set(df1.columns).symmetric_difference(set(df2.columns)))
['b']
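Note that the set operations above compare column names only, so they would not pair POL with POLICY, whose names differ. One possible alternative, sketched here as an assumption rather than an established method, is to compare the distinct values of each column pair and keep the pairs with high overlap (Jaccard similarity). The helper names `jaccard` and `candidate_join_columns`, the `threshold` default, and the plain-dict inputs are all hypothetical. The sample values are copied from the tables in the question, and the float columns are left out for brevity.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of values, compared as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_join_columns(cols1, cols2, threshold=0.5):
    """Return (col1, col2, score) triples whose value overlap meets the
    threshold, best first. cols1/cols2 map column name -> iterable of values.
    The threshold of 0.5 is an arbitrary illustrative default."""
    pairs = []
    for name1, values1 in cols1.items():
        for name2, values2 in cols2.items():
            score = jaccard(values1, values2)
            if score >= threshold:
                pairs.append((name1, name2, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Distinct values taken from the sample tables in the question
left = {
    "ADDRESS": ["There", "Here"],
    "CUSTOMER": ["cust0", "cust1", "cust4"],
    "POL": [578596829, 43831451],
}
right = {
    "POLICY": [578596829, 43831451],
    "PVAR0": [2, 0],
    "PVAR1": [0, 6, 5],
}

matches = candidate_join_columns(left, right)
print(matches)  # [('POL', 'POLICY', 1.0)]
```

With Spark DataFrames, the per-column values could be gathered with something like `df.select(c).distinct().collect()`, ideally on a sample if the data is large. Values must be normalized to the same type first (for example, stripping whitespace and casting strings to integers), otherwise identical keys will not compare equal.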