Pandas: combining DataFrames

Time: 2016-06-03 18:50:21

Tags: python pandas

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

# Load the pickled 2D arrays; the with-blocks close the files afterwards
with open('JavaSafe.p', 'rb') as f:
    java = pickle.load(f)
with open('PythonSafe.p', 'rb') as f:
    python = pickle.load(f)

javaFrame = pd.DataFrame(java,columns=['Town','Java Jobs'])
pythonFrame = pd.DataFrame(python,columns=['Town','Python Jobs'])
javaFrame = javaFrame.sort_values(by='Java Jobs',ascending=False)
pythonFrame = pythonFrame.sort_values(by='Python Jobs',ascending=False)
print(javaFrame,"\n",pythonFrame)

This code prints the following:

                Town  Java Jobs
435          York,NY       3593
212       NewYork,NY       3585
584       Seattle,WA       2080
624       Chicago,IL       1920
301        Boston,MA       1571
...
79        Holland,MI          5
38      Manhattan,KS          5
497        Vernon,IL          5
30        Clayton,MO          5
90       Waukegan,IL          5

[653 rows x 2 columns] 

                 Town  Python Jobs
160       NewYork,NY         2949
11           York,NY         2938
349       Seattle,WA         1321
91        Chicago,IL         1312
167        Boston,MA         1117

383       Hanover,NH            5
209      Bulverde,TX            5
203     Salisbury,NC            5
67       Rockford,IL            5
256       Ventura,CA            5

[416 rows x 2 columns]

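I want to create a new DataFrame that uses the town name as the index and has one column each for Java and Python. However, some towns will only have results for one of the two languages.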

2 answers:

Answer 0 (score: 3)

import pandas as pd

javaFrame = pd.DataFrame({'Java Jobs': [3593, 3585, 2080, 1920, 1571, 5, 5, 5, 5, 5],
     'Town': ['York,NY', 'NewYork,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA', 'Holland,MI', 'Manhattan,KS', 'Vernon,IL', 'Clayton,MO', 'Waukegan,IL']}, index=[435, 212, 584, 624, 301, 79, 38, 497, 30, 90])
pythonFrame = pd.DataFrame({'Python Jobs': [2949, 2938, 1321, 1312, 1117, 5, 5, 5, 5, 5],
     'Town': ['NewYork,NY', 'York,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA', 'Hanover,NH', 'Bulverde,TX', 'Salisbury,NC', 'Rockford,IL', 'Ventura,CA']}, index=[160, 11, 349, 91, 167, 383, 209, 203, 67, 256])

result = pd.merge(javaFrame, pythonFrame, how='outer').set_index('Town')
#               Java Jobs  Python Jobs
# Town                                
# York,NY          3593.0       2938.0
# NewYork,NY       3585.0       2949.0
# Seattle,WA       2080.0       1321.0
# Chicago,IL       1920.0       1312.0
# Boston,MA        1571.0       1117.0
# Holland,MI          5.0          NaN
# Manhattan,KS        5.0          NaN
# Vernon,IL           5.0          NaN
# Clayton,MO          5.0          NaN
# Waukegan,IL         5.0          NaN
# Hanover,NH          NaN          5.0
# Bulverde,TX         NaN          5.0
# Salisbury,NC        NaN          5.0
# Rockford,IL         NaN          5.0
# Ventura,CA          NaN          5.0
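By default, pd.merge joins two DataFrames on all columns they share. In this case, javaFrame and pythonFrame share only the Town column, so by default pd.merge joins the two DataFrames on Town.

how='outer' causes pd.merge to use the union of the keys from both frames. In other words, it returns a row for every Town that appears in either javaFrame or pythonFrame, even if only one of the DataFrames contains that Town. Missing data is filled with NaN, which is also why the job counts are displayed as floats.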
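The default behaviour described above can also be spelled out explicitly. The following is a minimal sketch, not part of the original answer: it uses a small subset of the sample data, names the join column with on='Town', and adds indicator=True so that a _merge column records which frame each row came from. Filling the NaN values with 0 and casting back to int is one possible choice, assuming a missing town really means zero jobs; the names java_frame, python_frame, merged and counts are illustrative only.

```python
import pandas as pd

# Small subset of the sample data from the answer above
java_frame = pd.DataFrame({'Town': ['York,NY', 'Seattle,WA', 'Holland,MI'],
                           'Java Jobs': [3593, 2080, 5]})
python_frame = pd.DataFrame({'Town': ['York,NY', 'Seattle,WA', 'Hanover,NH'],
                             'Python Jobs': [2938, 1321, 5]})

# Name the key column explicitly and record each row's origin in '_merge'
merged = pd.merge(java_frame, python_frame, on='Town', how='outer', indicator=True)
merged = merged.set_index('Town')

# Assumption: a town missing from one frame has zero jobs for that language
counts = merged[['Java Jobs', 'Python Jobs']].fillna(0).astype(int)

print(merged)
print(counts)
```

Passing on='Town' also guards against the two frames later gaining another column with the same name, which would otherwise silently become part of the join key.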

Answer 1 (score: 1)

Use pd.concat:

df = pd.concat([df.set_index('Town') for df in [javaFrame, pythonFrame]], axis=1)

              Java Jobs  Python Jobs
Boston,MA        1571.0       1117.0
Bulverde,TX         NaN          5.0
Chicago,IL       1920.0       1312.0
Clayton,MO          5.0          NaN
Hanover,NH          NaN          5.0
Holland,MI          5.0          NaN
Manhattan,KS        5.0          NaN
NewYork,NY       3585.0       2949.0
Rockford,IL         NaN          5.0
Salisbury,NC        NaN          5.0
Seattle,WA       2080.0       1321.0
Ventura,CA          NaN          5.0
Vernon,IL           5.0          NaN
Waukegan,IL         5.0          NaN
York,NY          3593.0       2938.0
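As a rough sketch of how this approach relates to an index-based join (not part of the original answer), the example below builds the same table with pd.concat and with DataFrame.join(how='outer') on a small subset of the sample data, then restores the descending Java ordering from the question. It assumes each town appears only once in each frame, since both calls align rows on the Town index; the variable names are illustrative only.

```python
import pandas as pd

# Small subset of the sample data from the first answer
java_frame = pd.DataFrame({'Town': ['York,NY', 'Seattle,WA', 'Holland,MI'],
                           'Java Jobs': [3593, 2080, 5]})
python_frame = pd.DataFrame({'Town': ['York,NY', 'Seattle,WA', 'Hanover,NH'],
                             'Python Jobs': [2938, 1321, 5]})

java_by_town = java_frame.set_index('Town')
python_by_town = python_frame.set_index('Town')

# Column-wise concat aligns rows on the index and keeps the union of towns
via_concat = pd.concat([java_by_town, python_by_town], axis=1)

# An outer join on the index should give the same table
via_join = java_by_town.join(python_by_town, how='outer')

# Row order may differ between the two calls, so sort before comparing
same = via_concat.sort_index().equals(via_join.sort_index())
print(same)

# Re-sort by Java job count, as in the question; NaN rows go last by default
ranked = via_concat.sort_values(by='Java Jobs', ascending=False)
print(ranked)
```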