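OK, I'm relatively new to pandas and Python, so apologies if my question is obvious. I've gone through all the pandas documentation on merge, join, and concatenate, read all the similar questions on Stackoverflow and Scriptscoop, and watched several hours of pandas tutorials on YouTube. I still haven't figured out how to do what I want, which seems like it should be relatively easy in pandas.
Basically I have one DataFrame for each type of positive bacterial result (E. coli, Staph aureus, etc.). Each DataFrame holds a unique ID associated with the patient (Order), along with the result, date, and ward name. A patient can be positive for one type of bacteria or for several, so some order numbers overlap between DataFrames while others appear only once.
For example: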
   Order   Test_EC  Results_EC  Date        Ward Name
0  K70201  E. coli  MODERATE    2014-01-02  North
1  K70277  E. coli  MODERATE    2014-01-02  North
2  K70205  E. coli  FEW         2014-01-02  West
3  K70818  E. coli  MODERATE    2014-01-03  South
4  K70202  E. coli  FEW         2014-01-03  West
5  K80070  E. coli  RARE        2014-01-03  North
6  K80666  E. coli  FEW         2014-01-03  East

   Order   Test_SA   Results_SA  Date        Ward Name
0  K80766  S.aureus  MANY        2014-01-01  West
1  K70201  S.aureus  MANY        2014-01-02  North
2  K70277  S.aureus  MANY        2014-01-02  North
3  K70205  S.aureus  FEW         2014-01-02  West
4  K90107  S.aureus  FEW         2014-01-06  North
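I want to create a master database keyed on the patient's order number, with a column for each positive test and its result, plus the date and ward name. If a patient is positive for one test and negative for another, filling with NaN is fine. If two order numbers from different DataFrames match, they will by definition have the same date and ward name, so the test and result columns are essentially the only new information.
In short, I want to keep all the information contained in each table while having all the data for each order number appear on a single row.
I'm hoping to end up with something that looks like this: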
   Order   Test_EC  Results_EC  Test_SA   Results_SA  Date        Ward Name
0  K70201  E. coli  MODERATE    S.aureus  MANY        2014-01-02  North
1  K70277  E. coli  MODERATE    S.aureus  MANY        2014-01-02  North
2  K70205  E. coli  FEW         S.aureus  FEW         2014-01-02  West
3  K70818  E. coli  MODERATE    NaN       NaN         2014-01-03  South
4  K70202  E. coli  FEW         NaN       NaN         2014-01-03  West
5  K80070  E. coli  RARE        NaN       NaN         2014-01-03  North
6  K80666  E. coli  FEW         NaN       NaN         2014-01-03  East
7  K80766  NaN      NaN         S.aureus  MANY        2014-01-01  West
8  K90107  NaN      NaN         S.aureus  FEW         2014-01-06  North
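As you can see, the resulting DataFrame is three rows shorter than the two tables stacked together (9 rows instead of 12), because three patients were positive for both E. coli and S. aureus. There are no duplicate values in the Order column, yet all the information has been kept.
I'd also like to keep building this database, doing the same thing about twenty times with different bacteria. The actual dataset has around 100,000 unique order numbers.
If I went through every combination of join, merge, and concat functions I've tried, and why each didn't work, this post would be far too long. I know I'm missing something obvious. Any ideas would be greatly appreciated!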
Answer 0 (score: 1)
It looks like you want an 'outer' merge?
In [154]: df1
Out[154]:
   Order   Test_EC  Results_EC  Date        Ward Name
0  K70201  E. coli  MODERATE    2014-01-02  North
1  K70277  E. coli  MODERATE    2014-01-02  North
2  K70205  E. coli  FEW         2014-01-02  West
3  K70818  E. coli  MODERATE    2014-01-03  South
4  K70202  E. coli  FEW         2014-01-03  West
5  K80070  E. coli  RARE        2014-01-03  North
6  K80666  E. coli  FEW         2014-01-03  East

In [155]: df2
Out[155]:
   Order   Test_SA   Results_SA  Date        Ward Name
0  K80766  S.aureus  MANY        2014-01-01  West
1  K70201  S.aureus  MANY        2014-01-02  North
2  K70277  S.aureus  MANY        2014-01-02  North
3  K70205  S.aureus  FEW         2014-01-02  West
4  K90107  S.aureus  FEW         2014-01-06  North

In [156]: df1.merge(df2, how='outer')
Out[156]:
   Order   Test_EC  Results_EC  Date        Ward Name  Test_SA   Results_SA
0  K70201  E. coli  MODERATE    2014-01-02  North      S.aureus  MANY
1  K70277  E. coli  MODERATE    2014-01-02  North      S.aureus  MANY
2  K70205  E. coli  FEW         2014-01-02  West       S.aureus  FEW
3  K70818  E. coli  MODERATE    2014-01-03  South      NaN       NaN
4  K70202  E. coli  FEW         2014-01-03  West       NaN       NaN
5  K80070  E. coli  RARE        2014-01-03  North      NaN       NaN
6  K80666  E. coli  FEW         2014-01-03  East       NaN       NaN
7  K80766  NaN      NaN         2014-01-01  West       S.aureus  MANY
8  K90107  NaN      NaN         2014-01-06  North      S.aureus  FEW
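
Since the question mentions repeating this for about twenty bacteria, here is a rough sketch of how the same outer merge might be chained over a list of DataFrames. The names `df_ec`, `df_sa`, `KEYS`, and `merge_all` are made up for illustration, and the sketch assumes, as the question states, that matching order numbers always carry the same Date and Ward Name. Passing `on=` explicitly just spells out the shared columns that `merge` would otherwise infer. Row order of the result can differ between pandas versions, so don't rely on it.

```python
from functools import reduce

import pandas as pd

# Sample frames rebuilt from the tables in the question.
df_ec = pd.DataFrame({
    "Order": ["K70201", "K70277", "K70205", "K70818", "K70202", "K80070", "K80666"],
    "Test_EC": ["E. coli"] * 7,
    "Results_EC": ["MODERATE", "MODERATE", "FEW", "MODERATE", "FEW", "RARE", "FEW"],
    "Date": ["2014-01-02", "2014-01-02", "2014-01-02", "2014-01-03",
             "2014-01-03", "2014-01-03", "2014-01-03"],
    "Ward Name": ["North", "North", "West", "South", "West", "North", "East"],
})
df_sa = pd.DataFrame({
    "Order": ["K80766", "K70201", "K70277", "K70205", "K90107"],
    "Test_SA": ["S.aureus"] * 5,
    "Results_SA": ["MANY", "MANY", "MANY", "FEW", "FEW"],
    "Date": ["2014-01-01", "2014-01-02", "2014-01-02", "2014-01-02", "2014-01-06"],
    "Ward Name": ["West", "North", "North", "West", "North"],
})

# Columns shared by every per-bacterium frame (an assumption taken from the question).
KEYS = ["Order", "Date", "Ward Name"]


def merge_all(frames, keys=KEYS):
    """Outer-merge a list of DataFrames pairwise on the shared key columns."""
    return reduce(
        lambda left, right: left.merge(right, on=keys, how="outer"),
        frames,
    )


# With twenty bacteria, the list would simply hold twenty frames.
master = merge_all([df_ec, df_sa])
print(master)
```

If Date or Ward Name ever disagreed for the same order number, merging on all three keys would produce two rows for that order, so a quick check such as `master["Order"].is_unique` after merging may be worthwhile.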