问题与熊猫合并两个Excel文件

时间:2019-03-07 12:52:16

标签: python pandas

我有两个Excel文件。

这些文件唯一的共同点是dbsid。

在第一个excel(SQL)中,dbsid称为“示例卡的ID”,在另一个(EMEA)中,dbsid称为“条形码”

import pandas as pd

excel_file = "eu-tracker.xlsx"
sql = pd.read_excel(excel_file, sheet_name=0, date_parser=True)
emea = pd.read_excel(excel_file, sheet_name=1, date_parser=True)

sql.drop_duplicates(inplace=True)
emea.drop_duplicates(inplace=True)

data = pd.merge(left=sql, right=emea, left_on="ID of Sample Card", right_on="Barcode", how="left")

SQL数据框:

      "OrderID"   "Creation Date"   "User ID"   "Days in Lab"   "Gender"    "Sample Date"   "ID of Sample Card" "System Sample ID"  "OrderStatus"       "Sample Received"       ...
493     1234         10.11.1900      20202           3           Male        10.11.1900          5050123            1234             REPORT_AVAILABLE       13.11.1900          ...

EMEA数据框:

        "Barcode"   "Eingangsdatum" "Befunddatum "Befunddatum     "Befunddatum  "Biochemie   "Biochemie     "Ergebnis   "Biochemistry "Diagnosis"   "Diagnosis_2"   "Labornumber"   "Age"   "Sex"
                                 Biochemie"   Biochemie2"     Lyso-GL-1"    Ergebnis"    Ergebnis2"     Lyso-GL-1"  report"     
3123     5050123     13.11.1900      22.11.1900   22.11.1900       23.01.1900   0,178852201   20,11343324     165,4     aberrant        Gaucher      Niemann Pick       184094       65       M

预期数据数据框:

         "OrderID"  "Creation Date"   "User ID" "Days in Lab"   "Gender"    "Sample Date"   "ID of Sample Card" "System Sample ID"  "OrderStatus"       "Sample Received"       ...     "Eingangsdatum" "Befunddatum    "Befunddatum    "Befunddatum    "Biochemie      "Biochemie     "Ergebnis   "Biochemistry "Diagnosis"    "Diagnosis_2"   "Labornumber"   "Age"   "Sex"
                                                                                                                                                                                                    Biochemie"       Biochemie2"     Lyso-GL-1"      Ergebnis"       Ergebnis2"     Lyso-GL-1"  report"     
493        1234      10.11.1900        20202          3          Male        10.11.1900          5050123            1234             REPORT_AVAILABLE       13.11.1900          ...     13.11.1900       22.11.1900      22.11.1900      23.01.1900      0,178852201     20,11343324     165,4      aberrant       Gaucher       Niemann Pick      184094        65       M

我得到的数据数据帧:

        "OrderID"   "Creation Date" "User ID"   "Days in Lab"   "Gender"    "Sample Date"   "ID of Sample Card" "System Sample ID"  "OrderStatus"       "Sample Received"       ...     "Eingangsdatum" "Befunddatum    "Befunddatum    "Befunddatum    "Biochemie      "Biochemie     "Ergebnis   "Biochemistry "Diagnosis"    "Diagnosis_2"   "Labornumber"   "Age"   "Sex"
                                                                                                                                                                                                    Biochemie"       Biochemie2"     Lyso-GL-1"      Ergebnis"       Ergebnis2"     Lyso-GL-1"  report"     
493     1234        10.11.1900       20202           3          Male         10.11.1900          5050123            1234             REPORT_AVAILABLE       13.11.1900          ...         NaN             NaN              NaN            NaN             NaN             NaN           NaN         NaN           NaN             NaN             NaN          NaN     NaN

SQL数据框信息:

RangeIndex: 2443 entries, 0 to 2442
Data columns (total 64 columns):
OrderID                                                                      2443 non-null float64
Creation Date                                                                2443 non-null datetime64[ns]
User ID                                                                      2443 non-null float64
Days in Lab                                                                  2443 non-null object
Gender                                                                       2443 non-null object
Sample Date                                                                  2443 non-null datetime64[ns]
ID of Sample Card                                                            2443 non-null object
System Sample ID                                                             2443 non-null float64
OrderStatus                                                                  2443 non-null object
Sample Received                                                              2443 non-null object
dtypes: datetime64[ns](2), float64(3), int64(41), object(18)
memory usage: 1.2+ MB

Emea数据框信息:

RangeIndex: 3134 entries, 0 to 3133
Data columns (total 14 columns):
Barcode                   3134 non-null object
Eingangsdatum             3134 non-null datetime64[ns]
Befunddatum Biochemie     2973 non-null object
Befunddatum Biochemie2    1413 non-null object
Befunddatum Lyso-GL-1     151 non-null object
Biochemie Ergebnis        2973 non-null float64
Biochemie Ergebnis2       1476 non-null float64
Ergebnis Lyso-GL-1        151 non-null float64
Biochemistry report       3134 non-null object
Diagnosis                 2972 non-null object
Diagnosis_2               1475 non-null object
Labornummer               3134 non-null object
Alter                     3134 non-null int64
Sex                       3134 non-null object
dtypes: datetime64[ns](1), float64(3), int64(1), object(9)
memory usage: 342.9+ KB

执行这些步骤后,文件将具有更多的标头,而没有其他文件中的数据。我也尝试加入,但效果不佳。

我不知道该如何将两者结合起来。

2 个答案:

答案 0 :(得分:0)

sql.["ID of Sample Card"]emea.["Barcode"]均为object数据类型。我无法从原始问题中的样本数据中确定它们是否具有前导或尾随空格,但是即使数据看起来相同,也可能使两个数据框的合并变得混乱。

如果您确信两列都是数字列和非空列,则可以使用astype将它们转换为整数,但是您可能需要首先清理数据。例如:

sql["ID of Sample Card"] = sql["ID of Sample Card"].str.strip().astype('int')
emea["Barcode"] = emea["Barcode"].str.strip().astype('int')

答案 1 :(得分:0)

问题是两个系列的对象类型都在哪里。

将两个系列都转换为整数

sql["ID of Sample Card"] = pd.to_numeric(sql["ID of Sample Card"], errors="coerce", downcast="integer")
emea["Barcode"] = pd.to_numeric(emea["Barcode"], errors="coerce", downcast="integer")

之后,我可以毫无问题地合并它们

data = pd.merge(left=sql, right=emea, left_on="ID of Sample Card", right_on="Barcode", how="left")

与以上答案的区别在于,该系列中的所有非数字字段均为NaN