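I have two Excel files.
The only thing they have in common is the dbsid.
In the first Excel sheet (SQL) the dbsid is called "ID of Sample Card"; in the other (EMEA) it is called "Barcode".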
import pandas as pd
excel_file = "eu-tracker.xlsx"
sql = pd.read_excel(excel_file, sheet_name=0, date_parser=True)
emea = pd.read_excel(excel_file, sheet_name=1, date_parser=True)
sql.drop_duplicates(inplace=True)
emea.drop_duplicates(inplace=True)
data = pd.merge(left=sql, right=emea, left_on="ID of Sample Card", right_on="Barcode", how="left")
SQL DataFrame:
"OrderID" "Creation Date" "User ID" "Days in Lab" "Gender" "Sample Date" "ID of Sample Card" "System Sample ID" "OrderStatus" "Sample Received" ...
493 1234 10.11.1900 20202 3 Male 10.11.1900 5050123 1234 REPORT_AVAILABLE 13.11.1900 ...
EMEA DataFrame:
"Barcode" "Eingangsdatum" "Befunddatum Biochemie" "Befunddatum Biochemie2" "Befunddatum Lyso-GL-1" "Biochemie Ergebnis" "Biochemie Ergebnis2" "Ergebnis Lyso-GL-1" "Biochemistry report" "Diagnosis" "Diagnosis_2" "Labornumber" "Age" "Sex"
3123 5050123 13.11.1900 22.11.1900 22.11.1900 23.01.1900 0,178852201 20,11343324 165,4 aberrant Gaucher Niemann Pick 184094 65 M
Expected data DataFrame:
"OrderID" "Creation Date" "User ID" "Days in Lab" "Gender" "Sample Date" "ID of Sample Card" "System Sample ID" "OrderStatus" "Sample Received" ... "Eingangsdatum" "Befunddatum Biochemie" "Befunddatum Biochemie2" "Befunddatum Lyso-GL-1" "Biochemie Ergebnis" "Biochemie Ergebnis2" "Ergebnis Lyso-GL-1" "Biochemistry report" "Diagnosis" "Diagnosis_2" "Labornumber" "Age" "Sex"
493 1234 10.11.1900 20202 3 Male 10.11.1900 5050123 1234 REPORT_AVAILABLE 13.11.1900 ... 13.11.1900 22.11.1900 22.11.1900 23.01.1900 0,178852201 20,11343324 165,4 aberrant Gaucher Niemann Pick 184094 65 M
The data DataFrame I actually get:
"OrderID" "Creation Date" "User ID" "Days in Lab" "Gender" "Sample Date" "ID of Sample Card" "System Sample ID" "OrderStatus" "Sample Received" ... "Eingangsdatum" "Befunddatum Biochemie" "Befunddatum Biochemie2" "Befunddatum Lyso-GL-1" "Biochemie Ergebnis" "Biochemie Ergebnis2" "Ergebnis Lyso-GL-1" "Biochemistry report" "Diagnosis" "Diagnosis_2" "Labornumber" "Age" "Sex"
493 1234 10.11.1900 20202 3 Male 10.11.1900 5050123 1234 REPORT_AVAILABLE 13.11.1900 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
SQL DataFrame info:
RangeIndex: 2443 entries, 0 to 2442
Data columns (total 64 columns):
OrderID 2443 non-null float64
Creation Date 2443 non-null datetime64[ns]
User ID 2443 non-null float64
Days in Lab 2443 non-null object
Gender 2443 non-null object
Sample Date 2443 non-null datetime64[ns]
ID of Sample Card 2443 non-null object
System Sample ID 2443 non-null float64
OrderStatus 2443 non-null object
Sample Received 2443 non-null object
dtypes: datetime64[ns](2), float64(3), int64(41), object(18)
memory usage: 1.2+ MB
EMEA DataFrame info:
RangeIndex: 3134 entries, 0 to 3133
Data columns (total 14 columns):
Barcode 3134 non-null object
Eingangsdatum 3134 non-null datetime64[ns]
Befunddatum Biochemie 2973 non-null object
Befunddatum Biochemie2 1413 non-null object
Befunddatum Lyso-GL-1 151 non-null object
Biochemie Ergebnis 2973 non-null float64
Biochemie Ergebnis2 1476 non-null float64
Ergebnis Lyso-GL-1 151 non-null float64
Biochemistry report 3134 non-null object
Diagnosis 2972 non-null object
Diagnosis_2 1475 non-null object
Labornummer 3134 non-null object
Alter 3134 non-null int64
Sex 3134 non-null object
dtypes: datetime64[ns](1), float64(3), int64(1), object(9)
memory usage: 342.9+ KB
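After these steps, the result has the extra column headers but none of the data from the other sheet. I also tried join, but that did not work either.
I don't know how to combine the two.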
Answer 0 (score: 0)
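sql["ID of Sample Card"] and emea["Barcode"] both have the object dtype. I can't tell from the sample data in the question whether the values contain leading or trailing whitespace, but that can break the merge of the two DataFrames even when the data looks identical.
If you are confident that both columns are numeric and contain no nulls, you can convert them to integers with astype, although you may need to clean the data first. For example: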
sql["ID of Sample Card"] = sql["ID of Sample Card"].str.strip().astype('int')
emea["Barcode"] = emea["Barcode"].str.strip().astype('int')
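To check whether stray whitespace in the keys really is the cause, a small self-contained sketch may help. The two frames and their values below are invented for illustration and are not taken from the question's file; count_matches is a hypothetical helper. It relies on the indicator=True option of pd.merge, which adds a _merge column telling you whether each row found a partner.

```python
import pandas as pd

# Hypothetical miniature versions of the two sheets (invented values).
# The keys look the same when printed but differ by whitespace.
sql = pd.DataFrame({
    "ID of Sample Card": ["5050123 ", "5050124"],
    "OrderStatus": ["REPORT_AVAILABLE", "PENDING"],
})
emea = pd.DataFrame({
    "Barcode": ["5050123", " 5050124"],
    "Diagnosis": ["Gaucher", "Niemann Pick"],
})


def count_matches(left, right):
    """Left-merge on the two key columns and count rows that found a partner."""
    merged = pd.merge(left, right, left_on="ID of Sample Card",
                      right_on="Barcode", how="left", indicator=True)
    return int((merged["_merge"] == "both").sum())


# Raw keys: no row matches, so every EMEA column would be NaN.
matches_before = count_matches(sql, emea)

# Strip whitespace on copies of both frames, then merge again.
sql_clean = sql.assign(**{"ID of Sample Card": sql["ID of Sample Card"].str.strip()})
emea_clean = emea.assign(Barcode=emea["Barcode"].str.strip())
matches_after = count_matches(sql_clean, emea_clean)

print(matches_before, matches_after)
```

If the real columns mix actual integers with strings, .str.strip() returns NaN for the non-string entries, so converting with .astype(str) before stripping may be the safer route in that case.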
Answer 1 (score: 0)
The problem was that both Series had the object dtype.
Converting both Series to integers:
sql["ID of Sample Card"] = pd.to_numeric(sql["ID of Sample Card"], errors="coerce", downcast="integer")
emea["Barcode"] = pd.to_numeric(emea["Barcode"], errors="coerce", downcast="integer")
After that, I could merge them without any problems:
data = pd.merge(left=sql, right=emea, left_on="ID of Sample Card", right_on="Barcode", how="left")
The difference from the answer above is that every non-numeric value in the Series becomes NaN.
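That difference can be seen on a tiny example. This is only a sketch on an invented Series (the name barcodes and its values are assumptions, not data from the question): errors="coerce" replaces the unparseable entry with NaN, while astype(int) raises on the same input.

```python
import pandas as pd

# Hypothetical barcode column with one non-numeric entry (invented data).
barcodes = pd.Series(["5050123", "5050124", "n/a"], dtype=object)

# errors="coerce" turns anything that cannot be parsed into NaN.
coerced = pd.to_numeric(barcodes, errors="coerce")

# astype(int) refuses the same input and raises ValueError instead.
try:
    barcodes.astype(int)
    astype_failed = False
except ValueError:
    astype_failed = True

print(coerced)
print("astype(int) failed:", astype_failed)
```

Because NaN is a floating-point value, the coerced Series here ends up as float64 rather than an integer dtype. Note also that pd.merge matches null keys against each other, so it may be worth dropping rows whose key became NaN before merging.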