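I have the following code, which reads a CSV file and then analyzes it. A patient can have more than one illness, and I need to find how many times each illness occurs across all patients. But the query given here
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
is so slow that it takes more than 15 minutes. Is there a way to make the query faster?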
import pandas as pd
raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')
data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
illnesses = pd.DataFrame({"Finding_Label":[],
                          "Count_of_Patientes_Having":[],
                          "Count_of_Times_Being_Shown_In_An_Image":[]})
ids = raw_data["Patient ID"].drop_duplicates()
index = 0
for ctr in data[:1]:
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
    for i in ids:
        illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
    index = index + 1
Part of the data frame:
Raw Data
Finding Labels - Patient ID
IllnessA|IllnessB - 1
IllnessA - 2
Answer 0 (score: 0)
As far as I understand, ctr stands for the name of an illness.
When you run this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
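you are filtering not only for rows that have the illness, but also for rows with one specific patient ID. If you have many patients, that query has to run many times. A simpler approach is to skip the patient ID filter and count all rows that have the illness. That would be: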
raw_data[raw_data['Finding Labels'].str.contains(ctr)].size
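In this case, since what you want is the number of rows, you are looking for len rather than size (size is the number of cells in the data frame, that is, rows times columns).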
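To make the difference between size and len concrete, here is a minimal sketch on a made-up three-row frame. The sample values are hypothetical, and regex=False is an added assumption so the label is matched as a literal substring.

```python
import pandas as pd

# Hypothetical toy frame: 3 rows, 2 columns
df = pd.DataFrame({
    "Finding Labels": ["Hernia", "Mass|Hernia", "Nodule"],
    "Patient ID": [1, 1, 2],
})

# regex=False treats the label as a literal substring, not a pattern
matches = df[df["Finding Labels"].str.contains("Hernia", regex=False)]

n_cells = matches.size  # rows * columns: 2 * 2 = 4
n_rows = len(matches)   # rows only: 2
print(n_cells, n_rows)
```

With a 12-column frame, size would be twelve times the row count, which is presumably why the question divides by 12.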
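Finally, another source of error in the current code is that you are not keeping a running count across patient IDs. You need to increment illnesses.at[index, "Count_of_Patientes_Having"] rather than setting it to a new value each time.
Assuming you want to keep the illness name and the index separate, the code would look like this (for the last few lines):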
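for index, ctr in enumerate(data[:1]):
    illnesses.at[index, "Finding_Label"] = ctr
    # NOTE: "/ 12" is kept from the question; if it was the column count used to
    # turn .size (cells) into rows, it is no longer needed once len() is used
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
    # Number of rows whose label contains this illness
    illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])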
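I took the liberty of using enumerate to handle the index in a more Pythonic way. I am also not quite sure what "Count_of_Times_Being_Shown_In_An_Image" is meant to be, but I assume you had the same confusion between size and len there.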
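If "Count_of_Patientes_Having" is meant to be the number of distinct patients (an assumption, since the question does not define it precisely), one possible sketch builds a single mask per illness and calls nunique on the matching patient IDs, so no inner loop over patients is needed. The sample frame below is hypothetical and only mimics the two columns shown in the question.

```python
import pandas as pd

# Hypothetical sample in the shape described in the question
raw_data = pd.DataFrame({
    "Finding Labels": ["Hernia|Mass", "Hernia", "Mass", "Nodule", "Hernia"],
    "Patient ID": [1, 1, 2, 2, 3],
})
data = ["Hernia", "Mass", "Nodule"]

rows = []
for ctr in data:
    # One boolean mask per illness; regex=False matches the literal name
    mask = raw_data["Finding Labels"].str.contains(ctr, regex=False)
    rows.append({
        "Finding_Label": ctr,
        # Distinct patients among the matching rows
        "Count_of_Patientes_Having": raw_data.loc[mask, "Patient ID"].nunique(),
        # Number of matching rows (one row per image)
        "Count_of_Times_Being_Shown_In_An_Image": int(mask.sum()),
    })

# Build the result frame once instead of assigning cell by cell
illnesses = pd.DataFrame(rows)
print(illnesses)
```

Here the work per illness is one string scan plus one nunique call, regardless of how many patients there are.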
Answer 1 (score: 0)
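The likely reason your code is slow is that you are growing a data frame row by row inside a loop, which can involve multiple in-memory copies. In general, this is reminiscent of general-purpose Python rather than Pandas programming, which ideally handles data with vectorized, block-wise operations.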
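As a small illustration of loop style versus vectorized style, the sketch below computes per-patient counts for one illness both ways. The sample frame and the names loop_counts and vector_counts are hypothetical.

```python
import pandas as pd

# Hypothetical sample mimicking the two columns from the question
raw_data = pd.DataFrame({
    "Finding Labels": ["Hernia|Mass", "Hernia", "Mass", "Nodule", "Hernia"],
    "Patient ID": [1, 1, 2, 2, 3],
})

mask = raw_data["Finding Labels"].str.contains("Hernia", regex=False)

# Loop style: one full scan of the frame per patient
loop_counts = {
    pid: len(raw_data[mask & (raw_data["Patient ID"] == pid)])
    for pid in raw_data["Patient ID"].drop_duplicates()
}

# Vectorized style: one grouped pass over the matching rows
vector_counts = raw_data[mask].groupby("Patient ID").size()

print(loop_counts)
print(vector_counts.to_dict())
```

The grouped version only lists patients that have at least one matching row, whereas the loop also reports zeros; both agree on the non-zero counts.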
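Consider cross joining your data (assuming it is a reasonable size) with the list of illnesses, so that Finding Labels lines up with every illness in the same row, and then keep only the rows where the longer string contains the shorter item. Then run a couple of groupby() calls to return the count and the distinct count of patients.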
# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
                    .merge(pd.DataFrame({'ills':ills, 'key':1}), on='key')
                    .drop(columns=['key'])
           )
# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]
# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
    return (grp.groupby('Patient ID').size()).size
illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
                          'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
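The same cross-join idea can also be sketched more compactly. This variant is not the code above: it assumes pandas 1.2 or later, where merge accepts how="cross", and it swaps the helper function for named aggregation with nunique. The sample frame and the names pairs and keep are hypothetical.

```python
import pandas as pd

# Hypothetical sample mimicking the two columns from the question
raw_data = pd.DataFrame({
    "Finding Labels": ["Hernia|Mass", "Hernia", "Mass", "Nodule", "Hernia"],
    "Patient ID": [1, 1, 2, 2, 3],
})
ills = ["Hernia", "Mass", "Nodule"]

# Pair every image row with every illness (assumes pandas >= 1.2)
pairs = raw_data.merge(pd.DataFrame({"ills": ills}), how="cross")

# Keep only pairs where the illness name occurs in the label string
keep = [ill in labels for ill, labels in zip(pairs["ills"], pairs["Finding Labels"])]
pairs = pairs[keep]

# One groupby yields both the row count and the distinct patient count
illnesses = pairs.groupby("ills").agg(
    Count_of_Times_Being_Shown_In_An_Image=("Patient ID", "size"),
    Count_of_Patients_Having=("Patient ID", "nunique"),
)
print(illnesses)
```

The cross join multiplies the row count by the number of illnesses, so this is only practical when that product fits comfortably in memory.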
To demonstrate, consider the seeded random input data and the output below.
Input data (attempting to mirror the original data)
import numpy as np
import pandas as pd
alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
        "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
        "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
np.random.seed(542019)
# np.char.add is the public alias of np.core.defchararray.add (elementwise string concatenation)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
                         'Finding Labels': np.char.add(
                             np.char.add(np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
                                         np.random.choice(ills, 25).astype('str')),
                             np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
                        })
print(raw_data.head(10))
# Patient ID Finding Labels
# 0 r xPNPneumothoraxXYm
# 1 python ScSInfiltration9Ud
# 2 stata tJhInfiltrationJtG
# 3 r thLPneumoniaWdr
# 4 stata thYAtelectasis6iW
# 5 sas 2WLPneumonia1if
# 6 julia OPEConsolidationKq0
# 7 sas UFFCardiomegaly7wZ
# 8 stata 9NQHerniaMl4
# 9 python NB8HerniapWK
Output (after running the above process)
print(illnesses)
# Count_of_Times_Being_Shown_In_An_Image Count_of_Patients_Having
# ills
# Atelectasis 3 1
# Cardiomegaly 2 1
# Consolidation 1 1
# Effusion 1 1
# Emphysema 1 1
# Fibrosis 2 2
# Hernia 4 3
# Infiltration 2 2
# Mass 1 1
# Nodule 2 2
# Pleural_Thickening 1 1
# Pneumonia 3 3
# Pneumothorax 2 2