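I have a very large DataFrame with a column named patient.drug. Each element of this column is a list of dictionaries. I want to select all rows whose patient.drug column contains the word 'NIFEDIPINE'.
Because the DataFrame is so large, here is a small sample: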
patient.drug
0 [{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}]
1 [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
2 [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
3 [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]
So far I have tried:
df[df['patient.drug'].str.contains('NIFEDIPINE')]
But it gives me an error:
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Float64Index([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n ...\n nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],\n dtype='float64', length=12000)] are in the [columns]"
I also tried using the in operator while iterating over the rows:
lst = []
for i in range(len(df)):
    if 'NIFEDIPINE' in df.loc[i, "patirnt.drug"]:
        lst.append(i)
print(lst)
This also gives me an error. What should I do to get this right?
Answer 0 (score: 1)
Assuming your column is laid out like this, with the search string 'NIFEDIPINE' appearing in the 2nd and 4th entries:
import pandas as pd

data = {'patient.drug':
    [[{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}],
     [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}],
     [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}],
     [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}],
    ]
}
df = pd.DataFrame(data)
patient.drug
0 [{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}]
1 [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}] <=== keyword here
2 [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
3 [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}] <=== keyword here
(Layout taken from your previous question.)
Solution:
[Updated to 1) support multiple dicts in a list and 2) allow partial string matching.]
Use .loc + .explode() + .apply():
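keyword = 'NIFEDIPINE'
# Series.any(level=0) was removed in pandas 2.0; groupby(level=0).any() is the equivalent
df.loc[df['patient.drug'].explode().apply(lambda d: keyword in ' '.join(d.values())).groupby(level=0).any()]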
Result:
The rows containing the keyword string 'NIFEDIPINE' are extracted and displayed:
patient.drug
1 [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
3 [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]
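The one-liner above assumes that every list is non-empty and that every dict value is a string. As a minimal sketch under looser assumptions, the helper below checks only the medicinalproduct field and treats empty or missing cells as non-matches. The name contains_drug is hypothetical, and the last three sample rows are made up for illustration.

```python
import pandas as pd


def contains_drug(entries, keyword, field="medicinalproduct"):
    """Return True if any dict in `entries` has `keyword` inside its `field` value."""
    if not isinstance(entries, list):  # NaN, None, or any other non-list cell
        return False
    return any(
        isinstance(d, dict) and keyword in str(d.get(field, ""))
        for d in entries
    )


# Illustrative data only: rows 2 to 4 are invented edge cases.
df = pd.DataFrame({"patient.drug": [
    [{"drugcharacterization": "1", "medicinalproduct": "PANDOL"}],
    [{"drugcharacterization": "2", "medicinalproduct": "NIFEDIPINE"}],
    [],      # empty list: no drugs recorded
    None,    # missing cell
    [{"drugcharacterization": "4", "medicinalproduct": "NIFEDIPINE RETARD"},
     {"drugcharacterization": "1", "medicinalproduct": "PANDOL"}],
]})

# Extra keyword arguments to Series.apply are forwarded to the function.
mask = df["patient.drug"].apply(contains_drug, keyword="NIFEDIPINE")
result = df[mask]
print(result.index.tolist())  # [1, 4]
```

Restricting the search to one field avoids accidental hits in other values, at the cost of having to know the field name in advance.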
Answer 1 (score: 1)
After reproducing the data:
>>> df
patient.drug
0 [{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}]
1 [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
2 [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
3 [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
4 [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]
Using your code:
>>> df[df['patient.drug'].str.contains('NIFEDIPINE')]
Error:
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Float64Index([nan, nan, nan, nan, nan], dtype='float64')] are in the [columns]"
Casting the column to str first makes the same call work:
>>> df[df['patient.drug'].astype('str').str.contains('NIFEDIPINE')]
patient.drug
1 [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
4 [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]
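One thing to keep in mind with the astype('str') approach is that the search runs over the text representation of the whole list, so dict keys and other fields can match as well. The sketch below illustrates this on a three-row sample; passing regex=False is an addition here, so that the keyword is treated as a literal substring rather than a pattern.

```python
import pandas as pd

df = pd.DataFrame({"patient.drug": [
    [{"drugcharacterization": "1", "medicinalproduct": "PANDOL"}],
    [{"drugcharacterization": "2", "medicinalproduct": "NIFEDIPINE"}],
    [{"drugcharacterization": "3", "medicinalproduct": "SIMVASTATIN"}],
]})

# Each list is turned into its printed form, e.g. "[{'drugcharacterization': ...}]".
as_text = df["patient.drug"].astype(str)

# Literal substring search over that text.
hits = df[as_text.str.contains("NIFEDIPINE", regex=False)]
print(hits.index.tolist())  # [1]

# Caveat: a dict key is part of the text too, so it matches every row.
key_hits = df[as_text.str.contains("medicinalproduct", regex=False)]
print(len(key_hits))  # 3
```

For a drug name that cannot collide with a key or another field this is usually fine; otherwise a field-aware check is safer.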
Note:
The KeyError is raised by the following check in the pandas indexing code:
--> pandas/core/indexing.py
# Count missing values:
missing_mask = indexer < 0
missing = (missing_mask).sum()

if missing:
    if missing == len(indexer):
        axis_name = self.obj._get_axis_name(axis)
        raise KeyError(f"None of [{key}] are in the [{axis_name}]")

    # We (temporarily) allow for some missing keys with .loc, except in
    # some cases (e.g. setting) in which "raise_missing" will be False
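To see why that check is reached at all, it helps to look at the mask itself. The sketch below, on a two-row sample, suggests that .str.contains yields a missing value for every list element, so the result is not a boolean mask; in the traceback above pandas then appears to treat those NaN values as column labels, none of which exist. The exact exception may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"patient.drug": [
    [{"drugcharacterization": "1", "medicinalproduct": "PANDOL"}],
    [{"drugcharacterization": "2", "medicinalproduct": "NIFEDIPINE"}],
]})

# The elements are lists, not strings, so there is no text to search and
# every row comes back as a missing value instead of True/False.
raw_mask = df["patient.drug"].str.contains("NIFEDIPINE")
print(raw_mask.isna().all())

# A mask built element by element is a proper boolean Series.
bool_mask = df["patient.drug"].apply(lambda cell: "NIFEDIPINE" in str(cell))
print(bool_mask.tolist())  # [False, True]
```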
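Answer 2 (score: 0)
It is not clear whether each element of your column is a list of dicts or just a dict, so here is a solution for both cases. If each element is a dict: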
import pandas as pd
a = [1, 2, 3, 4, 6]
b = [{'a':'A'}, {'b':'B'}, {'c':'C'}, {'d':'D'}, {'e':'E'}]
df = pd.DataFrame({'A': a, 'B': b})
df[df['B'].apply(lambda x: 'a' in x)]
This gives the output:
   A           B
0  1  {'a': 'A'}
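In your case, if each element is a plain dict, test the dict's values, because 'NIFEDIPINE' is a value rather than a key:
df[df['patient.drug'].apply(lambda x: 'NIFEDIPINE' in x.values())]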
If each element is a list of dicts instead:
import pandas as pd

a = [1, 2, 3, 4, 6]
b = [[{'a':'A'}], [{'b':'B'}], [{'c':'C'}], [{'d':'D'}], [{'e':'E'}]]
df = pd.DataFrame({'A': a, 'B': b})

def check(key, dict_list):
    # True if any dict in the list has `key` among its keys
    for d in dict_list:
        if key in d:
            return True
    return False

df[df['B'].apply(lambda x: check('a', x))]
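Answer 3 (score: 0)
You can use isin (this compares whole elements, so it assumes the column holds plain drug-name strings):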
drug_name = ['NIFEDIPINE']
df_NIFEDIPINE = df[df['patient.drug'].isin(drug_name)].reset_index()
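For the list-of-dicts layout from the question, isin needs the drug names pulled out of the dicts first. The following is one possible sketch, assuming every list is non-empty and every dict has a medicinalproduct key. It uses reset_index(drop=True) rather than reset_index() so that the old index is not kept as an extra column.

```python
import pandas as pd

df = pd.DataFrame({"patient.drug": [
    [{"drugcharacterization": "1", "medicinalproduct": "PANDOL"}],
    [{"drugcharacterization": "2", "medicinalproduct": "NIFEDIPINE"}],
    [{"drugcharacterization": "3", "medicinalproduct": "SIMVASTATIN"}],
    [{"drugcharacterization": "4", "medicinalproduct": "NIFEDIPINE"}],
]})

drug_name = ["NIFEDIPINE"]

# One row per dict; each row keeps the index label of the list it came from.
per_drug = df["patient.drug"].explode()
# Extract the product name from each dict (assumes non-empty lists of dicts).
names = per_drug.map(lambda d: d.get("medicinalproduct"))
# Exact membership test, then collapse back to one flag per original row.
mask = names.isin(drug_name).groupby(level=0).any()

df_NIFEDIPINE = df[mask].reset_index(drop=True)
print(df_NIFEDIPINE)
```

Because isin does exact matching, several drug names can be listed in drug_name at once, but partial names will not match.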