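I have data like the following in an Excel file, and I load it into my program with pandas:
I need to go through every entry in each row of the "IPC" column and count the entries by their first 4 characters (for example, A61K038/51 => A61K). However, most rows contain several entries, separated by semicolons.
My idea is to iterate over the rows first, and then iterate over the entries within each row. I know how to do this with other data types, but I am new to pandas, and the pandas DataFrame makes things more complicated. Please help! Any guidance on the best approach would be appreciated.
Edit: the first 20 rows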
Company Name ... IPC
0 Phoenix Pharmacologics Inc ... A61K038/51;A61K038/21;A61K031/7076;A61K031/707...
1 Phoenix Pharmacologics Inc ... A61K038/46;C12N009/80
2 Phoenix Pharmacologics Inc ... A61K038/43
3 Phoenix Pharmacologics Inc ... A61K038/50;A61K045/06;A61K047/48
4 Phoenix Pharmacologics Inc ... A61K038/44;C12N009/06
5 Phoenix Pharmacologics Inc ... C07K014/525;C12N009/78;C12N015/81
6 Phoenix Pharmacologics Inc ... A61K038/00;C12N009/06
7 Phoenix Pharmacologics Inc ... C12Q001/68
8 Phoenix Pharmacologics Inc ... A61K038/50;C12N009/78
9 Phoenix Pharmacologics Inc ... C12N011/06;C12N009/96;C12N009/06;A61K038/44
10 Phoenix Pharmacologics Inc ... C12N009/14
11 Phoenix Pharmacologics Inc ... C12N011/06;C12N009/06;C12N009/96;C12N011/08
12 Phoenix Pharmacologics Inc ... A61K038/00;A61K047/48;C12N009/78;C12N009/96
13 Phoenix Pharmacologics Inc ... A61K038/00;C07K014/525
14 Phytoceutica, Inc ... A61K036/539;A61P035/00;A61K036/484;A61K036/725...
15 Phytoceutica, Inc ... A01N065/00
16 Phytoceutica, Inc ... A61K036/00
17 Phytoceutica, Inc ... G01N033/48;G06F017/00
18 Phytoceutica, Inc ... C12Q001/68;C12Q001/68;G06F019/00;G06F019/00
19 Phytoceutica, Inc ... G06F019/00
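Answer 0 (score: 1)
If you want to count the distinct 4-character prefixes in each row, you can define a function that does this and apply it to the DataFrame, as follows: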
import numpy as np
import pandas as pd

df = pd.DataFrame({'IPC': ['A61K038/52;A61K038/21', 'A61K038/46;C12N009/80']})

def count_ipc(ipc):
    items = ipc.split(';')
    items = [val[:4] for val in items]  # keep the first 4 characters of each code
    values = np.unique(items)  # distinct prefixes in this row
    return len(values)  # number of distinct prefixes, not number of codes

df['cnt'] = df['IPC'].apply(count_ipc)
The result is:
IPC cnt
0 A61K038/52;A61K038/21 1
1 A61K038/46;C12N009/80 2
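If what is needed is how often each prefix occurs, rather than how many distinct prefixes a row has, the nested iteration described in the question can be sketched with collections.Counter. This is only a minimal sketch: prefix_counts, per_row and total are hypothetical names introduced here, and it assumes every cell in the column is a string.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'IPC': ['A61K038/52;A61K038/21', 'A61K038/46;C12N009/80']})

def prefix_counts(ipc):
    # Hypothetical helper: tally the 4-character prefixes within one cell.
    # Empty fragments (e.g. from a trailing ';') are skipped.
    return Counter(code.strip()[:4] for code in ipc.split(';') if code.strip())

# One Counter per row
per_row = df['IPC'].apply(prefix_counts)

# Adding the row Counters together gives the totals for the whole column
total = sum(per_row, Counter())
print(total)
```

Keeping one Counter per row makes it possible to inspect individual rows, while the summed Counter answers the column-wide question.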
Answer 1 (score: 0)
You can use pandas.Series.str.split and chain it with pandas.Series.str.len to get the number of entries in each row:
Example data
# Example DataFrame
import pandas as pd
df = pd.DataFrame({'IPC':['A61K038/51;A61K038/21;A61k031', 'A80934;A758392']})
print(df)
IPC
0 A61K038/51;A61K038/21;A61k031
1 A80934;A758392
Apply split and len
df['count'] = df.IPC.str.split(';').str.len()
print(df)
IPC count
0 A61K038/51;A61K038/21;A61k031 3
1 A80934;A758392 2
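The split above counts entries per row but does not group them by prefix. One possible extension, sketched here under the assumption that pandas 0.25 or later is available (for Series.explode), chains the same split with explode, a string slice and value_counts. The name prefix_totals is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'IPC': ['A61K038/51;A61K038/21;A61k031', 'A80934;A758392']})

prefix_totals = (
    df['IPC']
    .str.split(';')   # each cell becomes a list of codes
    .explode()        # one code per row, original index repeated
    .str.strip()      # drop stray whitespace around a code
    .str[:4]          # keep the first 4 characters
    .value_counts()   # occurrences of each prefix across the column
)
print(prefix_totals)
```

The comparison is case-sensitive, so 'A61K' and 'A61k' are counted separately here; adding .str.upper() before the slice would merge them if that is the desired behavior.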
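Answer 2 (score: 0)
Another option, using apply with a helper function that counts the entries sharing the prefix of the first entry in the row: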
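import pandas as pd

df = pd.DataFrame({'IPC': ['A61K038/52;A61K038/21;A61K038', 'A61K038/46;C12N009/80']})

def counter(ipc):
    temp = ipc.split(';')
    first_4 = temp[0][:4]  # prefix of the first entry in the row
    # count the entries in this row that start with that prefix
    return sum(1 for i in temp if i.startswith(first_4))

df['cnt'] = df['IPC'].apply(counter)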
Output
IPC cnt
0 A61K038/52;A61K038/21;A61K038 3
1 A61K038/46;C12N009/80 1