我有一个熊猫数据框,看起来像:
word | ss |
ABC | E |
DHS | H |
ABC | H |
ABC | C |
ABC | H |
SJA | C |
SJA | H |
但是我正在尝试做类似的事情:
word | H | E | C
ABC | 2 | 1 | 1
DHS | 1 | 0 | 0
SJA | 1 | 0 | 1
到目前为止我所做的:
helix = []
sheet = []
coil = []
for ss in big_ss_list: # this is the ss column in a list form.
counter = counter + 1
print ('position', counter, 'of', len(big_ss_list))
if ss == 'H':
sheet = sheet + [0]
helix = helix + [1]
coil = coil + [0]
elif ss == 'E':
sheet = sheet + [1]
helix = helix + [0]
coil = coil + [0]
elif ss == 'C':
sheet = sheet + [0]
helix = helix + [0]
coil = coil + [1]
fdf = pd.DataFrame(list(zip(big_gram_list, big_ss_list, helix, sheet, coil)), columns =['word', 'secondary_structure', 'helix', 'sheet', 'coil'])
fdf = fdf.iloc[(fdf.word).argsort()] # sort based on alphabetically order
fdf = fdf.groupby('word',as_index=False).sum()
但是此过程很长,因为大约有3,700万行,并且需要大量时间才能运行。
有更快的方法吗?