I have a CSV file that looks like this -
Words Author Frequency
#NAME? Pandey P 4
OF Hamzad Ali 135
OF Karen Sara 80
A Hamzad Ali 69
AND Hamzad Ali 67
OF Pandey P 67
HIV-1 Hamzad Ali 49
AND Karen Sara 45
IN Hamzad Ali 44
OF John christopher 44
IN John christopher 40
INHIBITORS Hamzad Ali 39
THE Karen Sara 39
INTEGRASE Hamzad Ali 38
VIRUS Karen Sara 38
C Karen Sara 35
THE Hamzad Ali 35
HEPATITIS Karen Sara 34
THE Pandey P 34
IN Karen Sara 33
KINASE Pandey P 31
THE John christopher 31
AND Pandey P 28
INHIBITOR Hamzad Ali 26
POLYMERASE Karen Sara 26
AND John christopher 25
IN Pandey P 25
TO Hamzad Ali 25
WITH Karen Sara 25
FOR Hamzad Ali 23
HCV Karen Sara 23
NS5B Karen Sara 23
HIV Hamzad Ali 22
NOVEL Hamzad Ali 22
WITH Hamzad Ali 22
A Karen Sara 21
OF Lieberman La 21
INHIBITOR Karen Sara 20
PROTEIN Pandey P 20
BY Hamzad Ali 19
INHIBITORS Karen Sara 19
OF Oslund Rc 19
OF Wyche Tp 19
VIRUS Hamzad Ali 19
HUMAN Hamzad Ali 18
OF Danilchanka O 18
OF Hett E 17
OF Sana Tr 17
A Wyche Tp 16
ACTIVITY Hamzad Ali 16
AND Roberts L 16
GENE John christopher 16
OF Fadeyi O 16
AND Sana Tr 15
OF Roberts L 15
RESISTANCE Hamzad Ali 15
REVERSE Hamzad Ali 15
TRANSCRIPTASE Hamzad Ali 15
ACID Hamzad Ali 14
ACTIVATION Pandey P 14
BY Pandey P 14
IN Lieberman La 14
PROTEASE Karen Sara 14
1 Hamzad Ali 13
ANTAGONISTS Hamzad Ali 13
CCR5 Hamzad Ali 13
EXPRESSION John christopher 13
FOR Karen Sara 13
HEPATITIS Hamzad Ali 13
IN White Ch 13
INFECTION Hamzad Ali 13
HEPATITIS John christopher
I want to merge all the redundant terms and split the counts by author. For example, I would like the output to look like this -
Words Pandey P Hamzad Ali Karen Saha John christopher ..
HEPATITIS 47 38 32 28 ..
INHIBITORS 0 34 22 5
KINASE 45 5 0 0 ..
HIV-1 40 35 11 25 ..
...
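I would also like this output with English stop words removed. I have no idea how to approach this in code. Any help would be greatly appreciated. Thanks.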
Answer 0 (score: 1)
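Steps involved:
1. Load the English stop words.
2. Make a dummy column that holds the words in lowercase.
3. Remove the records whose word is a stop word.
4. Group by Words and set the index to Words and Author.
5. Unstack each group and keep the Frequency column.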
I tried this:
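# Assumes df is the DataFrame read from the CSV, with columns Words, Author, Frequency
import nltk
from nltk.corpus import stopwords

# snippet to remove stop words
nltk.download('stopwords')  # one-time download of the NLTK stop-word corpus
stopwords_english = set(stopwords.words('english'))
df['dummy'] = df['Words'].str.lower()  # lowercase copy for case-insensitive matching
df = df[~df['dummy'].isin(stopwords_english)]
del df['dummy']

# snippet to get your desired result
df.groupby(['Words']).apply(lambda x: x.set_index(['Words', 'Author']).unstack()['Frequency'])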
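If the same (Words, Author) pair can appear on more than one row, another way to get the wide table is `pivot_table` with `aggfunc="sum"`, which merges duplicates and fills missing combinations with 0. The following is only a minimal sketch, not the answer's own method. It assumes the columns are named Words, Author and Frequency, uses a handful of rows taken from the question plus one made-up duplicate row to show the merging, and replaces the NLTK list with a small hypothetical `STOP_WORDS` set so that it runs without any download.

```python
import pandas as pd

# A few rows from the question, plus one made-up duplicate
# (HEPATITIS / Karen Sara / 4) to show that duplicates get summed.
df = pd.DataFrame(
    {
        "Words": ["OF", "HEPATITIS", "HEPATITIS", "HEPATITIS", "KINASE", "THE", "HIV-1"],
        "Author": ["Hamzad Ali", "Karen Sara", "Hamzad Ali", "Karen Sara",
                   "Pandey P", "Pandey P", "Hamzad Ali"],
        "Frequency": [135, 34, 13, 4, 31, 34, 49],
    }
)

# Hypothetical stand-in for set(stopwords.words('english')) from NLTK.
STOP_WORDS = {"a", "and", "in", "of", "the"}

# Drop rows whose word is a stop word (case-insensitive comparison).
is_stop_word = df["Words"].str.lower().isin(STOP_WORDS)
filtered = df[~is_stop_word]

# One row per word, one column per author, duplicates summed, gaps filled with 0.
table = filtered.pivot_table(
    index="Words",
    columns="Author",
    values="Frequency",
    aggfunc="sum",
    fill_value=0,
)

print(table)
```

With the real data, `STOP_WORDS` would be swapped for the NLTK set used in the answer, and `df` would come from `pd.read_csv` on the actual file.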