Combine redundant terms in a CSV file and add up their frequencies after removing stop words in Python

Time: 2018-05-22 23:00:46

Tags: python-3.x pandas nltk

I have a CSV file that looks like this:

    Words           Author              Frequency
    #NAME?          Pandey P            4
    OF              Hamzad Ali          135
    OF              Karen Sara          80
    A               Hamzad Ali          69
    AND             Hamzad Ali          67
    OF              Pandey P            67
    HIV-1           Hamzad Ali          49
    AND             Karen Sara          45
    IN              Hamzad Ali          44
    OF              John christopher    44
    IN              John christopher    40
    INHIBITORS      Hamzad Ali          39
    THE             Karen Sara          39
    INTEGRASE       Hamzad Ali          38
    VIRUS           Karen Sara          38
    C               Karen Sara          35
    THE             Hamzad Ali          35
    HEPATITIS       Karen Sara          34
    THE             Pandey P            34
    IN              Karen Sara          33
    KINASE          Pandey P            31
    THE             John christopher    31
    AND             Pandey P            28
    INHIBITOR       Hamzad Ali          26
    POLYMERASE      Karen Sara          26
    AND             John christopher    25
    IN              Pandey P            25
    TO              Hamzad Ali          25
    WITH            Karen Sara          25
    FOR             Hamzad Ali          23
    HCV             Karen Sara          23
    NS5B            Karen Sara          23
    HIV             Hamzad Ali          22
    NOVEL           Hamzad Ali          22
    WITH            Hamzad Ali          22
    A               Karen Sara          21
    OF              Lieberman La        21
    INHIBITOR       Karen Sara          20
    PROTEIN         Pandey P            20
    BY              Hamzad Ali          19
    INHIBITORS      Karen Sara          19
    OF              Oslund Rc           19
    OF              Wyche Tp            19
    VIRUS           Hamzad Ali          19
    HUMAN           Hamzad Ali          18
    OF              Danilchanka O       18
    OF              Hett E              17
    OF              Sana Tr             17
    A               Wyche Tp            16
    ACTIVITY        Hamzad Ali          16
    AND             Roberts L           16
    GENE            John christopher    16
    OF              Fadeyi O            16
    AND             Sana Tr             15
    OF              Roberts L           15
    RESISTANCE      Hamzad Ali          15
    REVERSE         Hamzad Ali          15
    TRANSCRIPTASE   Hamzad Ali          15
    ACID            Hamzad Ali          14
    ACTIVATION      Pandey P            14
    BY              Pandey P            14
    IN              Lieberman La        14
    PROTEASE        Karen Sara          14
    1               Hamzad Ali          13
    ANTAGONISTS     Hamzad Ali          13
    CCR5            Hamzad Ali          13
    EXPRESSION      John christopher    13
    FOR             Karen Sara          13
    HEPATITIS       Hamzad Ali          13
    IN              White Ch            13
    INFECTION       Hamzad Ali          13
    HEPATITIS       John christopher    

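I want to merge all the redundant terms and keep the counts separate for each author. For example, I would like the output to look like this: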

    Words          Pandey P    Hamzad Ali    Karen Sara     John christopher  ..
    HEPATITIS      47          38            32              28               ..      
    INHIBITORS     0           34            22               5       
    KINASE         45          5             0                0                ..
    HIV-1          40          35            11               25               ..      
    ... 

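I would also like this output with English stop words removed. I don't know how to approach this from a coding standpoint. Any help would be greatly appreciated. Thanks.

1 Answer:

Answer 0: (score: 1)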

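Steps involved:

1. Load the English stop words.
2. Create a dummy column that holds the words in lowercase.
3. Drop the rows whose word is a stop word.
4. Group by Words and set the index to Words and Author.
5. Unstack each group and keep the Frequency values.

I tried this: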

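    import pandas as pd
    from nltk.corpus import stopwords  # requires a one-time nltk.download('stopwords')

    # Snippet to remove stop words; df is the DataFrame read from the CSV file
    stopwords_english = set(stopwords.words('english'))
    df['dummy'] = df['Words'].str.lower()          # lowercase copy for matching
    df = df[~df['dummy'].isin(stopwords_english)]  # keep only non-stop words
    del df['dummy']

    # Snippet to get the desired result: one column per author
    df.groupby(['Words']).apply(lambda x: x.set_index(['Words', 'Author']).unstack()['Frequency'])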
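
As an alternative to the groupby/unstack step, the same word-by-author table can probably be built with `pivot_table`, which also sums repeated (word, author) pairs and fills missing combinations with 0. The following is only a minimal sketch: the sample rows are a hypothetical subset of the question's data (the second `HEPATITIS` / `Karen Sara` row is invented to show the merging), and the small hand-written `stop_words` set is an assumption standing in for `stopwords.words('english')` so the example runs without the NLTK corpus.

```python
import pandas as pd

# Hypothetical sample in the same layout as the question's CSV.
df = pd.DataFrame(
    {
        "Words": ["OF", "HEPATITIS", "HEPATITIS", "KINASE", "THE", "HEPATITIS"],
        "Author": ["Hamzad Ali", "Karen Sara", "Hamzad Ali",
                   "Pandey P", "Karen Sara", "Karen Sara"],
        "Frequency": [135, 34, 13, 31, 39, 5],
    }
)

# Assumption: a tiny stand-in for nltk's English stop-word list.
stop_words = {"of", "the", "a", "and", "in"}

# Drop stop words, comparing case-insensitively.
filtered = df[~df["Words"].str.lower().isin(stop_words)]

# One row per word, one column per author.
# Repeated (word, author) pairs are summed; missing pairs become 0.
result = filtered.pivot_table(
    index="Words",
    columns="Author",
    values="Frequency",
    aggfunc="sum",
    fill_value=0,
)
print(result)
```

With the real file, `df` would come from `pd.read_csv(...)` and `stop_words` from NLTK as in the snippet above; `result.reset_index()` turns `Words` back into an ordinary column if that layout is preferred.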