Pandas DataFrame: count the unique words in a column and return the count in another column

Time: 2018-06-12 15:47:15

Tags: python pandas dataframe text

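I have a dataframe with the following columns:

  1. df['Album'] (contains the album names of artistX)
  2. df['Tracks'] (contains the tracks on artistX's albums)
  3. df['Lyrics'] (contains the lyrics of the songs)

I am trying to count the number of words in df['Lyrics'] and return it as a new column named df['wordcount'], and to count the number of unique words in df['Lyrics'] and return it as a new column named df['uniquewordcount'].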

I obtained df['wordcount'] by counting every string in df['Lyrics'], excluding the whitespace.

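    # count every run of non-whitespace characters, i.e. every word in a track
    # (r'\S+' matches whole words; '[^\s]' would count single characters)
    totalscore = df.Lyrics.str.count(r'\S+')
    df['wordcount'] = totalscore
    df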

I have been able to count the unique words in df['Lyrics']:

    import collections
    from collections import Counter
    
    results = Counter()
    count_unique = df.Lyrics.str.lower().str.split().apply(results.update)
    unique_counts = sum((results).values())
    df['uniquewordcount'] = unique_counts
    

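This gives me a count of all the unique words across the whole of df['Lyrics'], which is what the code was written to do, but I want the unique words in each track's lyrics. My Python isn't very good at the moment, so the solution may be obvious to everyone but me. I'm hoping someone can point out how to get a count of the unique words for each track.

Expected output: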

    Album    Tracks    Lyrics                            wordcount  uniquewordcount
    A        Ball      Ball is life and Ball is key      7          5
             Pass      Pass me the hookah Pass me the    7          4
    

What I got:

    Album    Tracks    Lyrics                            wordcount  uniquewordcount
    A        Ball      Ball is life and Ball is key      7          9
             Pass      Pass me the hookah Pass me the    7          9
    

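2 Answers:

Answer 0 (score: 2)

Using only the standard library, you can indeed use collections.Counter. However, nltk is preferable, because there are many edge cases you may care about, such as handling punctuation, plurals, and so on.

Below is a step-by-step guide using Counter. Note that we go further than necessary here, because we also compute the count of each individual word. That per-word data, held in the df['LyricsCounter'] dictionaries, is discarded when we drop that column at the end.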

import pandas as pd
from collections import Counter

df = pd.DataFrame({'Lyrics': ['This is some life some collection of words',
                              'Lyrics abound lyrics here there eveywhere',
                              'Come fly come fly away']})

# convert to lowercase, split to list
df['LyricsList'] = df['Lyrics'].str.lower().str.split()

# for each set of lyrics, create a Counter dictionary
df['LyricsCounter'] = df['LyricsList'].apply(Counter)

# calculate length of list
df['LyricsWords'] = df['LyricsList'].apply(len)

# calculate number of Counter items for each set of lyrics
df['LyricsUniqueWords'] = df['LyricsCounter'].apply(len)

res = df.drop(['LyricsList', 'LyricsCounter'], axis=1)

print(res)

                                       Lyrics  LyricsWords  LyricsUniqueWords
0  This is some life some collection of words            8                  7
1   Lyrics abound lyrics here there eveywhere            6                  5
2                      Come fly come fly away            5                  3
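
As a rough illustration of the punctuation edge case mentioned above, here is a minimal sketch that tokenizes with a regular expression from the standard library instead of nltk. The helper name `word_counts` and the token pattern `[a-z']+` are assumptions made for this example, not part of the answer; the pattern only covers ASCII letters and apostrophes and does nothing about plurals.

```python
import re
from collections import Counter

# Assumed token pattern: a word is a run of ASCII letters and apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z']+")

def word_counts(lyrics):
    """Hypothetical helper: return (total words, unique words) for one lyrics string."""
    tokens = TOKEN_PATTERN.findall(lyrics.lower())
    return len(tokens), len(Counter(tokens))

# A plain str.split() would treat 'fly,' 'fly' and 'fly!' as three different words.
print(word_counts("Fly, fly away. Fly!"))  # (4, 2)
```

With pandas, the helper could be applied per row with `df['Lyrics'].apply(word_counts)`, which yields a Series of tuples.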

Answer 1 (score: 2)

Here is another solution:

import pandas as pd

df = pd.DataFrame({'Lyrics': ['This is some life some collection of words',
                              'Lyrics abound lyrics here there eveywhere',
                              'Come fly come fly away']})

# Lowercase each set of lyrics and split it into a list of words (a new Series)
lyrics = df['Lyrics'].str.lower().str.split()

# Get the number of unique words per row
df['LyricsCounter'] = lyrics.apply(set).apply(len)

# Get the total number of words per row
df['LyricsWords'] = lyrics.apply(len)

print(df)

Returns:

                                       Lyrics  LyricsCounter  LyricsWords
0  This is some life some collection of words              7            8
1   Lyrics abound lyrics here there eveywhere              5            6
2                      Come fly come fly away              3            5
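
Applied to the column names from the question, the same idea can be sketched as follows. The two sample rows are taken from the question's expected output; filling in 'A' as the album for the second row and the intermediate name `words` are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Album': ['A', 'A'],  # assumed: the blank album cell repeats 'A'
    'Tracks': ['Ball', 'Pass'],
    'Lyrics': ['Ball is life and Ball is key',
               'Pass me the hookah Pass me the'],
})

# Lowercase and split each track's lyrics into its own list of words
words = df['Lyrics'].str.lower().str.split()

# Total number of words per track
df['wordcount'] = words.apply(len)

# Unique words per track: every row gets its own set,
# rather than one Counter shared across all rows
df['uniquewordcount'] = words.apply(lambda w: len(set(w)))

print(df[['Tracks', 'wordcount', 'uniquewordcount']])
# wordcount is 7 and 7; uniquewordcount is 5 and 4
```

The attempt in the question produced the same number on every row because `results` was a single Counter updated by all rows, and assigning one scalar to a column broadcasts it to every row.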