Question

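I have a DataFrame with the following columns:

df['Album'] (the names of artistX's albums)
df['Tracks'] (the tracks on artistX's albums)
df['Lyrics'] (the lyrics of each song)

I am trying to count the number of words in df['Lyrics'] and return it as a new column named df['wordcount'], and to count the number of unique words in df['Lyrics'] and return it as a new column named df['uniquewordcount'].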

I get df['wordcount'] by counting the whitespace-separated strings in each entry of df['Lyrics'].

# count every word (run of non-whitespace characters) in a track
totalscore = df.Lyrics.str.count(r'[^\s]+')
df['wordcount'] = totalscore
df

I have also been able to count the unique words in df['Lyrics']:

from collections import Counter

# tally the words from every track into a single Counter
results = Counter()
count_unique = df.Lyrics.str.lower().str.split().apply(results.update)
unique_counts = len(results)  # number of distinct words in the Counter
df['uniquewordcount'] = unique_counts

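This gives me the count of all unique words in df['Lyrics'], which is what the code is meant to do, but what I want is the number of unique words in the lyrics of each track. My Python isn't very good at the moment, so the solution may be obvious to everyone but me. I'm hoping someone can point out how to get the unique word count for each track.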

Expected output:

Album    Tracks    Lyrics                            wordcount  uniquewordcount
A        Ball      Ball is life and Ball is key      7          5
         Pass      Pass me the hookah Pass me the    7          4

What I get:

Album    Tracks    Lyrics                            wordcount  uniquewordcount
A        Ball      Ball is life and Ball is key      7          9
         Pass      Pass me the hookah Pass me the    7          9

Answer 1

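Using only the standard library, you can indeed use collections.Counter. However, NLTK is preferable, since there are many edge cases that may matter to you, such as handling punctuation, plurals, and so on.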
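To make the punctuation edge case concrete, here is a minimal standard-library sketch (it does not use NLTK). The helper names tokenize, word_count and unique_word_count are hypothetical, and the rule that a "word" is a run of ASCII letters or apostrophes is an assumption that may not suit every set of lyrics.

```python
import re

# Hypothetical helpers, not part of the original answer.
# Assumption: a "word" is a run of ASCII letters or apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and return its words with punctuation stripped."""
    return WORD_RE.findall(text.lower())


def word_count(text):
    """Total number of words in the text."""
    return len(tokenize(text))


def unique_word_count(text):
    """Number of distinct words in the text."""
    return len(set(tokenize(text)))


line = "Come fly, come fly away!"

# A plain split keeps punctuation attached, so "fly," and "fly" differ.
naive_unique = len(set(line.lower().split()))

# The regex-based tokenizer ignores the comma and the exclamation mark.
clean_unique = unique_word_count(line)
```

With a DataFrame like the one in the question, such a function could be applied per row, for example df['Lyrics'].apply(unique_word_count). Plurals ("word" vs. "words") would still count as different words here; that is the kind of case where a library such as NLTK is likely the better tool.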

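Here is a step-by-step guide using Counter. Note that we go further than strictly necessary, because we also compute the count of each individual word. That data, held in the dictionaries in df['LyricsCounter'], is discarded when we drop that column at the end.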

from collections import Counter

import pandas as pd

df = pd.DataFrame({'Lyrics': ['This is some life some collection of words',
                              'Lyrics abound lyrics here there eveywhere',
                              'Come fly come fly away']})

# convert to lowercase, split to list
df['LyricsList'] = df['Lyrics'].str.lower().str.split()

# for each set of lyrics, create a Counter dictionary
df['LyricsCounter'] = df['LyricsList'].apply(Counter)

# calculate length of list
df['LyricsWords'] = df['LyricsList'].apply(len)

# calculate number of Counter items for each set of lyrics
df['LyricsUniqueWords'] = df['LyricsCounter'].apply(len)

res = df.drop(['LyricsList', 'LyricsCounter'], axis=1)

print(res)

                                       Lyrics  LyricsWords  LyricsUniqueWords
0  This is some life some collection of words            8                  7
1   Lyrics abound lyrics here there eveywhere            6                  5
2                      Come fly come fly away            5                  3
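For contrast with the attempt in the question, the sketch below uses plain Python lists (no pandas) and the two example lyrics from the question. It illustrates the difference between one Counter shared by all tracks and a fresh Counter per track. The variable names are only illustrative.

```python
from collections import Counter

# The two example lyrics from the question.
lyrics = [
    "Ball is life and Ball is key",
    "Pass me the hookah Pass me the",
]

# One shared Counter: words from every track accumulate in the same
# object, so its length is the distinct-word count of the whole column.
shared = Counter()
for text in lyrics:
    shared.update(text.lower().split())
corpus_unique = len(shared)

# A fresh Counter per track: each length is that track's own count.
per_track_unique = [len(Counter(text.lower().split())) for text in lyrics]
```

Here corpus_unique is 9, the number that appeared on every row in the question, while per_track_unique is [5, 4], matching the expected output.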

Answer 2

Here is another solution:

import pandas as pd

df = pd.DataFrame({'Lyrics': ['This is some life some collection of words',
                              'Lyrics abound lyrics here there eveywhere',
                              'Come fly come fly away']})

# Lowercase each lyric and split it into a list of words (a new Series)
lyrics = df['Lyrics'].str.lower().str.split()

# Number of unique words per row
df['LyricsCounter'] = lyrics.apply(set).apply(len)

# Total number of words per row
df['LyricsWords'] = lyrics.apply(len)

print(df)

This returns:

                                       Lyrics  LyricsCounter  LyricsWords
0  This is some life some collection of words              7            8
1   Lyrics abound lyrics here there eveywhere              5            6
2                      Come fly come fly away              3            5

Pandas DataFrame: count the unique words in a column and return them in another column
