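I have a DataFrame that contains the following columns.
I am trying to count the number of words in df['Lyrics'] and store the result in a new column named df['wordcount'], and to count the number of unique words in df['Lyrics'] and store that in a new column named df['uniquewordcount'].
I get df['wordcount'] by counting everything in each df['Lyrics'] string except the whitespace.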
totalscore = df['Lyrics'].str.count(r'\S+')  # count every word in a track (each run of non-whitespace characters)
df['wordcount'] = totalscore
df
I have already been able to count the unique words in df['Lyrics']:
from collections import Counter
results = Counter()
# update() returns None, so apply() is used here only for its side effect on results
df.Lyrics.str.lower().str.split().apply(results.update)
# number of distinct words across all tracks combined
unique_counts = len(results)
df['uniquewordcount'] = unique_counts
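This gives me the count of all unique words across df['Lyrics'], which is what the code was written to do, but I want the unique words in each track's lyrics. My Python isn't great at the moment, so the solution may be obvious to everyone but me. I'm hoping someone can point out how to get the count of unique words for each track.
Expected output: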
Album  Tracks  Lyrics                          wordcount  uniquewordcount
A      Ball    Ball is life and Ball is key    7          5
       Pass    Pass me the hookah Pass me the  7          4
What I got:
Album  Tracks  Lyrics                          wordcount  uniquewordcount
A      Ball    Ball is life and Ball is key    7          9
       Pass    Pass me the hookah Pass me the  7          9
Answer 0 (score: 2)
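Using only the standard library, you can indeed use collections.Counter. However, nltk is preferable, since there are many edge cases that may interest you, such as handling punctuation, plurals, and so on.
Here is a step-by-step guide using Counter. Note that we go further than required here, since we also compute the count of each word. That data, held in the Counter dictionaries, is discarded when we drop the df['LyricsCounter'] column.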
import pandas as pd
from collections import Counter
df = pd.DataFrame({'Lyrics': ['This is some life some collection of words',
                              'Lyrics abound lyrics here there eveywhere',
                              'Come fly come fly away']})
# convert to lowercase, split to list
df['LyricsList'] = df['Lyrics'].str.lower().str.split()
# for each set of lyrics, create a Counter dictionary
df['LyricsCounter'] = df['LyricsList'].apply(Counter)
# calculate length of list
df['LyricsWords'] = df['LyricsList'].apply(len)
# calculate number of Counter items for each set of lyrics
df['LyricsUniqueWords'] = df['LyricsCounter'].apply(len)
res = df.drop(['LyricsList', 'LyricsCounter'], axis=1)
print(res)
   Lyrics                                      LyricsWords  LyricsUniqueWords
0  This is some life some collection of words  8            7
1  Lyrics abound lyrics here there eveywhere   6            5
2  Come fly come fly away                      5            3
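The punctuation edge case mentioned above can be illustrated without nltk. The following is only a minimal sketch that uses the standard library's re module in place of a real tokenizer; the helper name count_words and the sample sentence are made up for illustration, and the pattern is an assumption about what should count as a word.

```python
import re

def count_words(text):
    """Return (total words, unique words) for one lyric string."""
    # Lowercase first, then keep runs of letters, digits and apostrophes,
    # so that "Ball," and "ball" are treated as the same word.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return len(words), len(set(words))

# A plain str.split() would keep "life," and "key!" with their punctuation attached.
print(count_words("Ball is life, and ball is key!"))
```

A function like this could be passed to df['Lyrics'].apply to obtain both counts per row. It does nothing about plurals or other normalization, which is where a library such as nltk would be the better tool.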
Answer 1 (score: 2)
Here is another solution:
import pandas as pd
df = pd.DataFrame({'Lyrics': ['This is some life some collection of words',
                              'Lyrics abound lyrics here there eveywhere',
                              'Come fly come fly away']})
# Lowercase each lyric and split it into a list of words (a new Series of lists)
lyrics = df['Lyrics'].str.lower().str.split()
# Number of unique words per row: convert each list to a set, then take its length
df['LyricsCounter'] = lyrics.apply(set).apply(len)
# Total number of words per row
df['LyricsWords'] = lyrics.apply(len)
print(df)
Returns:
   Lyrics                                      LyricsCounter  LyricsWords
0  This is some life some collection of words  7              8
1  Lyrics abound lyrics here there eveywhere   5              6
2  Come fly come fly away                      3              5
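For reference, the same set-based idea can be applied to the two tracks from the question, using the column names the question asks for. This is a sketch under assumptions: the DataFrame is rebuilt by hand from the tables in the question, and the second track is assumed to belong to album 'A' as well, since the question leaves that cell blank.

```python
import pandas as pd

# Rebuilt by hand from the question; the second 'Album' value is an assumption.
df = pd.DataFrame({
    'Album': ['A', 'A'],
    'Tracks': ['Ball', 'Pass'],
    'Lyrics': ['Ball is life and Ball is key',
               'Pass me the hookah Pass me the'],
})

# One list of lowercase words per track
words = df['Lyrics'].str.lower().str.split()

# Total words per track: length of each list
df['wordcount'] = words.str.len()

# Unique words per track: length of each list after converting it to a set
df['uniquewordcount'] = words.apply(lambda w: len(set(w)))

print(df[['Tracks', 'wordcount', 'uniquewordcount']])
```

Because each count is computed from that row's own word list, the values differ per track (7 and 5 for Ball, 7 and 4 for Pass) rather than repeating one global number on every row.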