I have a huge conversation text file (one block of text), and I want to extract the repeated phrases (multiple words) into another text file, sorted by frequency.
Input: a text block, on a single line, word-wrapped
Output:
I don't know 7345
I want you to 5312
amazing experience 625
I'm looking for a Python script.
I have already tried the script below, but it only gives me single words, sorted from most to least frequent.
from IPython import get_ipython
ipy = get_ipython()
if ipy is not None:
    ipy.run_line_magic('matplotlib', 'inline')
import collections
import pandas as pd
import matplotlib.pyplot as plt
# Read input file; note the encoding is specified here.
# It may be different in your text file.
file = open('test2.txt', encoding="utf8")
a = file.read()
# Stopwords
stopwords = set(line.strip() for line in open('stopwords.txt'))
stopwords = stopwords.union(set(['mr', 'mrs', 'one', 'two', 'said']))
# Instantiate a dictionary, and for every word in the file,
# add it to the dictionary if it doesn't exist. If it does, increase the count.
wordcount = {}
# To eliminate duplicates, remember to strip punctuation and lowercase the text.
for word in a.lower().split():
    word = word.replace(".", "")
    word = word.replace(",", "")
    word = word.replace(":", "")
    word = word.replace("\"", "")
    word = word.replace("!", "")
    word = word.replace("“", "")
    word = word.replace("‘", "")
    word = word.replace("*", "")
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
# Print the most common words
n_print = int(input("How many most common words to print: "))
print("\nOK. The {} most common words are as follows\n".format(n_print))
word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(n_print):
    print(word, ": ", count)
# Close the file
file.close()
# Create a data frame of the most common words
# and draw a bar chart
lst = word_counter.most_common(n_print)
df = pd.DataFrame(lst, columns=['Word', 'Count'])
df.plot.bar(x='Word', y='Count')
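As a small aside, the chain of replace() calls in the script above can be collapsed into a single pass with str.translate; a minimal sketch, using a made-up sample sentence:

```python
# Build a translation table that deletes the listed punctuation characters
table = str.maketrans('', '', '.,:"!“‘*')

text = 'He said: "one, two!" He said.'
words = [w.translate(table) for w in text.lower().split()]
print(words)  # → ['he', 'said', 'one', 'two', 'he', 'said']
```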
Answer 0 (score: 0)
I think you can use nltk.ngrams from the nltk package.
import nltk

text = 'I have been I have I like this I have never been.'
ngrams = tuple(nltk.ngrams(text.split(' '), n=2))
ngrams_count = {i: ngrams.count(i) for i in ngrams}
Output:
{('I', 'have'): 3, ('have', 'been'): 1, ('been', 'I'): 1,
('have', 'I'): 1, ('I', 'like'): 1, ('like', 'this'): 1,
('this', 'I'): 1, ('have', 'never'): 1, ('never', 'been.'): 1}
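A side note on performance: building the dictionary with ngrams.count(i) rescans the whole tuple once per n-gram, which is quadratic. collections.Counter produces the same counts in a single pass, and the sliding window itself needs nothing beyond the standard library; a minimal sketch (the zip-based helper here is a stand-in for nltk.ngrams):

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of width n across the token list (same idea as nltk.ngrams)
    return zip(*(tokens[i:] for i in range(n)))

text = 'I have been I have I like this I have never been.'
counts = Counter(ngrams(text.split(' '), 2))
print(counts.most_common(1))  # → [(('I', 'have'), 3)]
```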
Then you can save it with pandas / txt / json etc. You can change n in nltk.ngrams and your n-grams will have a different length.
This can be modified to handle several lengths at once:
import nltk
import pandas as pd

text = 'I have been I have I like this I have never been.'
lengths = [2, 3, 4]
ngrams_count = {}
for n in lengths:
    ngrams = tuple(nltk.ngrams(text.split(' '), n=n))
    ngrams_count.update({' '.join(i): ngrams.count(i) for i in ngrams})
df = pd.DataFrame(list(zip(ngrams_count, ngrams_count.values())),
                  columns=['Ngramm', 'Count']).sort_values(['Count'],
                                                           ascending=False)
Output:
Ngramm Count
0 I have 3
1 have been 1
26 this I have never 1
25 like this I have 1
24 I like this I 1
23 have I like this 1
22 I have I like 1
21 been I have I 1
20 have been I have 1
19 I have been I 1
18 have never been. 1
17 I have never 1
...
Now we can supply the lengths n and get a sorted data frame. If you need, you can save it with df.to_csv('file_name.csv'), or take df.head(10) first.
To use this solution you should install nltk and pandas.
Answer 1 (score: 0)
You can just use str.count() to count phrases in the string:
s = 'vash the vash the are you is he where did where did'
print('where did: {}'.format(s.count('where did')))
print('vash the: {}'.format(s.count('vash the')))
Output:
where did: 2
vash the: 2
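One caveat with str.count(): it matches raw substrings, so a phrase can also be counted inside a longer word (e.g. 'the' inside 'there'). A word-boundary regex avoids that; a small sketch with a made-up sentence:

```python
import re

s = 'vash the vash there are you'
print(s.count('the'))                  # → 2, substring match also hits 'there'
print(len(re.findall(r'\bthe\b', s)))  # → 1, whole-word matches only
```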