I need to count all the words in a file and build a histogram of them. I am using the following Python code.
for word in re.split('[,. ]', f2.read()):
    if word not in histogram:
        histogram[word] = 1
    else:
        histogram[word] += 1
f2 is the file I am reading. I am trying to parse the file with multiple delimiters, but it still doesn't work: it counts every string in the file and builds a histogram of those, whereas I only want words. I get results like:
1-1-3: 3
where "1-1-3" is a string that occurs 3 times. How can I check so that only actual words get counted? Case doesn't matter. I also need to do the same thing but for two-word sequences, so the output would look like:
and the: 4
where "and the" is a two-word sequence that occurs 4 times. How can I group two-word sequences together for counting?
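Both requirements (keep only alphabetic tokens, ignore case, and count two-word sequences) can be sketched with the standard library alone. This is a minimal sketch, not the question's exact code; `text` stands in for `f2.read()` and the sample string is made up:

```python
import re
from collections import Counter

text = "The quick brown fox, the quick dog. 1-1-3 1-1-3 1-1-3"

# Keep only purely alphabetic tokens, lowercased (case doesn't matter);
# '+' in the pattern collapses runs of delimiters so no empty tokens appear
words = [w for w in re.split(r'[,. ]+', text.lower()) if w.isalpha()]

histogram = Counter(words)

# Two-word sequences: pair each word with its successor
pair_histogram = Counter(' '.join(p) for p in zip(words, words[1:]))
```

Strings like "1-1-3" fail `isalpha()` and are dropped before counting, which addresses the first half of the question.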
Answer 0 (score: 1)
>>> from collections import Counter
>>> from nltk.tokenize import RegexpTokenizer
>>> from nltk import bigrams
>>> from string import punctuation
# preparatory stuff
>>> tokenizer = RegexpTokenizer(r'[^\W\d]+')
>>> my_string = "this is my input string. 12345 1-2-3-4-5. this is my input"
# single words
>>> tokens = tokenizer.tokenize(my_string)
>>> Counter(tokens)
Counter({'this': 2, 'input': 2, 'is': 2, 'my': 2, 'string': 1})
# word pairs
>>> nltk_bigrams = bigrams(my_string.split())
>>> bigrams_list = [' '.join(x).strip(punctuation) for x in list(nltk_bigrams)]
>>> Counter([x for x in bigrams_list if x.replace(' ','').isalpha()])
Counter({'is my': 2, 'this is': 2, 'my input': 2, 'input string': 1})
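If nltk is not available, the same pair counting can be sketched with the standard library; the regex below mirrors the `RegexpTokenizer` pattern above. Note this variant pairs tokens across the sentence boundary (it counts "string this"), which the answer's punctuation filter excludes:

```python
import re
from collections import Counter

my_string = "this is my input string. 12345 1-2-3-4-5. this is my input"

# Same idea as RegexpTokenizer(r'[^\W\d]+'): runs of word characters
# that are not digits, so "12345" and "1-2-3-4-5" yield no tokens
tokens = re.findall(r'[^\W\d]+', my_string)

# Pair each token with its successor to form bigrams
pair_counts = Counter(' '.join(p) for p in zip(tokens, tokens[1:]))
```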
Answer 1 (score: -1)
Assuming you want to count all the words in a string, you can use a defaultdict as a counter and do something like this:
#!/usr/bin/env python3
# coding: utf-8

from collections import defaultdict

# For the sake of simplicity we are using a string instead of a read file
sentence = "The quick brown fox jumps over the lazy dog. THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG. The quick brown fox"

# Specify the word pairs you want to count as a single phrase
special_pairs = [('the', 'quick')]

# Convert the input to lowercase to neglect case sensitivity, and print it to double-check
sentence = sentence.lower()
print(sentence)

# Split the string into single words
word_list = sentence.split(' ')
print(word_list)

# Correct word_list so that each special pair is counted as a single phrase
# rather than two single words. Scan a snapshot of the list so that the
# removals below cannot shift indices mid-iteration or run past the end.
for pair in special_pairs:
    snapshot = list(word_list)
    for index in range(len(snapshot) - 1):
        if pair[0] == snapshot[index] and pair[1] == snapshot[index + 1]:
            word_list.remove(pair[0])
            word_list.remove(pair[1])
            word_list.append(' '.join(pair))

d = defaultdict(int)
for word in word_list:
    d[word] += 1

print(d.items())
Output:
the quick brown fox jumps over the lazy dog. the quick brown fox jumps over the lazy dog. the quick brown fox
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.', 'the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.', 'the', 'quick', 'brown', 'fox']
dict_items([('lazy', 2), ('dog.', 2), ('fox', 3), ('brown', 3), ('jumps', 2), ('the quick', 3), ('the', 2), ('over', 2)])
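As a side note, the defaultdict counting step can be replaced by `collections.Counter`, which does the same bookkeeping in one call and adds conveniences like `most_common`. A minimal sketch using the lowercased word list from the output above:

```python
from collections import Counter

# The (already merged and lowercased) tokens to count
word_list = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.',
             'the', 'quick', 'brown', 'fox']

# Counter replaces the defaultdict(int) loop in one call
counts = Counter(word_list)
```

Note that "dog." still carries its trailing period here, just as in the answer's output; filtering non-alphabetic characters is a separate step.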