I am trying to write a function that counts the occurrences of a (whole) word in a text, case-insensitively.
Example:
>>> text = """Antoine is my name and I like python.
Oh ! your name is antoine? And you like Python!
Yes is is true, I like PYTHON
and his name__ is John O'connor"""
assert( 2 == Occs("Antoine", text) )
assert( 2 == Occs("ANTOINE", text) )
assert( 0 == Occs("antoin", text) )
assert( 1 == Occs("true", text) )
assert( 0 == Occs("connor", text) )
assert( 1 == Occs("you like Python", text) )
assert( 1 == Occs("Name", text) )
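Here is a basic attempt:
def Occs(word, text):
    return text.lower().count(word.lower())
This does not work because it matches substrings rather than whole words. The function must also be fast, because the text can be very large.
Should I split the text into an array? Is there a simple way to write this function?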
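To see concretely why the basic attempt fails, the short sketch below runs the same logic on a small sample. The name `naive_occs` and the sample string are illustrative only and do not come from the original post.

```python
def naive_occs(word, text):
    # Same logic as the basic attempt: case-insensitive substring count.
    return text.lower().count(word.lower())

sample = "Antoine is my name. Oh! your name is antoine? John O'connor"

print(naive_occs("Antoine", sample))  # whole-word matches are found
print(naive_occs("antoin", sample))   # also nonzero: matches inside "Antoine"
print(naive_occs("connor", sample))   # also nonzero: matches inside "O'connor"
```

Any approach that is meant to pass the asserts above has to check what surrounds each match, not just the match itself.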
Edit (Python 2.3.4)
Answer 0: (score: 7)
from collections import Counter
import re
Counter(re.findall(r"\w+", text))
Or, for a case-insensitive version:
Counter(w.lower() for w in re.findall(r"\w+", text))
In Python < 2.7, use defaultdict (available since Python 2.5) instead of Counter:
from collections import defaultdict

freq = defaultdict(int)
for w in re.findall(r"\w+", text):
    freq[w.lower()] += 1
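As a rough illustration of how the frequency table might be used for lookups, here is a minimal sketch for Python 2.7 or later. The helper name `word_frequencies` and the sample string are assumptions made for this example.

```python
import re
from collections import Counter

def word_frequencies(text):
    # Tokenize on runs of word characters and fold case before counting.
    return Counter(w.lower() for w in re.findall(r"\w+", text))

sample = "Antoine is my name. Oh! your name is antoine? I like PYTHON, John O'connor"
freq = word_frequencies(sample)

print(freq["antoine"])  # counts both "Antoine" and "antoine"
print(freq["antoin"])   # a missing key in a Counter yields 0
print(freq["connor"])   # "O'connor" is split at the apostrophe into "o" and "connor"
```

Building the table takes one pass over the text, and each later lookup is a dictionary access, which suits many queries against one large text. Two limits are worth noting: a multi-word phrase such as "you like Python" cannot be looked up this way, and because `\w+` splits at apostrophes, "connor" is counted once here even though the question expects 0.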
Answer 1: (score: 2)
Here is a non-Pythonic way; I assume this is a homework question...
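def count(word, text):
    result = 0
    text = text.lower()
    word = word.lower()
    index = text.find(word, 0)
    while index >= 0:
        result += 1
        # Resume just past the previous hit; searching from `index` again
        # would find the same match forever.
        index = text.find(word, index + 1)
    return result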
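Of course, for very large files this will be slow, mainly because of the text.lower() call. But you can always come up with a case-insensitive find and fix that!
Why did I do it this way? Because I think it captures what you most want to do: walk through text and count how many times you find word in it.
Besides, this approach avoids some nasty problems with punctuation: split would leave the punctuation attached to the words and then you would fail to match, wouldn't you?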
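One possible way to get a case-insensitive find without first copying the whole text through text.lower() is to let the re module ignore case. The sketch below is an assumption about how that could look, not the answer author's code, and `count_ci` is a hypothetical name. It still counts substrings rather than whole words, and it counts overlapping occurrences separately.

```python
import re

def count_ci(word, text):
    # re.escape makes the word match literally; the lookahead keeps each match
    # zero-width, so overlapping occurrences are each counted.
    pattern = re.compile("(?=" + re.escape(word) + ")", re.IGNORECASE)
    return sum(1 for _ in pattern.finditer(text))

print(count_ci("python", "I like python. You like Python! PYTHON"))
print(count_ci("aa", "AAA"))
```

Whether this is actually faster than lowercasing first depends on the input, so it is worth timing on representative text before relying on it.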
Answer 2: (score: 1)
Thanks for your help. Here is my solution:
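import re

# Look-arounds: no letter next to the match, and no single apostrophe
# (a doubled apostrophe '' is allowed).
starte = "(?<![a-z])((?<!')|(?<=''))"
ende = "(?![a-z])((?!')|(?=''))"

def NumberOfOccurencesOfWordInText(word, text):
    """Returns the nb. of occurrences of whole word(s) (case insensitive) in a text"""
    # Only add a boundary on a side where the word starts/ends with a letter.
    # re.escape keeps regex metacharacters in the word from being interpreted.
    pattern = (re.match('[a-z]', word, re.I) is not None) * starte \
        + re.escape(word) \
        + (re.match('[a-z]', word[-1], re.I) is not None) * ende
    return len(re.findall(pattern, text, re.IGNORECASE))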
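For comparison, here is a simplified sketch of the same look-around idea, checked against the sample text from the question. It is not the author's exact pattern: it treats any word character or apostrophe as part of a word and ignores the doubled-apostrophe case, and the name `count_whole` is hypothetical.

```python
import re

def count_whole(word, text):
    # Reject matches that touch a letter, digit, underscore or apostrophe
    # on either side; re.escape makes the word match literally.
    pattern = r"(?<![\w'])" + re.escape(word) + r"(?![\w'])"
    return len(re.findall(pattern, text, re.IGNORECASE))

text = """Antoine is my name and I like python.
Oh ! your name is antoine? And you like Python!
Yes is is true, I like PYTHON
and his name__ is John O'connor"""

for query in ("Antoine", "ANTOINE", "antoin", "true", "connor", "you like Python"):
    print(query, count_whole(query, text))
```

Because the whole search is a single regular-expression scan, multi-word phrases work the same way as single words.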
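Answer 3: (score: 0)
See this question.
One realization is that, if your file is line-oriented, reading it line by line and using plain split() on each line will not be very expensive. This of course assumes that words do not span line breaks (no hyphenation).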
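A minimal sketch of the line-by-line idea, assuming single-word queries and that punctuation only needs stripping at the edges of each token. The name `count_in_lines` is hypothetical, and io.StringIO stands in for an open file.

```python
import io
import string

def count_in_lines(word, lines):
    # `lines` is any iterable of strings, e.g. an open file object,
    # so only one line is held in memory at a time.
    target = word.lower()
    total = 0
    for line in lines:
        for token in line.lower().split():
            # Strip leading/trailing punctuation so "antoine?" matches "antoine".
            if token.strip(string.punctuation) == target:
                total += 1
    return total

sample = io.StringIO("Antoine is my name.\nOh ! your name is antoine?\nJohn O'connor\n")
print(count_in_lines("Antoine", sample))
```

Since the apostrophe inside "O'connor" is not at the edge of the token, a query for "connor" does not match it. Multi-word phrases are out of reach for this approach because each token is compared on its own.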
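Answer 4: (score: 0)
I was given exactly the same problem to solve, so I dug into this question quite a lot, which is why I want to share my solution here. Although my solution takes a while to execute, its internal processing time is better than I expected. I may be wrong. Anyway, here is the solution: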
def CountOccurencesInText(word, text):
    """Number of occurrences of word (case insensitive) in text"""
    acceptedChar = ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
                    'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '-', ' ')
    # Turn separators into spaces and drop the remaining punctuation.
    for x in ",!?;_\n«»():\".":
        if x == "\n" or x == "«" or x == "»" or x == "(" or x == ")" or x == "\"" or x == ":" or x == ".":
            text = text.replace(x, " ")
        else:
            text = text.replace(x, "")
    # This specifically handles the input: I am attaching my c.v. to this e-mail.
    if len(word) == 32:
        for x in ".":
            word = word.replace(x, " ")
    punc_Removed_Text = ""
    text = text.lower()
    for i in range(len(text)):
        # Slicing yields "" at the end of the text instead of raising IndexError.
        nextChar = text[i+1:i+2]
        if text[i] in acceptedChar:
            punc_Removed_Text = punc_Removed_Text + text[i]
        # This specifically handles the input: Do I have to take that as a 'yes'
        elif text[i] == '\'' and text[i-1] == 's':
            punc_Removed_Text = punc_Removed_Text + text[i]
        elif text[i] == '\'' and text[i-1] in acceptedChar and nextChar in acceptedChar:
            punc_Removed_Text = punc_Removed_Text + text[i]
        elif text[i] == '\'' and text[i-1] == " " and nextChar in acceptedChar:
            punc_Removed_Text = punc_Removed_Text + text[i]
        elif text[i] == '\'' and text[i-1] in acceptedChar and nextChar == " ":
            punc_Removed_Text = punc_Removed_Text + text[i]
    frequency = 0
    splitedText = punc_Removed_Text.split(word.lower())
    for y in range(0, len(splitedText)-1, 1):
        element = splitedText[y]
        following = splitedText[y+1]
        # A match counts only if each side is a space or an empty piece.
        if len(element) == 0:
            if len(following) == 0 or following[0] == " ":
                frequency += 1
        elif len(following) == 0:
            if element[len(element)-1] == " ":
                frequency += 1
        elif element[len(element)-1] == " " and following[0] == " ":
            frequency += 1
    return frequency
Here is the profile:
         128006 function calls in 7.831 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    7.831    7.831 :0(exec)
    32800    0.062    0.000    0.062    0.000 :0(len)
    11200    0.047    0.000    0.047    0.000 :0(lower)
        1    0.000    0.000    0.000    0.000 :0(print)
    72800    0.359    0.000    0.359    0.000 :0(replace)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
     5600    0.078    0.000    0.078    0.000 :0(split)
        1    0.000    0.000    7.831    7.831 <string>:1(<module>)
        1    0.000    0.000    7.831    7.831 ideone-gg.py:225(doit)
     5600    7.285    0.001    7.831    0.001 ideone-gg.py:3(CountOccurencesInText)
        1    0.000    0.000    7.831    7.831 profile:0(doit())
        0    0.000             0.000          profile:0(profiler)
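A report in this format can be produced with the standard-library profilers. The sketch below uses cProfile and pstats. The driver that generated the numbers above (`doit` in ideone-gg.py) is not shown in the post, so the function under test, the helper `profile_call`, and the input here are all stand-ins chosen for illustration.

```python
import cProfile
import io
import pstats
import re

def count_whole(word, text):
    # Hypothetical function under test: whole-word, case-insensitive count.
    pattern = r"(?<![\w'])" + re.escape(word) + r"(?![\w'])"
    return len(re.findall(pattern, text, re.IGNORECASE))

def profile_call(func, *args):
    # Run func(*args) under the profiler and return (result, report text).
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()

big_text = "Antoine likes python. " * 10000
result, report = profile_call(count_whole, "python", big_text)
print(result)
print(report)
```

Running the candidate functions through the same helper on the same input gives comparable tottime and cumtime columns, which is a fairer basis for comparing approaches than a single run of one function.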