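I have a file in which I am trying to count occurrences of phrases. In some lines of text there are about 100 phrases I need to count. As a simple example, I have the following: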
phrases = """hello
name
john doe
"""
text1 = 'id=1: hello my name is john doe. hello hello. how are you?'
text2 = 'id=2: I am good. My name is Jane. Nice to meet you John Doe'
header = ''
for phrase in phrases.splitlines():
    header = header+'|'+phrase
header = 'id'+header
I would like to get output like the following:
id|hello|name|john doe
1|3|1|1
2|0|1|1
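I have the header. I am just not sure how to count each phrase and append the counts to the output.
Answer 0 (score: 3)
Create the list of phrases for the header: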
In [6]: p=phrases.strip().split('\n')
In [7]: p
Out[7]: ['hello', 'name', 'john doe']
Use a regular expression with word boundaries, i.e. \b, to count occurrences while avoiding partial matches. The re.I flag makes the search case-insensitive.
In [11]: import re
In [14]: re.findall(r'\b%s\b' % p[0], text1)
Out[14]: ['hello', 'hello', 'hello']
In [15]: re.findall(r'\b%s\b' % p[0], text1, re.I)
Out[15]: ['hello', 'hello', 'hello']
In [16]: re.findall(r'\b%s\b' % p[1], text1, re.I)
Out[16]: ['name']
In [17]: re.findall(r'\b%s\b' % p[2], text1, re.I)
Out[17]: ['john doe']
Wrap the call in len() to get the number of matches found.
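The steps above can be put together into a rough sketch that prints the table asked for in the question. The helper name count_phrase is hypothetical, re.escape is an addition not in the original answer (it guards against phrases that contain regex metacharacters), and the row ids are assumed to follow the order of the texts.

```python
import re

phrases = """hello
name
john doe
"""
text1 = 'id=1: hello my name is john doe. hello hello. how are you?'
text2 = 'id=2: I am good. My name is Jane. Nice to meet you John Doe'


def count_phrase(phrase, text):
    """Count whole-word, case-insensitive occurrences of phrase in text."""
    # re.escape is an added safeguard for phrases with regex metacharacters.
    pattern = r'\b%s\b' % re.escape(phrase)
    return len(re.findall(pattern, text, re.I))


p = phrases.strip().split('\n')
rows = ['|'.join(['id'] + p)]
# Assumption: the texts are listed in id order, starting at 1.
for n, text in enumerate([text1, text2], 1):
    counts = [count_phrase(phrase, text) for phrase in p]
    rows.append('|'.join(map(str, [n] + counts)))

print('\n'.join(rows))
```

With the sample texts this prints the header followed by 1|3|1|1 and 2|0|1|1, matching the desired output in the question.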
Answer 1 (score: 2)
You can use .count():
>>> text1.lower().count('hello')
3
So this should work (apart from the mismatches mentioned in the comments below):
phrases = """hello
name
john doe
"""
text1 = 'id=1: hello my name is john doe. hello hello. how are you?'
text2 = 'id=2: I am good. My name is Jane. Nice to meet you John Doe'
texts = [text1,text2]
header = ''
for phrase in phrases.splitlines():
    header = header+'|'+phrase
header = 'id'+header
print(header)
# enumerate from 1 so the row ids match 'id=1', 'id=2'
for id, text in enumerate(texts, 1):
    textcount = [id]
    for phrase in header.split('|')[1:]:
        # lower() makes the substring count case-insensitive
        textcount.append(text.lower().count(phrase))
    print("|".join(map(str, textcount)))
The above assumes your list of texts is in id order, but if they all start with 'id=n', you could do the following instead:
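for text in texts:
    id = text[3]  # assumes the id is a single digit at the 4th character
    textcount = [id]
    for phrase in header.split('|')[1:]:
        textcount.append(text.lower().count(phrase))
    print("|".join(map(str, textcount)))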
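Reading the id from a fixed character position only works for single-digit ids. As a hedged alternative, the id could be read from the 'id=n' prefix with a regular expression; the helper name parse_id below is hypothetical and assumes every text starts with 'id=' followed by digits.

```python
import re


def parse_id(text):
    """Return the digits after a leading 'id=' as a string, or None if absent."""
    m = re.match(r'id=(\d+)', text)
    return m.group(1) if m else None


print(parse_id('id=1: hello my name is john doe. hello hello. how are you?'))
print(parse_id('id=12: a text with a two-digit id'))
```

The returned string can be used in place of text[3] when building textcount.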
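Answer 2 (score: 0)
While this doesn't answer your question (@askewchan and @Fredrik have already done that), I thought I would offer a few suggestions on the rest of your approach:
You may be better off defining your phrases in a list: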
phrases = ['hello', 'name', 'john doe']
That lets you skip the loop that builds the header:
header = 'id|' + '|'.join(phrases)
You can also drop the .split('|')[1:] part of askewchan's answer in favor of, for example, for phrase in phrases:
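Putting these suggestions together with the .count() approach might look like the sketch below. It assumes the texts are listed in id order starting at 1. Note that .count() counts substrings, so a phrase is also counted when it appears inside a longer word.

```python
phrases = ['hello', 'name', 'john doe']
texts = [
    'id=1: hello my name is john doe. hello hello. how are you?',
    'id=2: I am good. My name is Jane. Nice to meet you John Doe',
]

# The header is built directly from the list, with no loop needed.
header = 'id|' + '|'.join(phrases)
lines = [header]
# Assumption: texts are in id order, so enumerate from 1.
for n, text in enumerate(texts, 1):
    counts = [text.lower().count(phrase) for phrase in phrases]
    lines.append('|'.join(map(str, [n] + counts)))

print('\n'.join(lines))
```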
Answer 3 (score: 0)
phrases = """hello
name
john doe
"""
text1 = 'id=1: hello my name is john doe. hello hello. how are you?'
text2 = 'id=2: I am good. My name is Jane. Nice to meet you John Doe'
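import re
import collections
txts = [text1, text2]
# split() breaks on any whitespace, so 'john doe' becomes two separate words
phrase_list = phrases.split()
print("id|%s" % "|".join(phrase_list))
for txt in txts:
    # pull the id and the remaining text out of 'id=n: ...'
    (tid, rest) = re.match(r"id=(\d+):\s*(.*)", txt).groups()
    # count every word in the text (case-sensitive)
    counter = collections.Counter(re.findall(r"\w+", rest))
    print("%s|%s" % (tid, "|".join([str(counter.get(p, 0)) for p in phrase_list])))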
This gives:
id|hello|name|john|doe
1|3|1|1|1
2|0|1|0|0
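The output above treats 'john' and 'doe' as separate words and is case-sensitive, so it differs from the table requested in the question. One possible variant, offered only as a sketch, keeps the Counter idea but matches whole phrases in a single pass with one combined pattern. It assumes the phrases do not overlap one another in the text, since re.findall returns non-overlapping matches; the names alternation and pattern are illustrative.

```python
import collections
import re

phrases = """hello
name
john doe
"""
texts = [
    'id=1: hello my name is john doe. hello hello. how are you?',
    'id=2: I am good. My name is Jane. Nice to meet you John Doe',
]

# splitlines() keeps multi-word phrases such as 'john doe' intact.
phrase_list = phrases.splitlines()
# Longest phrases first, so a longer phrase is tried before a shorter one
# that begins the same way.
alternation = '|'.join(
    re.escape(p) for p in sorted(phrase_list, key=len, reverse=True))
pattern = re.compile(r'\b(?:%s)\b' % alternation, re.I)

lines = ['id|' + '|'.join(phrase_list)]
for txt in texts:
    tid, rest = re.match(r'id=(\d+):\s*(.*)', txt).groups()
    # Lowercase each match so 'John Doe' is tallied under 'john doe'.
    counter = collections.Counter(m.lower() for m in pattern.findall(rest))
    lines.append(
        '%s|%s' % (tid, '|'.join(str(counter[p.lower()]) for p in phrase_list)))

print('\n'.join(lines))
```

With around 100 phrases, scanning each text once with a combined pattern avoids running a separate search per phrase.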