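I have a string containing millions of words, and I would like a regular expression that returns the five words on either side of any dollar sign. For example: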
string = 'I have a sentence with $10.00 within it and this sentence is done. '
I would like the regex to return
surrounding = ['I', 'have', 'a', 'sentence', 'with', 'within', 'it', 'and', 'this', 'sentence']
My end goal is to count all the words that surround a mention of '$', so that the list above becomes:
final_return = [('I', 1), ('have', 1), ('a', 1), ('sentence', 2), ('with', 1), ('within', 1), ('it', 1), ('and', 1), ('this', 1)]
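The regex I have developed so far returns the string attached to the currency symbol together with the 5 characters on either side of it. Is there a way to edit the regex so that it captures the five surrounding words instead? Should I use NLTK's tokenizer to achieve this, and if so, how?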
import re
.....\$\s?\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{1,2})?.....
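Answer 0 (score: 1)
Use split to break the string into words, drop the non-words with isalpha, then count how often each word occurs in the list. Note that this counts every alphabetic token in the string, not only the ones near the dollar sign.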
string='I have a sentence with $10.00 within it and this sentence is done. '
string1=string.split()
string2=[s for s in string1 if s.isalpha()]
[[x,string2.count(x)] for x in set(string2)]
#[['and', 1], ['within', 1], ['sentence', 2], ['it', 1], ['a', 1], ['have', 1], ['with', 1], ['this', 1], ['is', 1], ['I', 1]]
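On an input with millions of words, string2.count(x) rescans the whole list once for every distinct word. A minimal sketch of the same split/isalpha filtering with collections.Counter, which tallies in a single pass, might look like this (the names text, tokens and counts are illustrative and not part of the answer above):

```python
from collections import Counter

text = 'I have a sentence with $10.00 within it and this sentence is done. '

# Keep only purely alphabetic tokens, as in the answer above;
# '$10.00' and 'done.' are dropped because they contain punctuation.
tokens = [tok for tok in text.split() if tok.isalpha()]

# Counter tallies every token in one pass over the list.
counts = Counter(tokens)

print(counts.most_common(3))
```

Like the original, this still counts the whole string rather than a window around the dollar sign.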
Answer 1 (score: 0)
You can start from the code below; I tried to solve it in a simpler way.
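import re
string = 'I have a sentence with $10.00 within it and this sentence is done. '
# Five words, the dollar amount (optionally with two decimals), then five more words.
# \s+ between groups keeps a single word from being split across several groups.
surrounding = re.search(r'(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+\$\d+(?:\.\d{2})?\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)', string).groups()
print(surrounding)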
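One thing to watch with this approach: re.search returns None when nothing matches, so calling .groups() directly raises AttributeError on text that has no dollar amount with five words on each side. A small sketch that guards against that follows; the helper name surrounding_words and the amount pattern \$\d+(?:\.\d{2})? are my own assumptions about what should count as an amount.

```python
import re

# Build "five whitespace-separated words" once and reuse it on both sides.
FIVE_WORDS = r'\s+'.join([r'(\w+)'] * 5)
# Assumed amount format: digits with an optional two-digit decimal part.
AMOUNT = r'\$\d+(?:\.\d{2})?'
pattern = re.compile(FIVE_WORDS + r'\s+' + AMOUNT + r'\s+' + FIVE_WORDS)


def surrounding_words(text):
    """Return the ten words around the first dollar amount, or [] if none."""
    match = pattern.search(text)
    return list(match.groups()) if match else []


print(surrounding_words('I have a sentence with $10.00 within it and this sentence is done. '))
print(surrounding_words('no amount here'))
```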
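Answer 2 (score: 0)
I don't think a regex is the right choice for this problem. Instead, you can loop over the words once, keep track of the five most recently traversed words, and when you hit a token that starts with a dollar sign, combine those five with the next five words.
For this you can use collections.deque(), which is the right data structure for holding a limited number of items, to keep the previous five words. You can then use a collections.Counter() object to return the word counts within that window.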
from collections import deque
from collections import Counter
from itertools import chain

def my_counter(string):
    # Holds at most the five most recent non-dollar words.
    container = deque(maxlen=5)
    words = iter(string.split())

    def next_five(words):
        # Pull up to five more words from the shared iterator.
        for _ in range(5):
            try:
                yield next(words)
            except StopIteration:
                pass

    for w in words:
        if w.startswith('$'):
            # Five words before plus up to five words after the dollar token.
            yield Counter(chain(container, next_five(words)))
        else:
            container.append(w)
Demo:
In [8]: s = ' extra1 extra2 I have a sentence with $10.00 within it and this sentence is done.asdf asdf a b c d e $5 k j n m k gg ee'
In [9]: list(my_counter(s))
Out[9]:
[Counter({'I': 1,
          'a': 1,
          'and': 1,
          'have': 1,
          'it': 1,
          'sentence': 2,
          'this': 1,
          'with': 1,
          'within': 1}),
 Counter({'a': 1,
          'b': 1,
          'c': 1,
          'd': 1,
          'e': 1,
          'j': 1,
          'k': 2,
          'm': 1,
          'n': 1})]
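The generator above yields one Counter per dollar token. If what you want is a single tally across every mention, as in the question, one possible variant is to index into the token list and merge everything into one Counter. This is a sketch of a different technique from the deque version, not a drop-in replacement: it keeps the whole token list in memory, lets neighbouring windows overlap, and skips other dollar tokens that fall inside a window. The function name count_around_dollars and the width parameter are hypothetical.

```python
from collections import Counter


def count_around_dollars(text, width=5):
    """Tally words within `width` tokens of every token starting with '$'."""
    tokens = text.split()
    total = Counter()
    for i, tok in enumerate(tokens):
        if tok.startswith('$'):
            # Up to `width` tokens before and after the dollar token.
            window = tokens[max(0, i - width):i] + tokens[i + 1:i + 1 + width]
            # Ignore other dollar tokens that happen to be in the window.
            total.update(w for w in window if not w.startswith('$'))
    return total


s = 'I have a sentence with $10.00 within it and this sentence is done.'
result = count_around_dollars(s)
print(result.most_common())
```

Because windows may overlap here, a word sitting between two nearby amounts is counted once for each of them.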
Answer 3 (score: 0)
You can combine a regular expression with a Counter, as follows:
(?P<before>(?:\w+\W+){5})
\$\d+(?:\.\d+)?
(?P<after>(?:\W+\w+){5})
In Python:
from collections import Counter
import re
rx = re.compile(r'''
    (?P<before>(?:\w+\W+){5})
    \$\d+(?:\.\d+)?
    (?P<after>(?:\W+\w+){5})
    ''', re.VERBOSE)
sentence = 'I have a sentence with $10.00 within it and this sentence is done. '
words = [Counter(m.group('before').split() + m.group('after').split())
         for m in rx.finditer(sentence)]
print(words)
This yields (note that Counter is already a dict subclass):
[Counter({'sentence': 2, 'I': 1, 'have': 1, 'a': 1, 'with': 1, 'within': 1, 'it': 1, 'and': 1, 'this': 1})]
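To get from per-match Counters to the single list of (word, count) tuples the question asks for, one option is to fold every match into one Counter and call most_common() on it. The sketch below reuses the pattern from this answer; the second sentence in text is made up for illustration, and total and final_return are just illustrative names.

```python
from collections import Counter
import re

rx = re.compile(r'''
    (?P<before>(?:\w+\W+){5})
    \$\d+(?:\.\d+)?
    (?P<after>(?:\W+\w+){5})
    ''', re.VERBOSE)

# Second sentence is an invented example with five words on each side of $5.
text = ('I have a sentence with $10.00 within it and this sentence is done. '
        'Later we paid the sum of $5 for each of the other items.')

total = Counter()
for m in rx.finditer(text):
    # Add the five words before and the five words after each amount.
    total.update(m.group('before').split() + m.group('after').split())

# most_common() returns a list of (word, count) tuples, highest count first.
final_return = total.most_common()
print(final_return)
```

Keep in mind that the pattern only matches when a full five words are present on both sides, so amounts near the start or end of the text are skipped.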