Replace words in a string with exact matches from a dictionary

Asked: 2016-06-16 11:48:58

Tags: regex python-3.x

text = "One sentence with one (two) three, but mostly one. And twos."

Desired result: A sentence with A (B) C, but mostly A. And twos.

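Words should be replaced only when they exactly match an entry in lookup_dict. So the "two" inside "twos" should not be replaced, because that word has an extra letter. However, words next to spaces, commas, parentheses, and periods should be replaced.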

import re

lookup_dict = {'var': ["one", "two", "three"]}
match_dict = {'var': ["A", "B", "C"]}

var_dict = {}

# Pair each lookup word with its replacement.
for i, v in enumerate(lookup_dict['var']):
    var_dict[v] = match_dict['var'][i]

# Build the pattern once, after the dictionary is complete.
xpattern = re.compile('|'.join(var_dict.keys()))
result = xpattern.sub(lambda x: var_dict[x.group()], text.lower())

Result: A sentence with A (B) C, but mostly A. and Bs.

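Can I get the desired output without adding every combination of word plus adjacent character to the dictionary? That seems unnecessarily complicated: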

lookup_dict = {'var': ['one ', 'one,', '(one)', 'one.', 'two ', 'two,', '(two)', 'two.', 'three ', 'three,', '(three)', 'three.']}
...
result = xpattern.sub(lambda x: var_dict[x.group()] if x.group() in lookup_dict['var'] else x.group(), text.lower())

2 Answers:

Answer 0 (score: 4)

import re

w = "Where are we one today two twos them"
lookup_dict = {"one": "1", "two": "2", "three": "3"}
pattern = re.compile(r'\b(' + '|'.join(lookup_dict.keys()) + r')\b')
output = pattern.sub(lambda x: lookup_dict[x.group()], w)
print(output)

This prints 'Where are we 1 today 2 twos them'.

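Basically:

I updated your dictionary so that each word is a key mapped to its replacement.

I created a regex that matches any key in the dictionary, of the form \b(each|key|in|your|dictionary)\b. The word boundaries (\b) around the group ensure that a key only matches when it is not part of a longer word, that is, when it sits next to a space, parenthesis, comma, period, or the start or end of the string.

Then, using that pattern, every match is replaced with its dictionary value.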
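To show how this carries over to the text from the question, here is a minimal sketch. It makes two assumptions that are not in the answer's code: matching should be case-insensitive (so "One" is replaced while the rest of the sentence keeps its capitalization), and the keys are passed through re.escape in case they ever contain regex metacharacters.

```python
import re

text = "One sentence with one (two) three, but mostly one. And twos."
lookup_dict = {"one": "A", "two": "B", "three": "C"}

# re.escape guards against keys that contain regex metacharacters (assumption).
alternatives = "|".join(re.escape(key) for key in lookup_dict)

# \b asserts a word boundary on both sides; IGNORECASE lets "One" match "one".
pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)

# Lower-case the matched text before the lookup, since the keys are lower-case.
result = pattern.sub(lambda m: lookup_dict[m.group().lower()], text)
print(result)  # A sentence with A (B) C, but mostly A. And twos.
```

Because the original text is never lower-cased as a whole, "And" keeps its capital letter, and "twos" is left alone because there is no word boundary between "two" and "s".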

Answer 1 (score: 0)

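OK, I finally finished a solution! It is very verbose and I wouldn't let it babysit my kids, but here it is anyway. The other answer is probably the better solution :)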

First, there is a better way to represent the words you want to replace and their replacements:

lookup_dict = {"one": "A", "two": "B", "three": "C"}

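It looks like what you really want is to match whole words while ignoring punctuation and case. To do that, we can strip the punctuation from each whitespace-separated chunk before trying to match it, and then rebuild the original chunk with the letter "A" in place of "one", and so on.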

import re

text = "One sentence with one (two) three, but mostly one. And twos."

lookup_dict = {"one": "A", "two": "B", "three": "C"}

# Regex matching every character that is not a letter.
regex = re.compile('[^a-zA-Z]')

textSplit = text.split()

for i in range(0, len(textSplit)):
    # Get rid of punctuation.
    word = regex.sub('', textSplit[i]).lower()
    if word in lookup_dict:
        # Fetch the right letter from the lookup_dict.
        letter = lookup_dict[word]
        # Find where the word is in the punctuated string (super flakey I know).
        # Compare in lower case so that capitalized words such as "One" are found.
        wInd = textSplit[i].lower().find(word)
        # Just making sure the word needs to be reconstructed at all.
        if wInd != -1:
            # Rebuilding the string with punctuation.
            newWord = textSplit[i][0:wInd] + letter + textSplit[i][wInd+len(word):]
            textSplit[i] = newWord

print(" ".join(textSplit))

I know this isn't a great solution, but I'm done with it. I treated it as a bit of fun, so please no downvotes, haha.
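The same idea (match whole words, ignore case, leave punctuation untouched) can also be sketched without splitting and rebuilding the string by hand. This is an alternative technique, not the code above: it lets re.sub visit every run of ASCII letters and swaps in the dictionary value when one exists. The helper name replace_word is hypothetical, and treating a "word" as [A-Za-z]+ is an assumption.

```python
import re

text = "One sentence with one (two) three, but mostly one. And twos."
lookup_dict = {"one": "A", "two": "B", "three": "C"}


def replace_word(match):
    """Return the replacement for a whole word, or the word itself if unknown."""
    word = match.group()
    # Look up the lower-cased word; fall back to the original spelling.
    return lookup_dict.get(word.lower(), word)


# [A-Za-z]+ matches each maximal run of letters, so "twos" is looked up
# as "twos" (not "two") and punctuation is never part of the match.
result = re.sub(r"[A-Za-z]+", replace_word, text)
print(result)  # A sentence with A (B) C, but mostly A. And twos.
```

Since only the letters are matched, the surrounding parentheses, commas, and periods stay in place without any index arithmetic.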