Question

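Hi, I'm fairly new to Python and to programming. I'm trying to find a particular word wherever it starts a new sentence and replace it; in this case it is good old "Bob", to be replaced with "John". I'm using a dictionary and the .replace() method to do the replacement, swapping each dictionary key for its associated value. Here is my code: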

start_replacements = {'. Bob': '. John',
                      '! Bob': '! John', 
                      '? Bob': '? John',
                      '\nBob': '\nJohn',
                      }

def search_and_replace(start_word, replacement):
    with open('start_words.txt', 'r+') as article:
        read_article = article.read()
        replaced = read_article.replace(start_word, replacement)
        article.seek(0)
        article.write(replaced)

def main():
    for start_word, replacement in start_replacements.iteritems():
        search_and_replace(start_word, replacement)


if __name__ == '__main__':
    main()

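As you can see in the dictionary, I have four ways of finding "Bob" at the beginning of a sentence, but I have no idea how to find "Bob" at the very beginning of the text file without using a regular expression. I would rather avoid regular expressions to keep this script simpler. Is this possible?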

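Edit: contents of "start_words.txt" before running the script:

Bob is at the beginning of the file. Bob after period! Bob after exclamation? Bob after question.
Bob after newline.

Contents after running the script:

Bob is at the beginning of the file. John after period! John after exclamation? John after question.
John after newline.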

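Edit: why I'd rather not use regular expressions: I prefer to stick with a dictionary because it will grow every week as new words and phrases are added. In this case it is just "Bob", but the dictionary could grow to hundreds of entries. I'm not set against regular expressions, but as a relative newcomer I was trying to find out whether there is another way that I'm not aware of.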

Edit: @tripleee's third comment below is a great suggestion that works for what I want to do. Thanks a bunch.

Apologies, it was not my intention to vote on my own post or on some of the answers. All the help is appreciated.

Answer 1

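You can use a regular expression together with the dictionary. This way there is no need to iterate over the dictionary entries.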

import re

nonspaces = re.compile(r'\S+') # To extract the first word

def search_and_replace(filepath, replacement):
    def replace_sentence(match):
        def replace_name(match):
            name = match.group()
            return replacement.get(name, name)
        return nonspaces.sub(replace_name, match.group(), count=1)
        # count=1: to change only the first word.
    with open(filepath, 'r+') as f:
        replaced = re.sub('[^.!?]+', replace_sentence, f.read())
        f.seek(0)
        f.write(replaced)
        f.truncate()  # NOTE: Without this, stale trailing text remains if the new text is shorter.


start_replacement = {
    'Bob': 'John',
    'Sam': 'Jack',
    'Tom': 'Kevin',
}
search_and_replace('start_words.txt', start_replacement)

Notes on the regular expressions used.

[^.!?]: matches any character that is not ., ! or ?. Used to extract sentences.

>>> re.findall('[^.!?]+', 'Bob is at the beginning. Bob after period!')
['Bob is at the beginning', ' Bob after period']

\S: matches any non-whitespace character. Used to extract the first word (presumably a name):

>>> re.search(r'\S+', 'Bob is at the beginning').group()
'Bob'
>>> re.search(r'\S+', '   Tom after period!').group()
'Tom'

>>> re.sub(r'\S+', 'John', '   Bob and Tom.')
'   John John John'
>>> re.sub(r'\S+', 'John', '   Bob and Tom.', count=1)
'   John and Tom.'

See the re module documentation and the Regular Expression HOWTO.
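The NOTE next to f.truncate() in the code above is easy to miss, so here is a small sketch of what goes wrong without it. It uses an in-memory io.StringIO buffer as a stand-in for the file opened with 'r+' (that substitution is an assumption made purely to keep the example self-contained), and the sample sentence is made up.

```python
import io

# In-memory stand-in for a file opened in 'r+' mode (illustration only).
buf = io.StringIO("Kevin is here.")

# Read everything, replace a long name with a shorter one, write it back.
text = buf.read().replace("Kevin", "Tom")
buf.seek(0)
buf.write(text)

# The new text is two characters shorter, so the old tail is still there.
without_truncate = buf.getvalue()

# truncate() cuts the buffer at the current position, right after the new text.
buf.truncate()
with_truncate = buf.getvalue()

print(repr(without_truncate))  # 'Tom is here.e.'
print(repr(with_truncate))     # 'Tom is here.'
```

When the replacement is longer than the original (as with "Bob" to "John"), the problem does not show up, which is why it can go unnoticed.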

Answer 2

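You have to adjust either the data or the algorithm you are using to handle this special case.

For example, you can decorate the beginning of the data with some marker value and add the corresponding replacement to your dictionary.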

f_begin_deco = '\0\0\0'  # Sequence that won't be in data.

start_replacements = { f_begin_deco + 'Bob': f_begin_deco + 'John' }

# In your search_and_replace function.   
read_article = f_begin_deco + article.read()
replaced = read_article.replace(start_word, replacement)
replaced = replaced[len(f_begin_deco):]  # Remove beginning of file decoration.
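Put together, the fragments above might look like the following minimal sketch. It works on a string rather than on the file, and the names SENTINEL and replace_with_sentinel are made up for this illustration; it also assumes the marker sequence never occurs in the real text.

```python
SENTINEL = '\0\0\0'  # Assumption: this sequence never occurs in the real text.

def replace_with_sentinel(text, replacements):
    """Apply plain str.replace() rules, including rules anchored at the start."""
    decorated = SENTINEL + text
    for old, new in replacements.items():
        decorated = decorated.replace(old, new)
    # Strip the marker again before handing the text back.
    return decorated[len(SENTINEL):]

rules = {
    SENTINEL + 'Bob': SENTINEL + 'John',  # "Bob" at the very start of the text
    '. Bob': '. John',
    '\nBob': '\nJohn',
}

result = replace_with_sentinel('Bob is first. Bob is second.\nBob is third.', rules)
print(repr(result))  # 'John is first. John is second.\nJohn is third.'
```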

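You could also explore the context manager protocol to write more elegant data-decoration code.

The alternative is to change the search-and-replace algorithm so that it accounts for the special case.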
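As one possible reading of the context manager idea, the decoration could be added on entry and stripped on exit. The class below is a hypothetical helper written for this illustration (StartMarker and its attributes are not from any library), and it again assumes the marker never appears in the real text.

```python
class StartMarker:
    """Hypothetical helper: prepend a marker on entry, strip it on exit."""

    def __init__(self, text, marker='\0\0\0'):
        self.marker = marker
        self.text = text

    def __enter__(self):
        self.text = self.marker + self.text
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Remove the marker if it is still at the front of the text.
        if self.text.startswith(self.marker):
            self.text = self.text[len(self.marker):]
        return False  # Do not swallow exceptions.

with StartMarker('Bob is first. Bob is second.') as doc:
    # Inside the block the text starts with the marker, so it can be matched.
    doc.text = doc.text.replace(doc.marker + 'Bob', doc.marker + 'John')

print(doc.text)  # John is first. Bob is second.
```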

start_replacements = { 'Bob': 'John' }

# In your search_and_replace function.
if read_article.startswith(start_word):
    read_article = replacement + read_article[len(start_word):]
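Since the dictionary is expected to grow, a sketch along these lines keeps only the bare names in it and generates the prefixed variants in code, with startswith() covering the start of the text. The function name replace_sentence_starts and the SENTENCE_STARTS tuple are assumptions made for this example; the prefixes are the four from the question.

```python
SENTENCE_STARTS = ('. ', '! ', '? ', '\n')  # Same prefixes as in the question.

def replace_sentence_starts(text, names):
    """Replace each name at the start of the text or right after a prefix."""
    for old, new in names.items():
        # Special case: the name is the very first thing in the text.
        if text.startswith(old):
            text = new + text[len(old):]
        # General case: the name directly follows a sentence-ending prefix.
        for prefix in SENTENCE_STARTS:
            text = text.replace(prefix + old, prefix + new)
    return text

names = {'Bob': 'John', 'Sam': 'Jack'}
sample = 'Bob came. Sam left! Bob stayed.\nSam slept. I saw Bob.'
print(repr(replace_sentence_starts(sample, names)))
# 'John came. Jack left! John stayed.\nJack slept. I saw Bob.'
```

Note that plain str.replace() and startswith() do not check word boundaries, so a sentence starting with "Bobby" would become "Johnby"; the question's original dictionary has the same limitation.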

Answer 3

A question about the question: why don't you want to use regular expressions?

>>> import re
>>> x = "! Bob is a foo bar"
>>> re.sub('^[!?.\\n\\s]*Bob','John', x)
'John is a foo bar'
>>> x[:2]+re.sub('^[!?.\\n\\s]*Bob','John', x)
'! John is a foo bar'
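The x[:2] slice above re-attaches the prefix by hand, which only works when the prefix is exactly two characters long. A variation that may be easier to generalise is to capture the prefix in a group and put it back in the replacement, building the alternation of names from the dictionary. This is only a sketch: the function name replace_names_regex is made up, and the pattern assumes that a sentence start means the start of the text, a newline, or one of . ! ? followed by whitespace.

```python
import re

def replace_names_regex(text, names):
    """Replace dictionary names that start the text or a sentence."""
    if not names:
        return text
    # Longest names first, so a longer name wins over one that is its prefix.
    ordered = sorted(names, key=len, reverse=True)
    alternation = '|'.join(re.escape(name) for name in ordered)
    # Group 1: start of text, punctuation plus whitespace, or a newline.
    # Group 2: one of the names, followed by a word boundary.
    pattern = re.compile(r'(^|[.!?]\s+|\n)(' + alternation + r')\b')
    return pattern.sub(lambda m: m.group(1) + names[m.group(2)], text)

names = {'Bob': 'John', 'Sam': 'Jack'}
sample = 'Bob went home. Bob slept! Tell Bob.\nSam woke.'
print(repr(replace_names_regex(sample, names)))
# 'John went home. John slept! Tell Bob.\nJack woke.'
```

The dictionary stays the single place where names are added, and the word boundary keeps "Bobby" from being touched.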

Here is my attempt without regular expressions:

>>> x = "! Bob is a foo bar"
>>> first = ['!','?','.','\n']
>>> x = x.split()
>>> x[1] ="John" if x[1] == "Bob" and x[0] in first else x[1]
>>> x
['!', 'John', 'is', 'a', 'foo', 'bar']
>>> " ".join(x)
'! John is a foo bar'

As @falsetru pointed out, this breaks when the string starts with a newline:

>>> x = "\n Bob is a foo bar"
>>> x = x.split()
>>> x[1] ="John" if x[1] == "Bob" and x[0] in first else x[1]
>>> " ".join(x)
'Bob is a foo bar'

The simplest way to work around str.split() removing the \n is probably:

>>> x = "\n Bob is a foo bar"
>>> y = x.split()
>>> y[1] ="John" if y[1] == "Bob" and y[0] in first else y[1]
>>> y
['Bob', 'is', 'a', 'foo', 'bar']
>>> if x.split()[0] == "\n":
...     y.insert(0,'\n')
... 
>>> " ".join(y)
'Bob is a foo bar'
>>> y
['Bob', 'is', 'a', 'foo', 'bar']
>>> if x[0] == "\n":
...     y.insert(0,'\n')
... 
>>> " ".join(y)
'\n Bob is a foo bar'

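I should stop adding to my answer, or I will just be encouraging the OP to use a non-regex solution for something that a regular expression solves easily.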

Is it possible to match the beginning of a text file without using regular expressions?
