Question

I'm trying to find words in a text file that are 7 letters long and contain the letters a, b, c, e and r. So far I have this:

import re

file = open("dictionary.txt","r")
text = file.readlines()
file.close()


keyword = re.compile(r'\w{7}')

for line in text:
    result = keyword.search (line)
    if result:
       print (result.group())

Can anyone help me?

Answer 1

You need to match not just word characters but also word boundaries:

keyword = re.compile(r'\b\w{7}\b')

The \b anchor matches at the start or end of a word, which restricts the match to words of exactly 7 characters.
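To see what the anchors change, here is a small sketch that runs both patterns over an in-memory string. The sample text is made up for illustration and is not taken from the asker's dictionary.txt.

```python
import re

# Hypothetical sample: one 8-letter word followed by two 7-letter words.
sample = "bracelet abreact cabinet"

unanchored = re.compile(r'\w{7}')
anchored = re.compile(r'\b\w{7}\b')

# Without \b the pattern also grabs the first 7 characters of longer words.
unanchored_matches = unanchored.findall(sample)
# With \b only whole words of exactly 7 characters match.
anchored_matches = anchored.findall(sample)

print(unanchored_matches)  # ['bracele', 'abreact', 'cabinet']
print(anchored_matches)    # ['abreact', 'cabinet']
```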

It is also more efficient to loop over the file line by line rather than reading it all into memory at once:

import re

keyword = re.compile(r'\b\w{7}\b')

with open("dictionary.txt","r") as dictionary:    
    for line in dictionary:
        for result in keyword.findall(line):
            print(result)

Using keyword.findall() gives us a list of all matches on the line.
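The difference from the search() call in the question only shows up when a line holds more than one word. A minimal sketch, assuming a made-up line with several words on it:

```python
import re

keyword = re.compile(r'\b\w{7}\b')

# Hypothetical line containing two 7-letter words and one shorter word.
line = "cabinet maker abreact\n"

# search() stops at the first match on the line.
first_only = keyword.search(line).group()
# findall() returns every non-overlapping match as a list of strings.
all_matches = keyword.findall(line)

print(first_only)   # cabinet
print(all_matches)  # ['cabinet', 'abreact']
```

If the dictionary file has exactly one word per line, both calls find the same words, and findall() simply returns a list of zero or one items.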

To check whether a match contains at least one of the required characters, I would personally just use a set intersection test:

import re

keyword = re.compile(r'\b\w{7}\b')
required = set('abcer')

with open("dictionary.txt","r") as dictionary:    
    for line in dictionary:
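        # Keep only the words that share at least one letter with `required`;
        # an empty intersection is falsy, so such words are filtered out.
        matches = [word for word in keyword.findall(line) if required.intersection(word)]
        for match in matches:
            print(match)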
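The question's wording ("a, b, c, e and r") can also be read as requiring all five letters rather than at least one. Under that reading, a subset test could replace the intersection test. This is a sketch, not part of the original answer: the helper name words_with_all_required and the sample lines are made up, and the input is assumed to be lowercase.

```python
import re

keyword = re.compile(r'\b\w{7}\b')
required = set('abcer')

def words_with_all_required(lines):
    """Yield 7-character words that contain every letter in `required`."""
    for line in lines:
        for word in keyword.findall(line):
            # issubset() accepts any iterable, so the word can be passed directly.
            if required.issubset(word):
                yield word

# Hypothetical stand-in for the lines of dictionary.txt.
sample_lines = ["abreact\n", "cabinet\n", "bracelet\n", "bracers\n"]
found = list(words_with_all_required(sample_lines))
print(found)  # ['abreact', 'bracers']
```

Here "cabinet" is rejected because it has no r, and "bracelet" is rejected because it is 8 letters long.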

Answer 2

\b(?=\w{0,6}?[abcer])\w{7}\b

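This is the regex you want. It works by taking the basic form for a word of exactly seven letters (\b\w{7}\b) and adding a lookahead: a zero-width assertion that looks ahead and tries to find one of the letters you need. A breakdown: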

\b            A word boundary
(?=           Look ahead and find...
    \w        A word character (A-Za-z0-9_, plus other Unicode letters and digits in Python 3)
    {0,6}     Repeated 0 to 6 times
    ?         Lazily (not necessary, but marginally more efficient).
    [abcer]   Followed by one of a, b, c, e, or r
)             Go back to where we were before (just after the word boundary)
\w            And match a word character
{7}           Exactly seven times.
\b            Then one more word boundary.
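As a quick check, the pattern can be run from Python on an in-memory string. The second pattern below is not from the answer: it is one possible variant, assuming the goal is to require all five letters, that uses one lookahead per letter. The sample text is made up.

```python
import re

# The pattern from the answer: a 7-character word with at least one of a, b, c, e, r.
any_letter = re.compile(r'\b(?=\w{0,6}?[abcer])\w{7}\b')

# Assumed variant: one lookahead per required letter, so every letter must appear.
all_letters = re.compile(r'\b(?=\w*a)(?=\w*b)(?=\w*c)(?=\w*e)(?=\w*r)\w{7}\b')

# Hypothetical sample text.
sample = "abreact cabinet bracelet kissing"

any_matches = any_letter.findall(sample)
all_matches = all_letters.findall(sample)

print(any_matches)  # ['abreact', 'cabinet']
print(all_matches)  # ['abreact']
```

Because \w cannot match the space after a word, each lookahead stays inside the current word, so letters from a neighbouring word never satisfy it.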

Finding words in a text file that contain specific characters and have a specific length
