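Finding words in a text file that contain specific characters and have a specific length

Time: 2013-04-09 19:41:17

Tags: python regex

I am trying to find words in a text file that are 7 letters long and contain the letters a, b, c, e and r. So far I have this: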

import re

file = open("dictionary.txt","r")
text = file.readlines()
file.close()


keyword = re.compile(r'\w{7}')

for line in text:
    result = keyword.search(line)
    if result:
        print(result.group())

Can anyone help me?

2 answers:

Answer 0: (score: 2)

You need to match not only word characters but also word boundaries:

keyword = re.compile(r'\b\w{7}\b')

The \b anchor matches at the start or end of a word, so the pattern only matches words of exactly 7 characters.
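To illustrate the difference the anchors make, here is a minimal sketch that runs both patterns over a made-up line (the sample words and the names `unbounded` and `bounded` are assumptions for illustration, not part of the original question):

```python
import re

# The asker's original pattern and the anchored version from this answer.
unbounded = re.compile(r'\w{7}')
bounded = re.compile(r'\b\w{7}\b')

# Hypothetical sample line: a 10-letter, an 8-letter and a 7-letter word.
line = "dictionary bracelet cabaret"

# Without \b, the first seven characters of any longer word also match.
unbounded_matches = unbounded.findall(line)
# With \b on both sides, only whole words of exactly seven characters match.
bounded_matches = bounded.findall(line)

print(unbounded_matches)  # ['diction', 'bracele', 'cabaret']
print(bounded_matches)    # ['cabaret']
```

Without the anchors, longer words leak through as truncated seven-character fragments, which is probably not what the question wants.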

It is also more efficient to loop over the file line by line instead of reading it all into memory at once:

import re

keyword = re.compile(r'\b\w{7}\b')

with open("dictionary.txt","r") as dictionary:    
    for line in dictionary:
        for result in keyword.findall(line):
            print(result)

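Using keyword.findall() gives us a list of all matches on the line.

To check whether a match contains at least one of the required characters, I would personally just use a set intersection test: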

import re

keyword = re.compile(r'\b\w{7}\b')
required = set('abcer')

with open("dictionary.txt", "r") as dictionary:
    for line in dictionary:
        # Keep only the words that share at least one letter with `required`.
        results = [word for word in keyword.findall(line) if required.intersection(word)]
        for result in results:
            print(result)
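The question can also be read as requiring all five letters rather than any one of them. Under that assumption, a subset test may be a better fit than an intersection test. The following is only a sketch: the helper name `words_with_all_required` and the sample lines are hypothetical, and lowercasing each word is an extra assumption about how the dictionary is cased.

```python
import re

keyword = re.compile(r'\b\w{7}\b')
required = set('abcer')

def words_with_all_required(lines):
    """Yield seven-character words that contain every letter in `required`."""
    for line in lines:
        for word in keyword.findall(line):
            # issubset accepts any iterable, so the word can be passed directly.
            if required.issubset(word.lower()):
                yield word

# Hypothetical stand-in for the lines of dictionary.txt.
sample = ["bracket\n", "example\n", "cabaret\n", "dictionary\n", "brace\n"]
matches = list(words_with_all_required(sample))
print(matches)  # ['bracket', 'cabaret']
```

Because the helper accepts any iterable of lines, an open file object could be passed in place of `sample`.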

Answer 1: (score: 1)

\b(?=\w{0,6}?[abcer])\w{7}\b

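This is the regex you want. It starts from the basic pattern for a word of exactly seven word characters (\b\w{7}\b) and adds a lookahead: a zero-width assertion that looks forward and tries to find one of the letters you need. Breakdown: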

\b            A word boundary
(?=           Look ahead and find...
    \w        A word character (A-Za-z0-9_)
    {0,6}     Repeated 0 to 6 times
    ?         Lazily (not necessary, but marginally more efficient).
    [abcer]   Followed by one of a, b, c, e, or r
)             Go back to where we were before (just after the word boundary)
\w            And match a word character
{7}           Exactly seven times.
\b            Then one more word boundary.
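As a rough check of the pattern above, the sketch below applies it to a handful of made-up words. It also shows a variant, not from the original answer, that chains one lookahead per letter in case every one of a, b, c, e and r must be present. The names `any_letter`, `all_letters` and the sample string are assumptions for illustration.

```python
import re

# The pattern from this answer: seven word characters, at least one of a, b, c, e, r.
any_letter = re.compile(r'\b(?=\w{0,6}?[abcer])\w{7}\b')

# Assumed variant: one lookahead per letter, so all five letters must appear.
all_letters = re.compile(r'\b(?=\w*a)(?=\w*b)(?=\w*c)(?=\w*e)(?=\w*r)\w{7}\b')

# Hypothetical sample text.
words = "bracket cabaret example kingdom rhythms dictionary"

any_matches = any_letter.findall(words)
all_matches = all_letters.findall(words)

print(any_matches)  # ['bracket', 'cabaret', 'example', 'rhythms']
print(all_matches)  # ['bracket', 'cabaret']
```

Each lookahead is zero-width, so chaining several of them tests independent conditions from the same starting position before \w{7} consumes the word.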