Question

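I have code that loops through files recursively, looking for words from a list. When it finds one, it prints the file it was found in, the string that was searched for, and the line where it was found.

My problem is that searching for api also matches myapistring, 'pass' matches 'compass', and 'dev' matches 'device' rather than the actual word. So I need to implement a regex somewhere, but I'm not sure where, or in which part of the for loop.

The regex that (I think) works is: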

regex='([\w.]+)'

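import os
import re

rootpath = myDir
wordlist = ["api", "pass", "dev"]
exclude = ["testfolder", "testfolder2"]
complist = []
files = []

for word in wordlist:
    # Plain substring patterns: this is why 'api' also matches 'myapistring'
    complist.append(re.compile(word, re.IGNORECASE))

for path, name, fname in os.walk(rootpath):
    # Prune excluded directories in place so os.walk does not descend into them
    name[:] = [d for d in name if d not in exclude]
    for fileNum in fname:
        files.append(os.path.join(path, fileNum))

for fileLine in files:
    # exten (the list of file extensions to scan) is defined elsewhere
    if any(ext in fileLine for ext in exten):
        with open(fileLine, "r") as fh:
            for count, line in enumerate(fh, start=1):
                for lv in complist:
                    # Flags belong in re.compile(); the second argument of a
                    # compiled pattern's findall() is a start position
                    for mat in lv.findall(line):
                        print(fileLine, mat, count, line.rstrip())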

Thanks

Edit: added the code provided below:

for word in wordlist:
    # Raw strings are needed: a plain '\b' is a backspace character in Python
    complist.extend([re.compile(r'\b' + re.escape(word) + r'\b')])

This works with a few errors, but well enough that I can use it.

Answer 1

Instead of:

for word in wordlist:
    complist.extend([re.compile(word)])

use word boundaries:

for word in wordlist:
    complist.extend([re.compile(r'\b{}\b'.format(word))])

\b is a zero-length match at the start or end of a word, so \bthe\b will match this line:

the lazy dog

but not this line:

then I checked StackOverflow
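
The difference can be checked directly in Python. This is a minimal sketch using the two sample lines above; the variable names are made up for illustration. It also shows why the r prefix matters: in a normal string literal, '\b' is a backspace character, not a word boundary.

```python
import re

# Raw string: without the r prefix, '\b' would be a backspace character
# rather than the regex word-boundary assertion.
the_word = re.compile(r'\bthe\b')

matches_word = the_word.search('the lazy dog') is not None
matches_prefix = the_word.search('then I checked StackOverflow') is not None

# Same pattern written without the r prefix: it looks for literal backspaces.
backspace_variant = re.compile('\bthe\b').search('the lazy dog') is not None

print(matches_word)       # True
print(matches_prefix)     # False
print(backspace_variant)  # False
```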

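Another thing I'd point out is that if word contains any special characters that mean something to the regex engine, they will be interpreted as part of the regex. So instead of: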

complist.extend([re.compile(r'\b{}\b'.format(word))])

use:

complist.extend([re.compile(r'\b{}\b'.format(re.escape(word)))])
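
To see what escaping changes, here is a small sketch. The search term 'v1.0' and the sample sentences are assumptions chosen for illustration, not taken from the question; any word containing a character such as '.' would behave similarly.

```python
import re

word = 'v1.0'  # hypothetical search term containing a regex metacharacter

unescaped = re.compile(r'\b{}\b'.format(word))
escaped = re.compile(r'\b{}\b'.format(re.escape(word)))

text = 'build v1x0 is unrelated'

# Unescaped, '.' matches any character, so 'v1x0' is a (false) hit.
unescaped_hit = unescaped.search(text) is not None
# Escaped, '.' must be a literal dot, so 'v1x0' is not a hit.
escaped_hit = escaped.search(text) is not None
# The escaped pattern still finds the literal term.
literal_hit = escaped.search('released v1.0 today') is not None

print(unescaped_hit)  # True
print(escaped_hit)    # False
print(literal_hit)    # True
```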

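Edit: as mentioned in the comments, you also want to match words separated by _. Python treats _ as a "word character"; to include it as a word separator, you can do this: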

re.compile(r'(?:\b|_){}(?:\b|_)'.format(re.escape(word)))

See it working here:

In [45]: regex = re.compile(r'(?:\b|_){}(?:\b|_)'.format(re.escape(word)))

In [46]: regex.search('this line contains is_admin')
Out[46]: <_sre.SRE_Match at 0x105bca3d8>

In [47]: regex.search('this line contains admin')
Out[47]: <_sre.SRE_Match at 0x105bca4a8>

In [48]: regex.search("does not have the word")

In [49]: regex.search("does not have the wordadminword")
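
Putting the pieces together for the loop in the question might look roughly like the sketch below. The function names build_patterns and find_words are hypothetical, and the sketch takes a list of lines instead of walking the filesystem so that it stays self-contained. Case-insensitivity is requested in re.compile(), since the second argument of a compiled pattern's findall() or search() is a start position, not a flag.

```python
import re

def build_patterns(words):
    """Compile one case-insensitive whole-word pattern per search term."""
    return {
        word: re.compile(
            r'(?:\b|_){}(?:\b|_)'.format(re.escape(word)), re.IGNORECASE
        )
        for word in words
    }

def find_words(lines, patterns):
    """Yield (line_number, word, line) for each word found on each line."""
    for number, line in enumerate(lines, start=1):
        for word, pattern in patterns.items():
            if pattern.search(line):
                yield number, word, line.rstrip('\n')

# Made-up sample input standing in for the lines of a file.
sample = [
    'my compass is broken\n',
    'API key missing\n',
    'call myapistring here\n',
    'is_dev = True\n',
    'the device is on\n',
]

hits = list(find_words(sample, build_patterns(['api', 'pass', 'dev'])))
for hit in hits:
    print(hit)
# (2, 'api', 'API key missing')
# (4, 'dev', 'is_dev = True')
```

In the real script, each file opened inside the os.walk loop could be passed to find_words in place of sample, since a file object iterates line by line.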
