Question

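I have a large txt file and want to extract all strings that follow this pattern:

/m/meet_the_crr
/m/commune
/m/hann_2

This is what I tried: reading the whole file with .read(), removing the newlines with .replace("\n", ""), and calling re.match(r'^\/m\/[a-zA-Z0-9_-]+$', contents) on the result.

All I get back is a plain "None". What am I doing wrong here?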

Answer 1

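Your pattern is not logically wrong. In fact, it does match the input you describe when it is applied to a single one of those strings: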

import re

result = re.match(r'^\/m\/[a-zA-Z0-9_-]+$', '/m/meet_the_crr')
if result:
    print(result.groups())    # this line is reached, as there is a match

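Since you did not specify any capture groups, you see () printed to the console. You can wrap the whole pattern in a capture group, and the matched input is then available, e.g.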

result = re.match(r'(^\/m\/[a-zA-Z0-9_-]+$)', '/m/meet_the_crr')
if result:
    print(result.group(1))    # text captured by the first group

/m/meet_the_crr
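
As a small aside on the point above, a match object also exposes the full matched text as group 0, so a capture group is not strictly required to get the string back. A minimal sketch, reusing the pattern and input from this answer (the variable names are illustrative only):

```python
import re

pattern = r'^/m/[a-zA-Z0-9_-]+$'
result = re.match(pattern, '/m/meet_the_crr')

# group(0) is always the text of the whole match
whole = result.group(0) if result else None
# groups() only lists capture groups, and this pattern has none
groups = result.groups() if result else None

print(whole)
print(groups)
```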

Answer 2

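You are reading the whole file into a variable (into memory) with .read(). With .replace("\n", ""), you then remove every newline from that string. re.match(r'^\/m\/[a-zA-Z0-9_-]+$', contents) tries to match a string that consists entirely of the \/m\/[a-zA-Z0-9_-]+ pattern, which is impossible after those operations, because the text is now one long run of concatenated entries.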
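
To make the failure concrete, here is a minimal sketch. It assumes the file holds the three sample strings, one per line, and uses an in-memory string as a stand-in for the file contents:

```python
import re

pattern = r'^\/m\/[a-zA-Z0-9_-]+$'

# Stand-in for text_file.read(); the real file contents are an assumption
contents = "/m/meet_the_crr\n/m/commune\n/m/hann_2\n"

# Removing the newlines glues all entries into one long string
joined = contents.replace("\n", "")
print(joined)

# '/' is not in the character class, so the pattern cannot reach the end ($)
no_match = re.match(pattern, joined)
print(no_match)
```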

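There are at least two ways to fix this. Either remove .replace("\n", "") (so the newlines are kept) and use re.findall(r'^/m/[\w-]+$', contents, re.M) (the re.M option makes ^ and $ match at the start and end of each line rather than of the whole text), or read the file line by line, check each line with your re.match version, and add it to the final list if it matches.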

Example:

import re
with open("testfile.txt", "r") as text_file:
    contents = text_file.read()
    print(re.findall(r'^/m/[\w-]+$', contents, re.M))

or

import re
with open("testfile.txt", "r") as text_file:
    for line in text_file:
        if re.match(r'/m/[\w-]+\s*$', line):
            print(line.rstrip())
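
If you want the matches in a list rather than printed, the line-by-line variant can be sketched as follows. Here io.StringIO is only a stand-in for the opened file, and the sample lines and variable names are assumptions made for illustration:

```python
import io
import re

# Compile once, since the same pattern is applied to every line
line_pattern = re.compile(r'/m/[\w-]+\s*$')

# io.StringIO stands in for open("testfile.txt", "r")
fake_file = io.StringIO("/m/meet_the_crr\nnot a match\n/m/commune\n/m/hann_2\n")

matches = []
for line in fake_file:
    # match() anchors at the start of the line; \s*$ tolerates the trailing newline
    if line_pattern.match(line):
        matches.append(line.rstrip())

print(matches)
```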

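Note that I used \w to make the pattern a bit shorter. In Python 3, \w also matches non-ASCII letters and digits by default, so if you only want to match ASCII letters, digits and the underscore, pass the re.ASCII option as well.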

Also, / is not a special character in Python regex patterns, so there is no need to escape it.
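
The effect of re.ASCII and the unescaped / can be checked with a short sketch. The sample text, including the accented entry, is made up for illustration:

```python
import re

# "\u00e9" is a precomposed e-acute; the sample entries are hypothetical
text = "/m/caf\u00e9\n/m/cafe\n"

# Default (Unicode) mode: \w also matches the accented letter
unicode_hits = re.findall(r'^/m/[\w-]+$', text, re.M)

# With re.ASCII, \w is limited to [a-zA-Z0-9_]
ascii_hits = re.findall(r'^/m/[\w-]+$', text, re.M | re.ASCII)

# Escaping the slash changes nothing
escaped_hits = re.findall(r'^\/m\/[\w-]+$', text, re.M | re.ASCII)

print(unicode_hits)
print(ascii_hits)
print(escaped_hits == ascii_hits)
```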

Answer 3

Do not remove the line endings, and use the re.MULTILINE flag so that you get multiple results back from the larger text:

# write a demo file
with open("t.txt","w") as f:
    f.write("""
/m/meet_the_crr\n
/m/commune\n
/m/hann_2\n\n
# your text looks like this after .read().replace(\"\\n\",\"\")\n
/m/meet_the_crr/m/commune/m/hann_2""")

Program:

import re

regex = r"^\/m\/[a-zA-Z0-9_-]+$"

with open("t.txt","r") as f:
    contents = f.read()

found_all = re.findall(regex, contents, re.M)

print(found_all)
print("-")
print(contents)    # the file text read above, without reopening the file

Output:

['/m/meet_the_crr', '/m/commune', '/m/hann_2']

File contents:

/m/meet_the_crr

/m/commune

/m/hann_2


# your text looks like this after .read().replace("\n","")

/m/meet_the_crr/m/commune/m/hann_2

This is essentially what Wiktor Stribiżew told you in his comment, although he also suggests a better pattern: r'^/m/[\w-]+$'

Matching a simple string with regex not working?