Question

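I need to search some XML files (all with the same name, pom.xml), including those in subfolders, for exactly the following text sequence, so that I get an alert if someone has written any text, or even a blank, inside it: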

     <!--
     | Startsection
     |-->         
    <!-- 
     | Endsection
     |-->

I'm running the following Python script, but it still doesn't match exactly: I get an alert even when only part of the text is present:

import re
import os
from os.path import join
comment=re.compile(r"<!--\s+| Startsection\s+|-->\s+<!--\s+| Endsection\s+|-->")
tag="<module>"

for root, dirs, files in os.walk("."):

    if "pom.xml" in files:
        p=join(root, "pom.xml") 
        print("Checking",p)
        with open(p) as f:
            s=f.read()
        if tag in s and comment.search(s):
            print("Matched",p)

Update #3

If a <module> tag exists between the |--> and <!-- markers, I want to print its content. The section being searched is:

 <!--
 | Startsection
 |-->         
 <!-- 
 | Endsection
 |-->

For example, after a match, print the content along with the file name; in the example below, "example.test1" would be printed too:

     <!--
     | Startsection
     |-->         
       <module>example.test1</module>
     <!-- 
     | Endsection
     |-->

Update #4

The following should be used:

import re
import os
from os.path import join
comment=re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->", re.MULTILINE)
tag="<module>"

for root, dirs, files in os.walk("/home/temp/test_folder/"):
 for skipped in ("test1", "test2", ".repotest"):
    if skipped in dirs: dirs.remove(skipped)

 if "pom.xml" in files:
    p=join(root, "pom.xml") 
    print("Checking",p)
    with open(p) as f:
       s=f.read()
       if tag in s and comment.search(s):
          print("The following files are corrupted ",p)

Update #5

import re
import os
import xml.etree.ElementTree as etree 
from bs4 import BeautifulSoup 
from bs4 import Comment

from os.path import join
comment=re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->", re.MULTILINE)
tag="<module>"

for root, dirs, files in os.walk("myfolder"):
 for skipped in ("model", "doc"):
    if skipped in dirs: dirs.remove(skipped)

 if "pom.xml" in files:
    p=join(root, "pom.xml") 
    print("Checking",p)
    with open(p) as f:
       s=f.read()
       if tag in s and comment.search(s):
          print("ERROR: The following file are corrupted",p)



bs = BeautifulSoup(open(p), "html.parser")
# Extract all comments
comments = bs.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    # Check if it's the start of the code
    if "Start of user code" in c:
        modules = c.find_next_siblings(name='module')
        for mod in modules:
            print(mod.text)

Answer 1

Don't parse XML files with regular expressions. The best Stack Overflow answer ever explains why.

You can use BeautifulSoup to help with this task.

See how simple it is to extract the content from your code:

from bs4 import BeautifulSoup

content = """
    <!--
     | Start of user code (user defined modules)
     |-->

    <!--
     | End of user code
     |-->
"""

bs = BeautifulSoup(content, "html.parser")
print(''.join(bs.contents))

Of course, you can use your XML file instead of the literal I'm using:

bs = BeautifulSoup(open("pom.xml"), "html.parser")

A small example using the expected input:

from bs4 import BeautifulSoup
from bs4 import Comment

# p is the path to a pom.xml file, as in the question's loop
bs = BeautifulSoup(open(p), "html.parser")
# Extract all comments
comments = bs.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    # Check if it's the start of the code
    if "Start of user code" in c:
        # Collect the <module> siblings that follow this comment
        modules = c.find_next_siblings(name='module')
        for mod in modules:
            print(mod.text)
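
If BeautifulSoup is not available, roughly the same idea can be sketched with the standard library. This is only an illustration under some assumptions: it swaps in xml.etree.ElementTree (which the question already imports) for BeautifulSoup, it needs Python 3.8+ for insert_comments, and the function name modules_between_markers and the sample POM are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real pom.xml would be read from disk instead.
POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modules>
    <module>outside</module>
    <!--
     | Startsection
     |-->
    <module>example.test1</module>
    <!--
     | Endsection
     |-->
  </modules>
</project>"""


def modules_between_markers(xml_text, start="Startsection", end="Endsection"):
    """Return the text of <module> elements located between the two marker comments."""
    # insert_comments=True (Python 3.8+) keeps comments as nodes in the tree.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)
    found = []
    for parent in root.iter():
        inside = False
        for child in parent:
            if child.tag is ET.Comment:
                text = child.text or ""
                if start in text:
                    inside = True
                elif end in text:
                    inside = False
            # Strip a "{namespace}" prefix so only the local tag name is compared.
            elif inside and child.tag.rsplit("}", 1)[-1] == "module":
                found.append((child.text or "").strip())
    return found


print(modules_between_markers(POM))
```

An empty list means nothing was placed between the markers, which is the "untouched" case the question wants to tell apart.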

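But if your code is always inside module tags, I don't see why you should care about the comments before or after it; you can simply find the code inside the module tags directly.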
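
A minimal sketch of that direct lookup, assuming the standard-library xml.etree.ElementTree is acceptable in place of BeautifulSoup. The helper name module_names and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; in practice the text would come from a pom.xml file.
POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modules>
    <module>example.test1</module>
    <module>example.test2</module>
  </modules>
</project>"""


def module_names(xml_text):
    """Return the text of every <module> element, ignoring comments entirely."""
    root = ET.fromstring(xml_text)
    # Maven POMs usually declare a default namespace, so tags look like
    # "{http://maven.apache.org/POM/4.0.0}module"; compare the local name only.
    return [
        (el.text or "").strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "module"
    ]


print(module_names(POM))
```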

Answer 2

The "|()" characters must be escaped; also add re.MULTILINE to the regex.

comment=re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->", re.MULTILINE)
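
To illustrate why the escaping matters, here is a small sketch comparing the question's unescaped pattern with an escaped one. The sample strings and the matches helper are assumptions made up for the demonstration.

```python
import re

# Unescaped: each "|" splits the pattern into alternatives, so any single
# fragment (for example a lone "<!-- ") is enough for search() to succeed.
loose = re.compile(r"<!--\s+| Startsection\s+|-->\s+<!--\s+| Endsection\s+|-->")

# Escaped: "\|" matches a literal pipe, so the whole sequence must be present.
strict = re.compile(
    r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->"
)

# Hypothetical inputs: an untouched section and an unrelated comment.
EMPTY_SECTION = """
    <!--
     | Startsection
     |-->
    <!--
     | Endsection
     |-->
"""
UNRELATED = "<project><!-- just a note --></project>"


def matches(pattern, text):
    return pattern.search(text) is not None


print(matches(loose, UNRELATED))       # the partial match the question complains about
print(matches(strict, UNRELATED))
print(matches(strict, EMPTY_SECTION))
```

Since neither pattern uses ^ or $, the re.MULTILINE flag does not change the result here; \s+ already matches newlines.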

Edit: you can also put a newline in the regex: \n

Any amount of whitespace (or none) would be: \s*
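
As a rough sketch of the difference, the snippet below builds the same marker pattern once with \s* and once with \s+. The variable names and sample strings are hypothetical.

```python
import re

# \s* accepts zero or more whitespace characters between the pieces.
tolerant = re.compile(
    r"<!--\s*\| Startsection\s*\|-->\s*<!--\s*\| Endsection\s*\|-->"
)
# \s+ requires at least one whitespace character at each position.
strict = re.compile(
    r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->"
)

# The two comments directly next to each other, with nothing in between.
adjacent = "<!--\n | Startsection\n |--><!--\n | Endsection\n |-->"
# A module element placed between the two comments.
with_module = (
    "<!--\n | Startsection\n |-->\n"
    "  <module>example.test1</module>\n"
    "<!--\n | Endsection\n |-->"
)

print(bool(tolerant.search(adjacent)))
print(bool(strict.search(adjacent)))
print(bool(tolerant.search(with_module)))
```

Either way, any non-whitespace text between the two comments prevents a match, so a failed match on a file that contains the markers suggests the section was edited.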

You can find more information about Python regular expressions here: https://docs.python.org/2/library/re.html

Exact string search in XML files?
