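I need to search a number of XML files (all named pom.xml, including those in subfolders) for the text sequence below, so that I am alerted if someone has written any text, or even whitespace, inside it: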
<!--
| Startsection
|-->
<!--
| Endsection
|-->
I am running the following Python script, but it still does not match exactly: I get an alert even when only part of the inner text is present:
import re
import os
from os.path import join

comment = re.compile(r"<!--\s+| Startsection\s+|-->\s+<!--\s+| Endsection\s+|-->")
tag = "<module>"
for root, dirs, files in os.walk("."):
    if "pom.xml" in files:
        p = join(root, "pom.xml")
        print("Checking", p)
        with open(p) as f:
            s = f.read()
        if tag in s and comment.search(s):
            print("Matched", p)
Update #3
If a <module> element appears between "|-->" and "<!--" in the block I am searching for, I want to print its content:
<!--
| Startsection
|-->
<!--
| Endsection
|-->
For example, after a match I want to print the file name and, in the example below, also print "example.test1":
<!--
| Startsection
|-->
<module>example.test1</module>
<!--
| Endsection
|-->
Update #4
The following should be used:
import re
import os
from os.path import join

comment = re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->", re.MULTILINE)
tag = "<module>"
for root, dirs, files in os.walk("/home/temp/test_folder/"):
    # Remove directories in place so os.walk does not descend into them
    for skipped in ("test1", "test2", ".repotest"):
        if skipped in dirs:
            dirs.remove(skipped)
    if "pom.xml" in files:
        p = join(root, "pom.xml")
        print("Checking", p)
        with open(p) as f:
            s = f.read()
        if tag in s and comment.search(s):
            print("The following files are corrupted ", p)
Update #5
import re
import os
import xml.etree.ElementTree as etree
from bs4 import BeautifulSoup
from bs4 import Comment
from os.path import join

comment = re.compile(r"<!--\s+\| Startsection\s+\|-->\s+<!--\s+\| Endsection\s+\|-->", re.MULTILINE)
tag = "<module>"
for root, dirs, files in os.walk("myfolder"):
    # Remove directories in place so os.walk does not descend into them
    for skipped in ("model", "doc"):
        if skipped in dirs:
            dirs.remove(skipped)
    if "pom.xml" in files:
        p = join(root, "pom.xml")
        print("Checking", p)
        with open(p) as f:
            s = f.read()
        if tag in s and comment.search(s):
            print("ERROR: The following file is corrupted", p)
            # Reuse the text already read instead of opening the file again
            bs = BeautifulSoup(s, "html.parser")
            # Extract all comments
            comments = bs.find_all(string=lambda text: isinstance(text, Comment))
            for c in comments:
                # Check if it's the start of the section
                if "Startsection" in c:
                    # <module> tags that follow the comment at the same level
                    modules = c.find_next_siblings(name="module")
                    for mod in modules:
                        print(mod.text)
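Answer 0 (score: 1)
Don't parse XML files with regular expressions. The best Stack Overflow answer ever explains why.
You can use BeautifulSoup to help with this task.
See how simple it is to extract the content from your snippet: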
from bs4 import BeautifulSoup
content = """
<!--
| Start of user code (user defined modules)
|-->
<!--
| End of user code
|-->
"""
bs = BeautifulSoup(content, "html.parser")
print(''.join(bs.contents))
Of course, you can use your XML file instead of the string literal I am using:
bs = BeautifulSoup(open("pom.xml"), "html.parser")
A small example using the expected input:
from bs4 import BeautifulSoup
from bs4 import Comment

# Parse the POM; "pom.xml" is the same file used above
with open("pom.xml") as f:
    bs = BeautifulSoup(f, "html.parser")
# Extract all comments
comments = bs.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    # Check if it's the start of the code
    if "Start of user code" in c:
        # <module> tags that follow the comment at the same level
        modules = c.find_next_siblings(name="module")
        for mod in modules:
            print(mod.text)
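But if your content always sits inside module tags, I don't see why you should care about the comments before or after it; you can simply look up the module tags directly.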
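Looking up the module tags directly could be sketched with the standard library's xml.etree.ElementTree, as below. This is only an illustration under assumptions: the helper name module_names and the sample POM text are made up, and the namespace handling assumes tags may arrive as "{namespace}module".

```python
import xml.etree.ElementTree as ET


def module_names(pom_text):
    """Return the text of every <module> element, ignoring any XML namespace."""
    root = ET.fromstring(pom_text)  # comments are dropped by the default parser
    names = []
    for element in root.iter():
        # Tags look like "{namespace}module" when the POM declares a namespace.
        if element.tag.rsplit("}", 1)[-1] == "module":
            names.append((element.text or "").strip())
    return names


# Hypothetical sample input, modelled on the snippet in the question.
sample = """<project>
  <modules>
    <!--
     | Startsection
     |-->
    <module>example.test1</module>
    <!--
     | Endsection
     |-->
  </modules>
</project>"""

print(module_names(sample))
```

Note that this lists every module in the file, not only those between the two marker comments, so it fits only if that distinction does not matter to you.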
Answer 1 (score: 0)
The characters "|", "(" and ")" must be escaped, and re.MULTILINE should be added to the regular expression.
comment=re.compile(r"<!--\s+\| Start of user code \(user defined modules\)\s+\|-->\s+<!--\s+\| End of user code\s+\|-->", re.MULTILINE)
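To see why the escaping matters, here is a small sketch comparing an unescaped and an escaped pattern; the names unescaped, escaped, unrelated and marker are made up for the demonstration. An unescaped "|" is the alternation operator, so the first pattern matches as soon as any one of its alternatives is found. As far as the re documentation goes, re.MULTILINE only changes how "^" and "$" behave, so it should make no difference to a pattern without those anchors.

```python
import re

# Unescaped: three alternatives, "<!--\s+" OR " Startsection\s+" OR "-->".
unescaped = re.compile(r"<!--\s+| Startsection\s+|-->")
# Escaped: one sequence containing literal "|" characters.
escaped = re.compile(r"<!--\s+\| Startsection\s+\|-->")

unrelated = "<!-- some other comment -->"
marker = "<!--\n| Startsection\n|-->"

print(bool(unescaped.search(unrelated)))  # True: any comment opener matches
print(bool(escaped.search(unrelated)))    # False
print(bool(escaped.search(marker)))       # True
```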
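Edit: you can also match a newline explicitly in the regular expression with \n
Any amount of whitespace (including none) is \s*
You can find more information about Python regular expressions here: https://docs.python.org/2/library/re.html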
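Building on \s*, a capture group can report what was written between the two marker comments. The following is a rough sketch, not a definitive solution: the function name inserted_text and the sample strings are hypothetical, and it treats a whitespace-only gap as clean, which is an assumption you may need to tighten if whitespace should also raise an alert.

```python
import re

# (.*?) captures everything between the two marker comments;
# re.DOTALL lets "." match newlines as well.
section = re.compile(
    r"<!--\s*\| Startsection\s*\|-->(.*?)<!--\s*\| Endsection\s*\|-->",
    re.DOTALL,
)


def inserted_text(pom_text):
    """Return the stripped text between the markers, or None if they are absent."""
    match = section.search(pom_text)
    if match is None:
        return None
    return match.group(1).strip()


clean = "<!--\n| Startsection\n|-->\n<!--\n| Endsection\n|-->"
dirty = (
    "<!--\n| Startsection\n|-->\n"
    "<module>example.test1</module>\n"
    "<!--\n| Endsection\n|-->"
)

print(repr(inserted_text(clean)))
print(repr(inserted_text(dirty)))
```

An empty string means the section is untouched, while any other value is the text someone added.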