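I have tried many approaches without much progress. I want to convert the < and > characters that appear between <p> tags into entities (&lt; and &gt;), but the content between the tags can contain line breaks. For example: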
<html>
<li>This is a test file.</li>
<p>This is the sentence I want to <a
href="XXXX.com">do</a> the entity conversion.</p>
<p>This is a second <li>sentence</li>.</p>
</html>
The expected result is:
<html>
<li>This is a test file.</li>
<p>This is the sentence I want to &lt;a
href="XXXX.com"&gt;do&lt;/a&gt; the entity conversion.</p>
<p>This is a second &lt;li&gt;sentence&lt;/li&gt;.</p>
</html>
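I can't seem to find all of the < and > characters that need converting.
If I use the regex <seg.*(<), it doesn't find every <. If I use a positive lookahead such as <(?=.*<\/p>)(?!p>), it misses some < whenever a line break occurs inside the <p> content.
If I combine the positive lookahead with re.DOTALL, <(?=.*<\/p>)(?!p>), it finds every <, including the ones that should not be converted...
If you have any ideas for a better regex or a better approach, please let me know. Thank you very much!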
Answer 0 (score: 1)
Use the xml.dom module:
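from xml.dom import minidom

doc = minidom.parse("yourfile")
for p in doc.getElementsByTagName('p'):
    # Assumes the element to escape is the second child of each <p>
    # (text, element, text). Serializing it and re-inserting it as a
    # text node makes minidom write its < and > as entities.
    text_node = doc.createTextNode(p.childNodes[1].toxml())
    p.replaceChild(text_node, p.childNodes[1])
print(doc.childNodes[0].toxml())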
Output (some Python versions also write the double quotes as &quot;):
<html>
<li>This is a test file.</li>
<p>This is the sentence I want to &lt;a href="XXXX.com"&gt;do&lt;/a&gt; the entity conversion.</p>
<p>This is a second &lt;li&gt;sentence&lt;/li&gt;.</p>
</html>
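The snippet above only touches p.childNodes[1], so it relies on each <p> holding exactly one element in second position. A minimal sketch of a more general variant follows, assuming the input is well-formed XML. The helper name escape_children_of_p and the sample string are made up for illustration and are not part of the original answer.

```python
from xml.dom import minidom


def escape_children_of_p(xml_text):
    """Turn every element directly inside a <p> into escaped text."""
    doc = minidom.parseString(xml_text)
    for p in doc.getElementsByTagName('p'):
        # Copy the child list first, because it is modified while looping.
        for child in list(p.childNodes):
            if child.nodeType == child.ELEMENT_NODE:
                # A text node holding the serialized element is written
                # back with its < and > converted to entities.
                p.replaceChild(doc.createTextNode(child.toxml()), child)
    return doc.documentElement.toxml()


# Hypothetical sample with two elements inside one <p>.
sample = '<html><p>a <b>x</b> and <i>y</i></p></html>'
result = escape_children_of_p(sample)
print(result)
```

Because the document goes through a real parser, line breaks inside a tag (as in the <a ... href> example) need no special handling, but the input has to parse as XML in the first place.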
Answer 1 (score: 0)
You can also try this:
# Read the whole input file as a list of lines.
with open('old.txt', 'r') as f:
    lines = f.readlines()

inside_p = False  # True while we are between <p> and its closing </p>

# Write the converted lines to a separate output file.
with open('filesort.txt', 'w') as out:
    for line in lines:
        new_line = line
        if inside_p or '<p>' in line:
            # Escape every angle bracket on the line, then restore the
            # <p> and </p> delimiters themselves. Note that any other
            # markup on the same line as <p> or </p> is escaped too.
            new_line = new_line.replace('<', '&lt;').replace('>', '&gt;')
            new_line = new_line.replace('&lt;p&gt;', '<p>')
            new_line = new_line.replace('&lt;/p&gt;', '</p>')
            # Stay inside the paragraph until the line with </p> is seen.
            inside_p = '</p>' not in line
        out.write(new_line)
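The line-by-line loop above can also be written as a single substitution over the whole text. This is a minimal sketch, assuming <p> tags carry no attributes and are never nested. The names P_BLOCK and escape_inside_p are hypothetical. Note that html.escape also converts & to &amp;, which goes slightly beyond what the question asks for. Matching each <p>...</p> block non-greedily avoids the over-matching seen with the DOTALL lookahead, where .* can run on to a </p> that belongs to a later paragraph.

```python
import html
import re

# Match each <p>...</p> block non-greedily; DOTALL lets "." cross newlines.
P_BLOCK = re.compile(r'<p>(.*?)</p>', re.DOTALL)


def escape_inside_p(text):
    """Escape &, < and > inside every <p>...</p>, keeping the <p> tags."""
    def repl(match):
        # quote=False leaves double quotes such as href="..." untouched.
        return '<p>' + html.escape(match.group(1), quote=False) + '</p>'
    return P_BLOCK.sub(repl, text)


# Hypothetical sample: a tag split across two lines inside a <p>.
sample = '<li>keep</li>\n<p>to <a\nhref="X">do</a> it.</p>\n'
converted = escape_inside_p(sample)
print(converted)
```

Markup outside the <p> blocks, such as the leading <li>, is left as it is because the pattern never matches it.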