Question

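I have tried many approaches but have not made much progress. I want to convert the < and > that appear between <p> tags into entities, but the content between the tags can contain line breaks. For example: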

<html>
  <li>This is a test file.</li>
  <p>This is the sentence I want to <a
  href="XXXX.com">do</a> the entity conversion.</p>
  <p>This is a second <li>sentence</li>.</p>
</html>

The expected result is:

<html>
  <li>This is a test file.</li>
  <p>This is the sentence I want to &lt;a
  href="XXXX.com"&gt;do&lt;/a&gt; the entity conversion.</p>
  <p>This is a second &lt;li&gt;sentence&lt;/li&gt;.</p>
</html>

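I cannot seem to find all the < and > that need converting.
If I use the regex <seg.*(<), it does not find every <. If I use a lookahead such as <(?=.*<\/p>)(?!p>), it works, except that when the content of a <p> contains a line break it no longer finds every <.
If I use the same lookahead with re.DOTALL, <(?=.*<\/p>)(?!p>), it finds every <, including the ones I do not want to convert...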
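
For reference, a minimal reproduction of the two behaviours described above, assuming the sample document is held in a Python string (the names sample, per_line and dotall are only illustrative):

```python
import re

sample = """<html>
  <li>This is a test file.</li>
  <p>This is the sentence I want to <a
  href="XXXX.com">do</a> the entity conversion.</p>
  <p>This is a second <li>sentence</li>.</p>
</html>
"""

pattern = r'<(?=.*<\/p>)(?!p>)'

# Without DOTALL, "." stops at a newline, so a "<" only matches when a
# closing </p> follows it on the same line. The "<a" on the wrapped
# line is therefore missed.
per_line = [m.start() for m in re.finditer(pattern, sample)]

# With DOTALL, ".*" can run to the end of the text, so every "<" that has
# a </p> anywhere after it matches, including <html>, the first <li> line
# and the first </p>.
dotall = [m.start() for m in re.finditer(pattern, sample, flags=re.DOTALL)]

print(len(per_line), len(dotall))
```

Four < characters should be converted in this sample, so neither variant gives the wanted set: the first finds too few and the second too many.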

If you have any ideas for a better regex or a better approach, please let me know. Thank you very much!

Answer 1

Use the xml.dom.minidom module:

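from xml.dom import minidom

# minidom.parse() needs well-formed XML, which the sample is.
doc = minidom.parse("yourfile")
for p in doc.getElementsByTagName('p'):
    # Copy the child list first, because nodes are replaced during the loop.
    for child in list(p.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            # Re-insert the nested element's markup as plain text;
            # minidom escapes < and > when a text node is serialized.
            p.replaceChild(doc.createTextNode(child.toxml()), child)

print(doc.documentElement.toxml())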

Output:

<html>
  <li>This is a test file.</li>
  <p>This is the sentence I want to &lt;a href=&quot;XXXX.com&quot;&gt;do&lt;/a&gt; the entity conversion.</p>
  <p>This is a second &lt;li&gt;sentence&lt;/li&gt;.</p>
</html>
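
The approach relies on minidom escaping the data of a text node when it is serialized. A minimal sketch of that single step, assuming a small in-memory document (the input string is made up for illustration; the answer itself parses a file):

```python
from xml.dom import minidom

# Hypothetical one-element document, used only to show the escaping step.
doc = minidom.parseString('<p>a <li>b</li> c</p>')
p = doc.documentElement

inner = p.childNodes[1]   # the nested <li> element, between two text nodes
markup = inner.toxml()    # its markup as a plain string: '<li>b</li>'

# Swap the element for a text node holding that markup.
p.replaceChild(doc.createTextNode(markup), inner)

result = p.toxml()
print(result)  # <p>a &lt;li&gt;b&lt;/li&gt; c</p>
```

Because the element is re-serialized by toxml(), a line break inside a tag (as in the <a ... href> of the sample) is not preserved in the output.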

Answer 2

You can also try this:

# Line-based state machine: escape < and > on every line that belongs to a
# <p>...</p> block, then restore the <p> and </p> tags themselves.
# Assumes <p> has no attributes and nothing else shares a line with the block.
with open('old.txt', 'r') as f:
    lines = f.readlines()

in_p = False  # True while inside a <p> that has not been closed yet
with open('filesort.txt', 'w') as out:
    for line in lines:
        nl = line
        if in_p or '<p>' in line:
            nl = nl.replace('<', '&lt;').replace('>', '&gt;')
            nl = nl.replace('&lt;p&gt;', '<p>')
            nl = nl.replace('&lt;/p&gt;', '</p>')
            # Stay in the block until a line containing </p> is seen.
            in_p = '</p>' not in line
        out.write(nl)
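
The same idea (escape everything inside a <p> block and keep the <p> tags themselves) can also be sketched with re.sub and a callback instead of a line-by-line loop. This is a different technique from the code above, not a rewrite of it. It assumes <p> carries no attributes and that <p> blocks are not nested. The function name escape_inside_p is made up for this example.

```python
import re
from html import escape

def escape_inside_p(text):
    """Escape &, < and > in the body of each <p>...</p> block."""
    # The non-greedy body plus DOTALL lets one block span several lines
    # while still stopping at the first </p>.
    pattern = re.compile(r'(<p>)(.*?)(</p>)', re.DOTALL)
    return pattern.sub(
        # quote=False leaves double quotes alone, as in the expected result.
        lambda m: m.group(1) + escape(m.group(2), quote=False) + m.group(3),
        text,
    )

sample = """<html>
  <li>This is a test file.</li>
  <p>This is the sentence I want to <a
  href="XXXX.com">do</a> the entity conversion.</p>
  <p>This is a second <li>sentence</li>.</p>
</html>
"""

converted = escape_inside_p(sample)
print(converted)
```

Unlike the DOM approach, this keeps the original line breaks inside tags. Note that html.escape also turns & into &amp;, so entities already present in a <p> body would be escaped a second time.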

How can I iterate over or find all the wanted occurrences in an XML file and replace them?
