Question

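I'm trying to replace every occurrence of an XML tag in one document (call it the target) with the contents of a tag from a different document (call it the source). The tag from the source may contain only text, or it may contain more XML.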

Here is a simple example of what I can't get to work:

test-source.htm:

<?xml version="1.0" encoding="utf-8"?>
<html>
    <head>
    </head>
    <body>
    <srctxt>text to be added</srctxt>
    </body>
</html>

test-target.htm:

<?xml version="1.0" encoding="utf-8"?>
<html>
    <head>
    </head>
    <body>
    <replacethis src="test-source.htm"></replacethis>
    <p>irrelevant, just here for filler</p>
    <replacethis src="test-source.htm"></replacethis>
    </body>
</html>

replace_example.py:

import os
import re
from bs4 import BeautifulSoup
# Just for testing

source_file = "test-source.htm"
target_file = "test-target.htm"

with open(source_file) as s:
    source = BeautifulSoup(s, "lxml")

with open(target_file) as t:
    target = BeautifulSoup(t, "lxml")

source_tag = source.srctxt

for tag in target():
    for attribute in tag.attrs:
        if re.search(source_file, str(tag[attribute])):
            tag.replace_with(source_tag)

with open(target_file, "w") as w:
    w.write(str(target))

After running replace_example.py, this is what test-target.htm unfortunately looks like:

<?xml version="1.0" encoding="utf-8"?><html>
<head>
</head>
<body>

<p>irrelevant, just here for filler</p>
<srctxt>text to be added</srctxt>
</body>
</html>

The first replacethis tag is now gone, and only the second replacethis tag has been replaced. The same problem occurs with "insert" and "insert_before".

The output I want is:

<?xml version="1.0" encoding="utf-8"?><html>
<head>
</head>
<body>
<srctxt>text to be added</srctxt>    
<p>irrelevant, just here for filler</p>
<srctxt>text to be added</srctxt>
</body>
</html>

Can someone point me in the right direction?

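Additional complication: the example above is the simplest case in which I could reproduce the problem I seem to be having with BeautifulSoup, but it doesn't convey the full detail of the problem I'm trying to solve. In reality, I have a list of targets and a list of sources. A replacethis tag needs to be replaced by the contents of a source only if its src attribute contains a reference to a source in the list. I could use a plain string replace instead, but that would mean writing a lot more regex than if I can convince BeautifulSoup to work. If this problem is a BeautifulSoup bug, then I may just have to write the regex.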

Answer 1

If you want to get rid of the extra tags, you can use a different parser (html.parser).

The replace_with behaviour in BS4 looks like a bug in the library.

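As a partial workaround, you can simply do the replacement on the raw text, where target_text and source_text are presumably the contents of the two files read as strings. Note that str.replace returns a new string, and that the search string has to match the tag exactly as it appears in the target, including its src attribute:

target_text = target_text.replace('<replacethis src="test-source.htm"></replacethis>', source_text)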
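Another possibility worth checking before falling back on string replacement: a BeautifulSoup element can only sit at one position in a tree, so passing the same source_tag object to replace_with a second time moves it out of the first position rather than duplicating it. That would explain why only the last replacethis tag ends up replaced. The sketch below inserts a fresh copy each time via copy.copy, which Beautiful Soup supports for Tag objects since version 4.4.0. The helper name replace_all and the sources dictionary are hypothetical, introduced only for illustration, and html.parser is used so that the example does not depend on lxml.

```python
import copy

from bs4 import BeautifulSoup

SOURCE = "<html><body><srctxt>text to be added</srctxt></body></html>"
TARGET = (
    "<html><body>"
    '<replacethis src="test-source.htm"></replacethis>'
    "<p>irrelevant, just here for filler</p>"
    '<replacethis src="test-source.htm"></replacethis>'
    "</body></html>"
)


def replace_all(target_markup, sources):
    """Replace each <replacethis> whose src is a key of `sources`.

    `sources` (a hypothetical structure) maps a src attribute value to the
    Tag that should take the place of the matching <replacethis> tag.
    """
    target = BeautifulSoup(target_markup, "html.parser")
    # find_all returns a list, so the tree can be modified while looping.
    for tag in target.find_all("replacethis"):
        replacement = sources.get(tag.get("src"))
        if replacement is not None:
            # Insert a copy: reusing the same Tag object would move it
            # away from wherever it was inserted previously.
            tag.replace_with(copy.copy(replacement))
    return target


source_tag = BeautifulSoup(SOURCE, "html.parser").srctxt
result = replace_all(TARGET, {"test-source.htm": source_tag})
print(result)
```

Looking sources up by the src value also covers the case where only tags that reference a known source should be replaced; tags with an unknown or missing src are left untouched.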

Answer 2

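First, it is strongly advised not to use regex on [X]HTML documents. Since you are modifying XML content, consider an lxml solution; you already have lxml installed, since it is the parsing engine in your BeautifulSoup calls. This approach requires no for or if logic.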

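Specifically, consider XSLT, a special-purpose language designed to transform XML into other XML, HTML, or even json/csv/txt files. XSLT provides the document() function, which lets you read from other documents during the transformation. Python's lxml can run XSLT 1.0 scripts.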

XSLT (save as an .xsl file in the same folder as the source file, and adjust the 'replacethis' and 'srctxt' names as needed)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- UPDATE <replacethis> TAG WITH <srctxt> FROM SOURCE -->
  <xsl:template match="replacethis">
    <xsl:copy-of select="document('test-source.htm')/html/body/srctxt"/>
  </xsl:template>

</xsl:stylesheet>

Python

import lxml.etree as et

# LOAD XML AND XSL SOURCES
doc = et.parse('test-target.htm')
xsl = et.parse('XSLTScript.xsl')

# TRANSFORM SOURCE
transform = et.XSLT(xsl)
result = transform(doc)

# OUTPUT TO SCREEN
print(result)    

# OUTPUT TO FILE
with open('test-target.htm', 'wb') as f:
    f.write(result)

Output

<?xml version="1.0"?>
<html>
  <head/>
  <body>
    <srctxt>text to be added</srctxt>
    <p>irrelevant, just here for filler</p>
    <srctxt>text to be added</srctxt>
  </body>
</html>
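
To cover the case from the question where each replacethis tag names its own source in src, the stylesheet might be adapted so that document() reads the file named by the attribute instead of a hard-coded file name. The following is only a sketch under that assumption and is not part of the setup described so far: the match pattern replacethis[@src], the helper name transform_target, and the temporary demo files are all introduced for illustration. When document() is given an attribute node, XSLT 1.0 resolves a relative file name against the location of the document that contains the attribute, so the source files are expected next to the target.

```python
import os
import tempfile

import lxml.etree as et

# Hypothetical variant of the stylesheet: document(@src) loads whichever
# file the src attribute names, rather than a fixed 'test-source.htm'.
XSLT_TEXT = b"""<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="replacethis[@src]">
    <xsl:copy-of select="document(@src)/html/body/srctxt"/>
  </xsl:template>
</xsl:stylesheet>"""

SOURCE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<html><head></head><body><srctxt>text to be added</srctxt></body></html>"""

TARGET_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<html><head></head><body>
<replacethis src="test-source.htm"></replacethis>
<p>irrelevant, just here for filler</p>
<replacethis src="test-source.htm"></replacethis>
</body></html>"""


def transform_target(target_path):
    """Apply the stylesheet to the file at target_path; return the result as text."""
    transform = et.XSLT(et.fromstring(XSLT_TEXT))
    return str(transform(et.parse(target_path)))


# Demo files live in a temporary folder so nothing real is overwritten.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "test-source.htm"), "wb") as f:
        f.write(SOURCE_XML)
    target_path = os.path.join(folder, "test-target.htm")
    with open(target_path, "wb") as f:
        f.write(TARGET_XML)
    output = transform_target(target_path)

print(output)
```

Because the file name comes from the attribute, one stylesheet can serve several sources; restricting replacement to a known list of sources would need an extra condition in the match pattern.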

Replace every occurrence of an XML tag with another tag using BeautifulSoup
