Python: modifying text with BeautifulSoup

Question

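I need to post-process a large number of XHTML files. I did not generate these files, so I cannot fix the code that produced them. I cannot run a regular expression over a whole file, only over highly selective fragments, because some links and IDs contain digits that I must not change globally.

I have simplified this example, because the original files contain RTL text. I am only interested in changing the digits in the visible text, not in the markup. There seem to be 3 different cases.

Fragments from bk1.xhtml:

Case 1: cross-reference with a link; the digits are in the xt span, inside the embedded bookref text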

<aside epub:type='footnote' id="FN96"><p class="x"><a class="notebackref" href="#bk1_21_9"><span class="notemark">*</span>text</a>
<span class="xt"> <a class='bookref' href='bk50.xhtml#bk50_118_26'>some text with these digits: 26:118</a></span></p></aside>

Case 2: cross-reference without a link; the digits are in the xt span, with no embedded bookref text

<aside epub:type='footnote' id="FN100"><p class="x"><a class="notebackref" href="#bk1_21_42"><span class="notemark">*</span>text</a>
<span class="xt">some text with these digits: 26:118</span></p></aside>

Case 3: footnote without a link, but with digits in the ft text

<aside epub:type='footnote' id="FN107"><p class="f"><a class="notebackref" href="#bk1_22_44"><span class="notemark">§</span>text</a>
<span class="ft">some text with these digits: 22</span></p></aside>

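I am trying to work out how to identify the text strings inside the user-visible parts, so that I modify only the relevant digits:

Case 1: I need to capture <a class='bookref' href='bk1.xhtml#bk1_118_26'>some text 26:118</a>, assign the "some text 26:118" substring to a variable, and run a regex against that variable; then put the substring back where it was in the file.

Case 2: I only need to capture <span class="xt">some text 26:118</span> and change only the digits in the "some text 26:118" substring by running a regex against that variable; then put the substring back where it was in the file.

Case 3: I only need to capture <span class="ft">some text 22</span> and change only the digits in the "some text 22" substring by running a regex against that variable; then put the substring back where it was in the file.

I have thousands of these across many files. I know how to iterate over the files.

Once I have processed all the patterns in one file, I need to write out the changed tree.

I only need to post-process it to fix the text.

I have been googling, reading, and watching a lot of tutorials, and I am confused.

Thank you for any help.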

Answer 1

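It looks like you want the .replaceWith() method (spelled replace_with() in current BeautifulSoup releases). First, find all the text you want to match: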

from bs4 import BeautifulSoup

cases = '''
<aside epub:type='footnote' id="FN96"><p class="x"><a class="notebackref" href="#bk1_21_9"><span class="notemark">*</span>text</a>
<span class="xt"> <a class='bookref' href='bk50.xhtml#bk50_118_26'>some text with these digits: 26:118</a></span></p></aside>

<aside epub:type='footnote' id="FN100"><p class="x"><a class="notebackref" href="#bk1_21_42"><span class="notemark">*</span>text</a>
<span class="xt">some text with these digits: 26:118</span></p></aside>

<aside epub:type='footnote' id="FN107"><p class="f"><a class="notebackref" href="#bk1_22_44"><span class="notemark">§</span>text</a>
<span class="ft">some text with these digits: 22</span></p></aside>
'''

soup = BeautifulSoup(cases, 'lxml')

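# find_all() is the current name for findAll()
case1 = soup.find_all('a', {'class': 'bookref'})
case2 = soup.find_all('span', {'class': 'xt'})
case3 = soup.find_all('span', {'class': 'ft'})

for match in case1 + case2 + case3:
    text = match.string  # None when the tag has more than one child, e.g. the first span.xt
    print(text)
    if text:
        new_text = text.replace('some text', 'modified!')  # put your regex substitution here
        text.replace_with(new_text)  # current name for replaceWith()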

The print(text) call in the loop prints:

some text with these digits: 26:118
None
some text with these digits: 26:118
some text with these digits: 22

If we run the loop again, it now prints:

modified! with these digits: 26:118
None
modified! with these digits: 26:118
modified! with these digits: 22
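
To tie this back to the question, here is a minimal sketch of the whole round trip: change only the digits in the visible text of the xt and ft spans, leave every attribute alone, and get the changed tree back as a string. It is not from the original answer. The function names and the digit transformation (ASCII digits to Arabic-Indic digits) are assumptions made purely for illustration, since the question does not say what the real change is. It also uses the built-in 'html.parser' instead of 'lxml' so that it needs nothing beyond bs4, and it walks the text nodes inside each span rather than relying on .string, so the nested bookref link of case 1 is covered as well.

import re
from bs4 import BeautifulSoup

DIGITS = re.compile(r"[0-9]+")

def to_arabic_indic(match):
    # Hypothetical transformation, assumed only for illustration:
    # map each ASCII digit to the matching Arabic-Indic digit.
    return "".join(chr(0x0660 + int(d)) for d in match.group(0))

def fix_visible_digits(markup):
    soup = BeautifulSoup(markup, "html.parser")
    # A list for class_ matches a span carrying either class.
    for span in soup.find_all("span", class_=["xt", "ft"]):
        # find_all(string=True) returns the text nodes at any depth,
        # so text inside a nested <a class="bookref"> is included.
        for node in span.find_all(string=True):
            new_text = DIGITS.sub(to_arabic_indic, node)
            if new_text != node:
                node.replace_with(new_text)
    # str(soup) serializes the modified tree.
    return str(soup)

sample = (
    '<aside id="FN96"><p class="x"><a class="notebackref" href="#bk1_21_9">text</a>'
    '<span class="xt"> <a class="bookref" href="bk50.xhtml#bk50_118_26">'
    'digits: 26:118</a></span></p></aside>'
    '<aside id="FN107"><p class="f"><span class="ft">digits: 22</span></p></aside>'
)
result = fix_visible_digits(sample)
print(result)

Because only text nodes are touched, the digits in href and id values survive unchanged. The returned string can be written back with an ordinary file write. For real XHTML files the choice of parser affects how the output is serialized, so it is worth comparing one output file against its input before running this over thousands of files.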
