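Find and replace across mixed tags and strings with BeautifulSoup in Python

Time: 2018-11-25 20:18:30

Tags: python python-3.x beautifulsoup

How can I use BeautifulSoup's .replace_with() so that angle brackets are not converted to &gt; after a str() find-and-replace?

Python code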

from bs4 import BeautifulSoup

with open("../dicttest.txt", "r", encoding="utf-8") as f:
    full_text = f.read()
    parse_1 = BeautifulSoup(full_text, "html.parser")
    for line in parse_1.find_all("grace", "AllExamples"):
        match = str(line).replace(";</i> <b>", ";</i><br> <b>")
        line.replace_with(match)
        print(parse_1)

dicttest.txt

all
<link rel="stylesheet" type="text/css" href="stylesheet.css"><font size="-2">Duden-Oxford Deutsch-Englisch</font><br><grace class="SglMngArticle"><span class="WordHead"><b>all</b></span> <grace class="IPA">/al/</grace> <i>Indefinitpron.</i> <i>u. unbest. Zahlw.</i> </grace><br><br><grace class="NumArticle"><span class="Number">1.</span> <i>attr.</i> (<i>ganz, gesamt...</i>) all; </grace><grace class="AllExamples"><grace class="BoldExamples"><b>in aller Deutlichkeit</b></grace> in all clarity;<br> <grace class="BoldExamples"><b>alle Freude, die sie empfunden hat</b></grace> all the joy she felt;<br> <grace class="BoldExamples"><b>alles Geld, das ich noch habe</b></grace> all the money I have left;<br> <grace class="BoldExamples"><b>aller Eifer nützte ihm nichts</b></grace> all his zeal was to no avail;<br> <grace class="BoldExamples"><b>ich kann diese Leute alle nicht leiden</b></grace> I can't stand any of these people;<br> <grace class="BoldExamples"><b>ich will euch alle nicht mehr sehen</b></grace> I don't want to see any of you again;<br> <grace class="BoldExamples"><b>die Ärzte verdienen alle sehr viel</b></grace> doctors all earn a great deal;<br> <grace class="BoldExamples"><b>alles Geld spendete sie dem Roten Kreuz</b></grace> she donated all her money to the Red Cross;<br> <grace class="BoldExamples"><b>alles Leid der Welt</b></grace> all the suffering in the world;<br> <grace class="BoldExamples"><b>all unser/mein </b><i>usw.</i> <b>...</b> all our/my <i>etc. 
...;</i> <b>alles andere/Weitere/Übrige</b></grace> everything else;<br> <grace class="BoldExamples"><b>alles Übrige hat sich nicht geändert</b></grace> nothing else has changed;<br> <grace class="BoldExamples"><b>alles Schöne/Neue/Fremde</b></grace> everything <i>or</i> all that is beautiful/new/strange;<br> <grace class="BoldExamples"><b>alles Gute!</b></grace> all the best!;<br> <grace class="BoldExamples"><b>alle Fenster schließen</b></grace> close all the windows;<br> <grace class="BoldExamples"><b>sie gaben alle Waffen ab</b></grace> they handed in all their weapons;<br> <grace class="BoldExamples"><b>wir/ihr/sie alle</b></grace> all of us/you/them; we/you/they all;<br> <grace class="BoldExamples"><b>das sagen sie alle</b></grace> (<i>ugs.</i>) that's what they all say;<br> <grace class="BoldExamples"><b>alle Beteiligten/Anwesenden</b></grace> all those involved/present;<br> <grace class="BoldExamples"><b>trotz aller Vorbehalte werde ich ...</b></grace> in spite of all my reservations I shall ...;<br> <grace class="BoldExamples"><b>alle beide/alle zehn</b></grace> both of them/all ten of them;<br> <grace class="BoldExamples"><b>alle Männer/Frauen/Kinder</b></grace> all men/women/children;<br> <grace class="BoldExamples"><b>alle Mädchen über zwölf Jahre</b></grace> all girls over twelve;<br> <grace class="BoldExamples"><b>alle Mädchen in der Schule</b></grace> all the girls in the school;<br> <grace class="BoldExamples"><b>alle Bewohner der Stadt</b></grace> all the inhabitants of the town;<br> <grace class="BoldExamples"><b>ohne allen Anlass</b></grace> for no reason [at all]; without any reason [at all];<br> <grace class="BoldExamples"><b>gegen alle Erwartungen</b></grace> contrary to all expectations;<br> <grace class="BoldExamples"><b>alle Jahre wieder</b></grace> every year;<br> <grace class="BoldExamples"><b>alle fünf Minuten/Meter</b></grace> every five minutes/metres;<br> <grace class="BoldExamples"><b>Bücher aller Art</b></grace> books of all kinds; 
all kinds of books;<br> <grace class="BoldExamples"><b>in aller Eile</b></grace> with all haste;<br> <grace class="BoldExamples"><b>in aller Ruhe</b></grace> in peace and quiet;<br> <grace class="BoldExamples"><b>trotz aller Versuche/Anstrengungen</b></grace> despite all [his/her/their/<i>etc.</i>] attempts/efforts. </grace><br><br><grace class="NumArticle"><span class="Number">2.</span> <i>allein stehend</i> </grace><br><br><grace class="LetterArticle"><span class="Letter">a) </span>(<i>gesamt..., sämtlich</i>) everything; </grace><grace class="AllExamples"><grace class="BoldExamples"><b>alles geht vorüber</b></grace> everything passes [in time];<br> <grace class="BoldExamples"><b>alles für die Braut/den Bastler</b></grace> everything for the bride/handicraft enthusiast;<br> <grace class="BoldExamples"><b>das alles</b></grace> all that;<br> <grace class="BoldExamples"><b>ich weiß nicht, was das alles soll</b></grace> I don't know what all that is supposed to mean;<br> <grace class="BoldExamples"><b>das ist alles Unsinn</b></grace> that is all nonsense;<br> <grace class="BoldExamples"><b>von allem etwas verstehen/wissen</b></grace> understand/know a bit about everything;<br> <grace class="BoldExamples"><b>wer alles war </b><i>od.</i> <b>wer war alles dort</b></grace> who was there?;<br> <grace class="BoldExamples"><b>wen alles habt ihr getroffen?</b></grace> who did you meet?;<br> <grace class="BoldExamples"><b>das sind alles Gauner</b></grace> they're all scoundrels;<br> <grace class="BoldExamples"><b>was gab es dort alles zu sehen?</b></grace> what was there to see?;<br> <grace class="BoldExamples"><b>was es nicht alles gibt!</b></grace> well, would you believe it!; well, I never!;<br> <grace class="BoldExamples"><b>all[es] und &nbsp;jedes</b></grace> everything; (<i>wahllos</i>) anything and everything;<br> <grace class="BoldExamples"><b>trotz allem</b></grace> in spite of <i>or</i> despite everything;<br> <grace class="BoldExamples"><b>sie liebt ihren Hund über 
alles</b></grace> she loves her dog more than anything else;<br> <grace class="BoldExamples"><b>zu allem fähig sein</b></grace> (<i>fig.</i>) be capable of anything;<br> <grace class="BoldExamples"><b>alles schon mal da gewesen</b></grace> (<i>ugs.</i>) it's all happened before;<br> <grace class="BoldExamples"><b>das kenne ich alles schon</b></grace> I've heard it all before;<br> <grace class="BoldExamples"><b>alles in allem</b></grace> all in all;<br> <grace class="BoldExamples"><b>vor allem</b></grace> above all;<br> <grace class="BoldExamples"><b>alles klar </b><i>od.</i> <b>in Ordnung</b></grace> (<i>ugs.</i>) everything's fine <i>or</i> (<i>coll.</i>) OK;<br> <grace class="BoldExamples"><b>alles klar?</b></grace> everything all right <i>or</i> (<i>coll.</i>) OK?;<br> <grace class="BoldExamples"><b>dann treffen wir uns um 5<sup>00</sup> Uhr, alles klar?</b></grace> we'll meet at 5 o'clock then, all right <i>or</i> (<i>coll.</i>) OK?;<br> <grace class="BoldExamples"><b>das ist alles</b></grace> that's all <i>or</i> (<i>coll.</i>) it;<br> <grace class="BoldExamples"><b>ist das alles?</b></grace> is that all <i>or</i> (<i>coll.</i>) it?;<br> <grace class="BoldExamples"><b>nach allem, was man hört/weiß</b></grace> to judge from everything <i>or</i> all one hears/knows; </grace><br><grace class="LetterArticle"><span class="Letter">b) </span>(<i>jeder einzelne</i>) everyone; </grace><grace class="AllExamples"><grace class="BoldExamples"><b>alle miteinander</b></grace> all together;<br> <grace class="BoldExamples"><b>ihr seid/wir sind/sie sind ..., alle miteinander</b></grace> you/we/they are ..., all of you/us/them;<br> <grace class="BoldExamples"><b>alle auf einmal</b></grace> all at once;<br> <grace class="BoldExamples"><b>sprecht nicht alle auf einmal!</b></grace> don't all speak at once;<br> <grace class="BoldExamples"><b>am besten, wir gehen alle auf einmal zum Chef</b></grace> the best thing would be for us all to go and see the boss together;<br> <grace 
class="BoldExamples"><b>alle, die ...</b></grace> all those who ...;<br> <grace class="BoldExamples"><b>der Kampf aller gegen alle</b></grace> unfettered competition;<br> <grace class="BoldExamples"><b>in allem einverstanden sein</b></grace> agree <i>or</i> be agreed on everything;<br> <grace class="BoldExamples"><b>von allem etwas nehmen</b></grace> take a bit of everything;<br> <grace class="BoldExamples"><b>er ist bei allem, was er tut, sehr genau</b></grace> he is very precise in everything he does;<br> <grace class="BoldExamples"><b>sie ist in allem sehr empfindlich</b></grace> she is very sensitive about everything; </grace><br><grace class="LetterArticle"><span class="Letter">c) </span>(<i>Neutr. Sg.: alle Beteiligten</i>) </grace><grace class="AllExamples"><grace class="BoldExamples"><b>alles mal herhören!</b></grace> (<i>ugs.</i>) listen everybody!; (<i>stärker befehlend</i>) everybody listen!;<br> <grace class="BoldExamples"><b>alles war nach Hause gegangen</b></grace> (<i>ugs.</i>) everyone <i>or</i> everybody had gone home;<br> <grace class="BoldExamples"><b>alles aussteigen!</b></grace> (<i>ugs.</i>) everyone <i>or</i> all out!; (<i>vom Schaffner gesagt</i>) all change!</grace><br>
</>

a, A
<link rel="stylesheet" type="text/css" href="stylesheet.css"><font size="-2">Duden-Oxford Deutsch-Englisch</font><br><grace class="SglMngArticle"><span class="WordHead"><b>a, A</b></span> <grace class="IPA">/a:/</grace> <i>das;</i> <b>a/A, a/A</b> </grace><br><br><grace class="LetterArticle"><span class="Letter">a) </span>(<i>Buchstabe</i>) a/A; </grace><grace class="AllExamples"><b>kleines a</b> small a;<br> <b>großes A</b> capital A;<br> <b>das A und O</b> (<i>fig.</i>) the essential thing/things (<i>Gen.</i> for);<br> <b>von A bis Z</b> (<i>fig. ugs.</i>) from beginning to end;<br> <b>wer A sagt, muss auch B sagen</b> (<i>fig.</i>) if one starts a thing, one must go through with it; </grace><br><grace class="LetterArticle"><span class="Letter">b) </span>(<i>Musik</i>) [key of] A</grace><br>
</>

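The whole story:

I am building an HTML-based dictionary in Python with BeautifulSoup and regular expressions. The dictionary is mostly structured like this:

Headword | IPA

Article 1
... Article A
...... All examples (e.g. German examples with English explanations)
...... <b>German example</b>
...... English explanation;
...... <b>German example</b>
...... English <i>explanation; </i>
...... and so on...
... Article B
...... All examples
...... and so on...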

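To lay out all the elements with CSS, I have to assign a CSS class to each of them (articles, examples, ...). I used to do all of this with regex find-and-replace in a plain notepad environment. Everything worked, except that I need to process the text block by block, i.e. I don't want a regex to affect anything beyond the part it is working on. Take the AllExamples elements: I first give each one an overall class AllExamples, then give the German examples and the English explanations different classes, and add a <br> after the semicolon that ends each English explanation. This is not easy, because:

  1. A plain notepad environment with a single regex find-and-replace cannot do this. In EditPad Pro, I can match the whole AllExamples class with one regex, then use a second regex inside the matched selection to replace ; with ;<br>. That is fine when there are only a few instances, but the whole dictionary needs one-click batch processing.

    I have to match the region first because there are many equivalent patterns outside it that I don't want to touch.

  2. The structure has exceptions. Note the second English explanation, which ends with an i tag; this is where my regex for adding <br> after ; fails. In that case I have to replace ;</i> <b> with ;</i><br> <b>. Again, because of the equivalent patterns outside the region, the whole AllExamples class should be matched first.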

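So BeautifulSoup is the solution: I can easily match the region with it and feed the region to a simple .replace(). The problem is that BeautifulSoup treats tags and strings as completely different things. In my case, however, the tags </i> <b> need to be matched together with ;, which is a string.

So I convert the tags and strings into one mixed string with str(), then do the find-and-replace just as I would in a notepad environment. (I know some of you could write a sophisticated Python function to do this, but that seems hard to me.)

Then I hand the result back to BeautifulSoup with the .replace_with() function, as in the related thread I cite in this post. But when I do this, all the angle brackets turn into entities such as &gt; in the printed result. What can I do to fix this?

Related thread:
Python - Find text using beautifulSoup then replace in original soup variable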

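2 Answers:

Answer 0: (score: 2)

Your mistake is treating HTML tags as text here. You serialize the BeautifulSoup object tree to an HTML string, operate on that string, and then hand BeautifulSoup the result as a new text element. Text elements (NavigableString) are not tags, and anything HTML-like inside them is escaped on output. You would have to deserialize the text back into an HTML structure.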
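The escaping described above can be reproduced on a small scale. The sketch below uses a trimmed, made-up fragment in the style of dicttest.txt (the variable names `html`, `region`, and `replaced` are illustrative, not from the question) and shows that a plain string passed to .replace_with() ends up as a text node:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical, trimmed fragment in the style of dicttest.txt.
html = '<grace class="AllExamples"><i>etc. ...;</i> <b>alles andere</b></grace>'
soup = BeautifulSoup(html, "html.parser")
region = soup.find("grace", "AllExamples")

# str() serializes the tag; the result is an ordinary Python string.
replaced = str(region).replace(";</i> <b>", ";</i><br> <b>")

# A plain string handed to replace_with() is inserted as a text node,
# so its angle brackets are escaped when the tree is serialized again.
region.replace_with(replaced)

output = str(soup)
print(output)
print(type(soup.contents[0]).__name__)
```

After the replacement the tree holds a single text node and no tags at all, which is why searching the soup for `br` finds nothing.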

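The "proper" way to handle this is to insert new tags at the right places. Your text replacement implies these rules:

  1. Inside a <grace class="AllExamples"> tag, find any <i> element whose text ends with ; and which is followed by a <b> tag.
  2. For each such element, insert a <br/> after it.

I would simply search for <i> tags inside <grace class="AllExamples"> tags, then filter. Once a match is found, use Tag.insert_after() to add a new <br/> tag: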

for emphasis in parse_1.select('grace.AllExamples i'):
    # must have text that ends in ;
    if emphasis.string is None or not emphasis.string.endswith(';'):
        continue
    # must have a bold tag next
    next_tag = emphasis.find_next_sibling()
    if not next_tag or next_tag.name != 'b':
        continue
    # match confirmed, insert a break tag
    emphasis.insert_after(parse_1.new_tag('br'))

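You could also fold the text check and the next-sibling check into a generator function, or into a function that tests each element in a .find_all() operation, but the code above is probably fine as it is. If you need to make related replacements, consider wrapping it in a function.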
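One way the function-based variant might look is sketched here. The helper names `is_break_point` and `add_breaks` and the sample markup are assumptions made for illustration; they are not part of the original answer. The sketch relies on .find_all() accepting a function that receives each tag and returns True or False:

```python
from bs4 import BeautifulSoup

def is_break_point(tag):
    """True for an <i> whose text ends in ';' and whose next sibling tag is <b>."""
    if tag.name != "i":
        return False
    if tag.string is None or not tag.string.endswith(";"):
        return False
    next_tag = tag.find_next_sibling()
    return next_tag is not None and next_tag.name == "b"

def add_breaks(soup):
    """Insert <br/> after each matching <i> inside grace.AllExamples; return the count."""
    count = 0
    for region in soup.find_all("grace", "AllExamples"):
        for emphasis in region.find_all(is_break_point):
            emphasis.insert_after(soup.new_tag("br"))
            count += 1
    return count

# Hypothetical sample: one match inside the region, one look-alike outside it.
sample = (
    '<grace class="AllExamples"><grace class="BoldExamples">'
    '<b>all unser/mein </b><i>usw.</i> <b>...</b> all our/my <i>etc. ...;</i> '
    '<b>alles andere</b></grace> everything else;<br></grace>'
    '<i>outside;</i> <b>untouched</b>'
)
soup = BeautifulSoup(sample, "html.parser")
inserted = add_breaks(soup)
result = str(soup)
print(inserted)
print(result)
```

Keeping the predicate separate from the insertion step makes it easier to add related rules later, since each rule becomes its own small function.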

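In short, don't treat HTML as one big piece of text; treat it as a directed tree of nodes, where each node is either a tag or a text element. Navigate that tree with BeautifulSoup, and once you are in the right place, manipulate the tree by adding or removing nodes as needed.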
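To make the tree view concrete, the following sketch walks the direct children of one AllExamples region and records whether each node is a tag or a text element. The fragment and the `kinds` list are illustrative assumptions:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical fragment with two examples from the same region.
html = (
    '<grace class="AllExamples"><b>vor allem</b> above all;<br> '
    '<b>trotz allem</b> in spite of everything;</grace>'
)
soup = BeautifulSoup(html, "html.parser")
region = soup.find("grace", "AllExamples")

# Classify each direct child as a tag node or a text node.
kinds = []
for node in region.children:
    if isinstance(node, Tag):
        kinds.append(("tag", node.name))
    elif isinstance(node, NavigableString):
        kinds.append(("text", str(node)))

for kind in kinds:
    print(kind)
```

The semicolons live in text nodes while `<b>`, `<i>`, and `<br>` are separate tag nodes, which is why a pattern such as `;</i> <b>` spans several nodes rather than one string.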

Answer 1: (score: 0)

Convert the string into tag elements by creating a new soup from it.

match = str(line).replace(";</i> <b>", ";</i><br> <b>")
newElement = BeautifulSoup(match, "html.parser")
line.replace_with(newElement)
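A self-contained version of this approach, run on a small made-up fragment, might look as follows. The sample markup is an assumption; the technique is the one from this answer: re-parse the modified string and pass the resulting soup to .replace_with().

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed fragment in the style of dicttest.txt.
html = '<grace class="AllExamples"><i>etc. ...;</i> <b>alles andere</b></grace>'
soup = BeautifulSoup(html, "html.parser")

for line in soup.find_all("grace", "AllExamples"):
    match = str(line).replace(";</i> <b>", ";</i><br> <b>")
    # Parsing the string again turns it back into real tag nodes.
    line.replace_with(BeautifulSoup(match, "html.parser"))

output = str(soup)
print(output)
```

With html.parser the inserted break is serialized as `<br/>`, and it can be found again through the tree, for example with `soup.find("br")`.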