Question

I am working on Python code that parses HTML. The goal is to find the string on each line and change it as shown below:

Original: "Criar Alerta"

<li><a href="http://..." target="_blank">Criar Alerta</a></li>

Expected result: "Create alert"

<li><a href="http://..." target="_blank">Create alert</a></li>

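To make sure the generated HTML has the same structure as the original, I need to go through the file line by line, identify the string on each line, and replace it with its equivalent from a dictionary.

I saw here that BeautifulSoup can parse specific tags. I tried it, but I'm not sure about the results.

So my question is: given that BeautifulSoup works with tags, and each line has several tags, is it possible to parse line by line with BeautifulSoup?

Thanks in advance,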

Tiago

Answer 1

@Jack Fleeting

In the example below, I want to replace "Início" with "Start":

Original:

<li class="current"><a  style="color:#00233C;" href="index.html"><i class="icon icon-home"></i>  Início</a></li>

Expected result:

<li class="current"><a  style="color:#00233C;" href="index.html"><i class="icon icon-home"></i>  Start</a></li>

Example entries from the dictionary:

dict = {
    "Início": "Start",
    "Ajuda": "Help",
    "Criar Alerta": "Create Alert",
    "Materiais e Estruturas": "Structures and Materials",
    # ...
}

Below is the code I wrote to parse the HTML with BeautifulSoup. (I noticed that all the strings to be replaced are inside "a" tags, so I used SoupStrainer("a").)

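from bs4 import BeautifulSoup
from bs4 import SoupStrainer

# html_file holds the path to the HTML file being translated
with open(html_file, 'rb') as src:
    doc = src.read()  # the with block closes the file, so no close() call is needed

# Restrict parsing to the <a> tags
only_a_tags = SoupStrainer("a")
parse_1 = 'html.parser'
soup = BeautifulSoup(doc, parse_1, parse_only=only_a_tags)

print(soup.prettify())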

The original line is parsed and printed as follows:

<a href="index.html" style="color:#00233C;">
 <i class="icon icon-home">
 </i>
 Início
</a>

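Given the output above, I'm not sure whether I can get the expected result.

My goal is to find the string on each line, look up its equivalent in the dictionary, and perform the replacement.

Now I'd like to know how to perform this kind of string replacement with BeautifulSoup. After that, I'll write a "for" loop to replace all the lines in the HTML file.

My first attempt (before I learned about BeautifulSoup) was to process a .txt version of the HTML file, read as binary, which proved to be very time-consuming and unproductive.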

Answer 2

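I believe the following is what you're looking for.

Let's use three lines, two of which contain words from the dictionary and one that doesn't, just to test the code: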

rep = """
      <li class="current"><a  style="color:#00233C;" href="index.html"><i class="icon icon-home"></i>  Início</a></li>
      <li class="current"><a  style="color:#00233C;" href="index.html"><i class="icon icon-home"></i>  Nunca</a></li>
      <li class="current"><a  style="color:#00233C;" href="index.html"><i class="icon icon-home"></i>  Criar Alerta</a></li>
    """

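And use your dictionary (hint: naming a dictionary dict is never a good idea, because it shadows Python's built-in dict type; it's just asking for trouble...)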

rep_dict = {
    "Início": "Start",
    "Ajuda": "Help",
    "Criar Alerta": "Create Alert",
    "Materiais e Estruturas": "Structures and Materials",
}
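To illustrate the naming hint, here is a minimal sketch of what goes wrong when a variable is called dict. The variable shadow_error is a name made up for this example; the point is only that, after the assignment, the name dict no longer refers to the built-in type.

```python
# After this assignment, `dict` refers to the mapping,
# not to Python's built-in dict type.
dict = {"Início": "Start"}

try:
    dict(Ajuda="Help")  # would normally build a new dictionary
    shadow_error = None
except TypeError as exc:
    shadow_error = str(exc)

print(shadow_error)  # 'dict' object is not callable

# Remove the shadowing name so the built-in is visible again
del dict
```

A neutral name such as rep_dict avoids the problem entirely.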

Now for the code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(rep, 'lxml')

# Collect every <a> tag in the snippet
only_a_tags = soup.find_all('a')

for tag in only_a_tags:
    for word in rep_dict:
        # Print the tag with the word replaced if it contains a dictionary key
        if word in str(tag):
            print(str(tag).replace(word, rep_dict[word]))

Output:

<a href="index.html" style="color:#00233C;"><i class="icon icon-home"></i>  Start</a>
<a href="index.html" style="color:#00233C;"><i class="icon icon-home"></i>  Create Alert</a>

Since "Nunca" is not in rep_dict, the item containing "Nunca" is not printed.
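The loop above only prints the modified tags. If the goal is to get back the whole document with its structure intact, one option is to replace the text nodes inside the tree and then serialize the tree. The following is a rough sketch under a few assumptions: translate_links is a hypothetical helper name, the text to translate is the entire stripped content of a text node inside an "a" tag, and html.parser is used instead of lxml so the fragment is not wrapped in html and body tags.

```python
from bs4 import BeautifulSoup

rep = """
<li class="current"><a style="color:#00233C;" href="index.html"><i class="icon icon-home"></i>  Início</a></li>
<li class="current"><a style="color:#00233C;" href="index.html"><i class="icon icon-home"></i>  Nunca</a></li>
<li class="current"><a style="color:#00233C;" href="index.html"><i class="icon icon-home"></i>  Criar Alerta</a></li>
"""

rep_dict = {
    "Início": "Start",
    "Criar Alerta": "Create Alert",
}


def translate_links(html, mapping):
    """Return html with the text inside <a> tags replaced via mapping."""
    soup = BeautifulSoup(html, "html.parser")
    for a_tag in soup.find_all("a"):
        # string=True matches every text node below the tag
        for text_node in a_tag.find_all(string=True):
            key = text_node.strip()
            if key in mapping:
                # Keep the surrounding whitespace, swap only the text itself
                text_node.replace_with(text_node.replace(key, mapping[key]))
    return str(soup)


translated = translate_links(rep, rep_dict)
print(translated)
```

Lines whose text has no entry in the dictionary, such as the one containing "Nunca", pass through unchanged, so the result can be written back to a file in one step rather than line by line.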

HTML line-by-line parsing
