Question

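While parsing HTML with bs4, I ran into a problem breaking out of a for loop. I want to save lists that are separated by headings. The HTML looks roughly like the following, although the real page contains more markup between the tags I need: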

<h2>List One</h2>
<td class="title">
    <a title="Title One">This is Title One</a>
</td>
<td class="title">
    <a title="Title Two">This is Title Two</a>
</td>
<h2>List Two</h2>
<td class="title">
    <a title="Title Three">This is Title Three</a>
</td>
<td class="title">
    <a title="Title Four">This is Title Four</a>
</td>

I would like the printed result to look like this:

List One
This is Title One
This is Title Two
List Two
This is Title Three
This is Title Four

My script has gotten this far:

import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('some website')
soup = BeautifulSoup(html, "lxml")

quote1 = soup.h2
print quote1.text

quote2 = quote1.find_next_sibling('h2')
print quote2.text

for quotes in soup.findAll('h2'):
    if quotes.find(text=True) == quote2.text:
        break
    if quotes.find(text=True) == quote1.text:
        for anchor in soup.findAll('td', {'class':'title'}):
            print anchor.text
            print quotes.text

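I am trying to break out of the loop when "quote2" (List Two) is found. Instead, the script fetches the contents of every td and ignores the following h2 tags. How can I break out of the for loop at the next h2 tag?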

Answer 1

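In my opinion, the problem lies in your HTML syntax. According to {{3}}, mixing "td" and "h3" (or, more generally, any heading tag) is not legal. Besides, using a table to implement a list is most likely not good practice.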

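If you can modify the input file, the lists you seem to need can be implemented with "ul" and "li" tags (with the first 'li' in each 'ul' holding the heading). Or, if you need to use a table, just put the heading inside a "td" tag or, more cleanly, inside a "th":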

<table>
<tr>
    <th>Your title</th>
</tr>
<tr>
    <td>Your data</td>
</tr>
</table>
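
To illustrate why the restructured markup is easier to work with, here is a minimal sketch in Python 3. The table contents are made up from the titles in the question, and the built-in "html.parser" backend is an assumption chosen so the example has no extra dependencies. Because headings and data are now ordinary cells in one table, a single pass in document order yields the desired output and no loop needs to be broken:

```python
from bs4 import BeautifulSoup

# Hypothetical input: the question's data rewritten with th/td rows.
table_html = """
<table>
<tr><th>List One</th></tr>
<tr><td class="title"><a title="Title One">This is Title One</a></td></tr>
<tr><td class="title"><a title="Title Two">This is Title Two</a></td></tr>
<tr><th>List Two</th></tr>
<tr><td class="title"><a title="Title Three">This is Title Three</a></td></tr>
<tr><td class="title"><a title="Title Four">This is Title Four</a></td></tr>
</table>
"""

soup = BeautifulSoup(table_html, "html.parser")

# find_all with a list of names returns matches in document order,
# so headings (th) and titles (td) come out already interleaved.
lines = [cell.get_text(strip=True) for cell in soup.find_all(["th", "td"])]

for line in lines:
    print(line)
```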

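If the input is not under your control, your script can run a search and replace over the input text to move the headings into table cells or list items.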
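
If rewriting the input is not an option, the loop can also be broken at the next h2 directly. The script in the question prints every td because soup.findAll('td', ...) searches the whole document, no matter which h2 the outer loop is currently on. The following is only a sketch, not the one true way: it is written for Python 3, uses the built-in "html.parser" backend instead of lxml, and the function name titles_by_heading is made up for illustration. It starts the inner search from each heading with find_all_next and stops as soon as another h2 shows up:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<h2>List One</h2>
<td class="title">
    <a title="Title One">This is Title One</a>
</td>
<td class="title">
    <a title="Title Two">This is Title Two</a>
</td>
<h2>List Two</h2>
<td class="title">
    <a title="Title Three">This is Title Three</a>
</td>
<td class="title">
    <a title="Title Four">This is Title Four</a>
</td>
"""


def titles_by_heading(markup):
    """Return a list of (heading text, [title texts]) pairs."""
    soup = BeautifulSoup(markup, "html.parser")
    groups = []
    for heading in soup.find_all("h2"):
        titles = []
        # Walk forward from this heading only, in document order.
        for tag in heading.find_all_next(["h2", "td"]):
            if tag.name == "h2":
                break  # the next list starts here
            if "title" in tag.get("class", []):
                titles.append(tag.get_text(strip=True))
        groups.append((heading.get_text(strip=True), titles))
    return groups


for heading, titles in titles_by_heading(html):
    print(heading)
    for title in titles:
        print(title)
```

The same idea should carry over to the Python 2 script in the question by swapping the print calls for print statements. Other parsers may treat td tags outside a table differently, so the result with lxml or html5lib is worth checking on the real page.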

python BeautifulSoup4: break out of a loop when a tag is found
