Question

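This is a snippet of a "real world" HTML file that I am trying to parse with BeautifulSoup4 (Python 3) using the xml parser (the other parsers don't cope with the kind of dirty HTML files I'm working with):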

<html>
    <p> Hello </p>
    <a name='One'>Item One</a>
    <p> Text that I would like to scrape. </p>
    <p> More text I would like to scrape.
        <table>
            <tr>
                <td>
                    <a name='Two'>Item Two</a>
                </td>
            </tr>
        </table>
        A bunch of text that shouldn't be scraped.
        More text.
        And more text.
    </p>
</html>

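My goal is to scrape all the text between <a name='One'>Item One</a> and <a name='Two'>Item Two</a>, without picking up the 3 lines of text at the end of the last <p>.

I tried walking forward from the first <a> tag with find_next() and calling get_text() on each tag, but when I reach the last <p>, the text at its end gets scraped as well, which is not what I want.

Sample code: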

tag_one = soup.find('a', {'name': 'One'})
tag_two = soup.find('a', {'name': 'Two'})
found = False
tag = tag_one
while found == False:
    tag = tag.find_next()
    if tag == tag_two:
        found = True
    print(tag.get_text())

Any ideas on how to solve this?

Answer 1

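You can use the find_all_next method to iterate over the tags that follow, and the strings generator (stripped_strings here) to get the list of strings for each tag.

from bs4 import BeautifulSoup

# `html` is assumed to hold the markup from the question
soup = BeautifulSoup(html, 'xml')
tag_one = soup.find('a', {'name': 'One'})
tag_two = soup.find('a', {'name': 'Two'})
text = None

for tag in tag_one.find_all_next():
    if tag is tag_two:
        break
    strings = list(tag.stripped_strings)
    # Print only the first string of each tag, skipping repeats
    # produced by nested tags that start with the same text
    if strings and strings[0] != text:
        text = strings[0]
        print(text)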
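
To illustrate why taking only the first stripped string helps, here is a minimal sketch run against the snippet from the question. It assumes the standard-library html.parser builds the same tree for this small snippet, so it runs without lxml; pass 'xml' instead for the asker's real files. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

HTML = """<html>
    <p> Hello </p>
    <a name='One'>Item One</a>
    <p> Text that I would like to scrape. </p>
    <p> More text I would like to scrape.
        <table>
            <tr>
                <td>
                    <a name='Two'>Item Two</a>
                </td>
            </tr>
        </table>
        A bunch of text that shouldn't be scraped.
        More text.
        And more text.
    </p>
</html>"""

# Assumption: html.parser is used here only so the sketch needs no lxml.
soup = BeautifulSoup(HTML, "html.parser")
last_p = soup.find_all("p")[2]

# get_text() returns every string nested in the tag, including the
# text that comes after the table.
full_text = last_p.get_text()

# stripped_strings yields the strings one by one, so the leading one
# can be taken on its own.
first_string = next(last_p.stripped_strings)

# The same loop as in the answer, collecting instead of printing.
tag_one = soup.find("a", {"name": "One"})
tag_two = soup.find("a", {"name": "Two"})
collected, last = [], None
for tag in tag_one.find_all_next():
    if tag is tag_two:
        break
    strings = list(tag.stripped_strings)
    if strings and strings[0] != last:
        last = strings[0]
        collected.append(last)

print(first_string)
print(collected)
```

Note that on this snippet the loop also collects "Item Two": the <table> is visited before tag_two, and its first string is the anchor's text. Drop the last item if that is not wanted.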

Answer 2

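I propose a more robust approach:

import bs4
from bs4 import BeautifulSoup

# `html` is assumed to hold the markup from the question
soup = BeautifulSoup(html, 'xml')
tag_one = soup.find('a', {'name': 'One'})
tag_two = soup.find('a', {'name': 'Two'})

# next_elements walks the document in order, yielding tags and strings
for tag in tag_one.next_elements:
    # Anything that is not a Tag is a text node, so print it
    if not isinstance(tag, bs4.element.Tag):
        print(tag)
    if tag is tag_two:
        break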
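
The next_elements walk can be wrapped in a small helper that returns the strings as a list. This is a sketch under a few assumptions: text_between is a hypothetical name, whitespace-only strings and the start tag's own text ("Item One") are skipped, and html.parser is used only so the example runs without lxml (swap in 'xml' for the asker's files).

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

HTML = """<html>
    <p> Hello </p>
    <a name='One'>Item One</a>
    <p> Text that I would like to scrape. </p>
    <p> More text I would like to scrape.
        <table>
            <tr>
                <td>
                    <a name='Two'>Item Two</a>
                </td>
            </tr>
        </table>
        A bunch of text that shouldn't be scraped.
        More text.
        And more text.
    </p>
</html>"""


def text_between(start, end):
    """Hypothetical helper: stripped text nodes after `start`, before `end`."""
    texts = []
    for element in start.next_elements:
        if element is end:
            break
        # Only text nodes carry text; tags are just walked past.
        if not isinstance(element, NavigableString):
            continue
        # Skip strings that live inside the start tag itself.
        if any(parent is start for parent in element.parents):
            continue
        stripped = element.strip()
        if stripped:
            texts.append(stripped)
    return texts


# Assumption: html.parser is used here only so the sketch needs no lxml.
soup = BeautifulSoup(HTML, "html.parser")
tag_one = soup.find("a", {"name": "One"})
tag_two = soup.find("a", {"name": "Two"})
texts = text_between(tag_one, tag_two)
print(texts)
```

If end never appears after start, the helper simply runs to the end of the document, so it may be worth checking that both tags were found before calling it.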

Retrieving text between 2 tags at different levels using BeautifulSoup4
