Question

On the following page:

https://en.wikipedia.org/wiki/America

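I need to scrape the content of the h2, h3 and p tags. However, I want to ignore the following headings and the content under them:

"See also"
"Notes"
"References"
Also ignore all tables/URLs

How can I achieve this in Beautiful Soup? My current code is below: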

    from bs4 import BeautifulSoup

    # directory_of_raw_documents is defined elsewhere in my script
    def open_document():
        for i in range(1, 1 + 1):
            with open(directory_of_raw_documents + str(i), "r") as document:
                html = document.read()
                soup = BeautifulSoup(html, "html.parser")
                body = soup.find('div', id='bodyContent')
                results = ""
                for item in body.find_all(['h2', 'h3', 'p']):
                    results += item.get_text() + "\n"
                    results = results.replace("[edit]", "")
                print(results)

    open_document()

The output I want has no content from any table, or from the See also, Notes or References sections. I would rather not use the Wikipedia module in Python 2.7.

Answer 1

    soup.find(something)

searches the entire document. If you want to ignore some of the content, you need to narrow the scope first. In this case you can use:

    soup.find(id='bodyContent')  # narrows the scope to the main content

Then you can call find_all on that element:

    soup.find(id='bodyContent').find_all(name=['h2', 'h3', 'p'], href=False)
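Narrowing the scope alone does not drop the unwanted sections or the tables, so here is a rough sketch of one way to do that on top of the same find_all call. It is not the only approach. It assumes the section titles to skip appear as h2 headings, and the names SAMPLE_HTML, SKIPPED_SECTIONS and extract_text are made up for this example. SAMPLE_HTML is a small hand-written stand-in for a saved page, not real Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for a saved Wikipedia page.
SAMPLE_HTML = """
<div id="bodyContent">
  <p>Intro paragraph.</p>
  <h2>History[edit]</h2>
  <p>History text.</p>
  <table><tr><td><p>Table text.</p></td></tr></table>
  <h3>Early period[edit]</h3>
  <p>Early text.</p>
  <h2>See also[edit]</h2>
  <p>Related link list.</p>
  <h2>Notes</h2>
  <p>Note text.</p>
  <h2>Geography</h2>
  <p>Geography text.</p>
  <h2>References</h2>
  <p>Reference text.</p>
</div>
"""

# Assumption: these titles are h2 headings on the page.
SKIPPED_SECTIONS = {"See also", "Notes", "References"}


def extract_text(html):
    """Return h2/h3/p text from #bodyContent, minus tables and skipped sections."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find(id="bodyContent")
    if body is None:
        return ""

    lines = []
    skipping = False
    for item in body.find_all(["h2", "h3", "p"]):
        # Ignore anything nested inside a table.
        if item.find_parent("table") is not None:
            continue
        text = item.get_text().replace("[edit]", "").strip()
        # Each h2 starts a new section; decide whether to skip it.
        if item.name == "h2":
            skipping = text in SKIPPED_SECTIONS
        if not skipping and text:
            lines.append(text)
    return "\n".join(lines)


if __name__ == "__main__":
    print(extract_text(SAMPLE_HTML))
```

Because find_all returns the tags in document order, a single flag is enough: it is switched on when a skipped h2 is reached and switched off at the next h2 that is not in the set. If one of the unwanted titles turns out to be an h3 on your page, the check on item.name would need to be extended.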

Ignoring content below certain IDs in Wikipedia using Beautiful Soup