Question

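I am using BeautifulSoup4 to do some HTML scraping. I am trying to extract the important information, such as the title, metadata, paragraphs and list items.

My problem is that I can get the paragraphs like this: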

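import urllib.request

from bs4 import BeautifulSoup


def main():
    # Download the page and parse it with the built-in HTML parser
    response = urllib.request.urlopen('https://ecir2019.org/industry-day/')
    html = response.read()
    soup = BeautifulSoup(html, features="html.parser")
    # Collect the text of every <p> element
    text = [e.get_text() for e in soup.find_all('p')]
    article = '\n'.join(text)
    print(article)


main()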

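However, if the body of the page contains bullet points and I change p to li or ul to pick them up, the output also includes the navigation bar.

For example, the output I want is: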

The Industry Day's objectives are three-fold:

The first objective is to present the state of the art in search and search-related areas, delivered as keynote talks by influential technical leaders from the search industry.
The second objective of the Industry Day is the presentation of interesting, novel and innovative ideas related to information retrieval.
Finally, we are looking forward to a highly-interactive discussion involving both industry and academia.

What I actually get is: The Industry Day's objectives are three-fold:

The markup in the HTML source:

<p>The Industry Day's objectives are three-fold:</p>
<ol>
<li>The first objective is to present the state of the art in search and search-related areas, delivered as keynote talks by influential technical leaders from the search industry.</li>
<li>The second objective of the Industry Day is the presentation of interesting, novel and innovative ideas related to information retrieval.</li>
<li>Finally, we are looking forward to a highly-interactive discussion involving both industry and academia.</li>
</ol>

Answer 1

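You can use the CSS "or" selector syntax (a comma-separated selector list), so that the li elements are selected as well as the p elements.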

import requests
from bs4 import BeautifulSoup

url = 'https://ecir2019.org/industry-day/'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
items = [item.text for item in soup.select('p, ol li')]

print(items)
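
To see why this selector list leaves the navigation bar out, here is a minimal offline sketch. The <nav> block is invented for illustration (the real page's navigation markup is not shown in the question), and it assumes the navigation is built from ul rather than ol. The p and ol markup is the snippet from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <nav> block is an assumption made for this sketch;
# the <p>/<ol> markup is copied from the question.
HTML = """
<nav><ul><li>Home</li><li>Programme</li></ul></nav>
<p>The Industry Day's objectives are three-fold:</p>
<ol>
<li>The first objective is to present the state of the art in search and search-related areas, delivered as keynote talks by influential technical leaders from the search industry.</li>
<li>The second objective of the Industry Day is the presentation of interesting, novel and innovative ideas related to information retrieval.</li>
<li>Finally, we are looking forward to a highly-interactive discussion involving both industry and academia.</li>
</ol>
"""

soup = BeautifulSoup(HTML, "html.parser")

# "p, ol li" matches every <p> and every <li> inside an <ol>.
# The <li> elements inside the <nav>'s <ul> match neither branch.
items = [el.get_text(strip=True) for el in soup.select("p, ol li")]

article = "\n".join(items)
print(article)
```

If the navigation on a given site happens to use ol as well, this selector alone would not be enough, and scoping to a container element is the safer route.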

For just this section:

import requests
from bs4 import BeautifulSoup

url = 'https://ecir2019.org/industry-day/'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
items = [item.text for item in soup.select('.kg-card-markdown p:nth-of-type(2), .kg-card-markdown p:nth-of-type(2) + ol li')]

print(items)
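
The selector above combines p:nth-of-type(2) (the second p among its siblings) with the adjacent sibling combinator + (the ol that immediately follows that p). A small sketch of how the two parts behave, using made-up markup inside a .kg-card-markdown container; everything except the class name and the "three-fold" sentence is hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented to illustrate the selector.
HTML = """
<div class="kg-card-markdown">
<p>Intro paragraph (hypothetical).</p>
<p>The Industry Day's objectives are three-fold:</p>
<ol><li>First objective</li><li>Second objective</li></ol>
<p>Closing paragraph (hypothetical).</p>
<ul><li>Unrelated bullet</li></ul>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

selector = (
    ".kg-card-markdown p:nth-of-type(2), "           # the second <p> only
    ".kg-card-markdown p:nth-of-type(2) + ol li"     # <li>s of the <ol> right after it
)
items = [el.get_text(strip=True) for el in soup.select(selector)]
print(items)
```

Positional selectors like this are tied to the page's current layout, so they tend to break when paragraphs are added or reordered.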

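The page seems to have changed, so I am using a cached version (this will only work until the cache is updated). You can use an additional class selector to restrict the match to the post body: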

import requests
from bs4 import BeautifulSoup

url = 'http://webcache.googleusercontent.com/search?q=cache:https://ecir2019.org/industry-day'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
items = [item.text for item in soup.select('.post-body p, .post-body ol li, .post-body ul li')]

print(items)
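
The same scoping idea can be wrapped in a small helper that first locates the container and then selects inside it. This is only a sketch: extract_body_text and its container_selector parameter are hypothetical names, the sample markup is invented, and .post-body is assumed to be the class wrapping the article text, as in the selector above.

```python
from bs4 import BeautifulSoup


def extract_body_text(html, container_selector=".post-body"):
    """Return the text of paragraphs and list items inside one container.

    Hypothetical helper: container_selector is assumed to identify the
    element that wraps the article body.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one(container_selector)
    if container is None:
        # Container not found (for example, the page layout changed)
        return []
    # Search only inside the container, so navigation lists elsewhere
    # on the page are never considered.
    return [el.get_text(strip=True) for el in container.select("p, ol li, ul li")]


# Invented sample page with a navigation list outside the post body
SAMPLE = """
<nav><ul><li>Home</li><li>Programme</li></ul></nav>
<div class="post-body">
<p>The Industry Day's objectives are three-fold:</p>
<ol><li>First objective</li><li>Second objective</li></ol>
<ul><li>A bullet inside the body</li></ul>
</div>
"""

result = extract_body_text(SAMPLE)
print("\n".join(result))
```

Returning an empty list when the container is missing makes a layout change easy to detect, instead of silently falling back to scraping the whole page.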
