Question

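I am unable to scrape information from a government travel advice website for a research project I am doing in Python.

I picked the Turkey page, but the logic could extend to any country.

The site is "https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security"

The code I am using is: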

import requests
page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security")
page
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')
soup.find_all('p')[0].get_text()

At the moment this extracts all of the page's HTML. After inspecting the site, the information I am interested in is located in:

<div class="govuk-govspeak direction-ltr">
  <p>

Does anyone know how to amend the code above so that it extracts only that part of the HTML?

Thanks

Answer 1

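If you are only interested in the data inside the element with the classes govuk-govspeak direction-ltr, you can try the steps below.

Beautiful Soup supports the most commonly used CSS selectors. Just pass a string to the .select() method of a Tag object or of the BeautifulSoup object itself. Use . for a class and # for an id.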

data = soup.select('.govuk-govspeak.direction-ltr')

# extract h3 tags
h3_tags = data[0].select('h3')
print(h3_tags)
[<h3 id="local-travel---syrian-border">Local travel - Syrian border</h3>, <h3 id="local-travel--eastern-provinces">Local travel – eastern provinces</h3>, <h3 id="political-situation">Political situation</h3>,...]

# extract p tags
p_tags = data[0].select('p')
print(p_tags)
[<p>The FCO advise against all travel to within 10 ...]
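To try the selector without requesting the live page, here is a minimal offline sketch. SAMPLE_HTML is made up for illustration and only borrows the class names from the question; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the class names come from the question.
SAMPLE_HTML = """
<div class="govuk-govspeak direction-ltr">
  <h3 id="political-situation">Political situation</h3>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
<div class="other"><p>Unrelated paragraph.</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# ".a.b" (no space) matches a single element that carries both classes.
# ".a .b" (with a space) would instead look for a .b nested inside a .a.
blocks = soup.select(".govuk-govspeak.direction-ltr")

# A descendant selector collects the paragraphs inside that div in one call.
paragraph_texts = [
    p.get_text(strip=True)
    for p in soup.select(".govuk-govspeak.direction-ltr p")
]
print(paragraph_texts)
```

The paragraph in the second div is left out because it is not a descendant of the selected element.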

Answer 2

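You can find that specific <div>, then find the <p> tags under it and collect their text like this:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security")
soup = BeautifulSoup(page.content, 'html.parser')
div = soup.find("div", {"class": "govuk-govspeak direction-ltr"})

data = []
for i in div.find_all("p"):
    # drop non-ASCII characters, then decode back to str so join() works on Python 3
    data.append(i.get_text().encode("ascii", "ignore").decode("ascii"))
data = "\n".join(data)

Now data will contain the whole content, with the paragraphs separated by \n.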

Note: the code above only gives you the paragraph text; it does not include the section headings.

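If you want both the headings and the paragraph text, you can extract both <h3> and <p> like this:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security")
soup = BeautifulSoup(page.content, 'html.parser')
div = soup.find("div", {"class": "govuk-govspeak direction-ltr"})

data = []
for i in div:
    # only direct children are visited; text nodes have name None and are skipped
    if i.name == "h3":
        data.append(i.get_text().encode("ascii", "ignore").decode("ascii") + "\n\n")
    if i.name == "p":
        data.append(i.get_text().encode("ascii", "ignore").decode("ascii") + "\n")
data = "".join(data)

Now data will contain both headings and paragraphs, where each heading is followed by \n\n and each paragraph by \n.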
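If a structured result is more convenient than one long string, the same walk can group paragraphs under their heading. This is only a sketch: sections_from_div and SAMPLE_HTML are hypothetical names introduced here, and it assumes every paragraph of interest comes after an <h3> inside the div.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the class names come from the question.
SAMPLE_HTML = """
<div class="govuk-govspeak direction-ltr">
  <h3>Heading A</h3>
  <p>a1</p>
  <p>a2</p>
  <h3>Heading B</h3>
  <p>b1</p>
</div>
"""


def sections_from_div(div):
    """Map each <h3> heading text to the list of <p> texts that follow it."""
    sections = {}
    current = None  # paragraphs before the first heading are skipped
    for tag in div.find_all(["h3", "p"]):
        text = tag.get_text(strip=True)
        if tag.name == "h3":
            current = text
            sections[current] = []
        elif current is not None:
            sections[current].append(text)
    return sections


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
div = soup.find("div", class_="govuk-govspeak direction-ltr")
sections = sections_from_div(div)
print(sections)
```

Note that two headings with identical text would overwrite each other in the dictionary; a list of (heading, paragraphs) pairs avoids that if it matters.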

Webscraping a specific element of HTML

2 answers: