Question

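For example, I am trying to scrape data from a website that contains a tag written as <a&#32;href="https://evisa.mfa.am/">. Please check out the website.

Is there any method in BeautifulSoup to extract data from non-HTML tags like this?

This is a snippet of the whole HTML page from the link above: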

<br/>2.&#32;Airlines&#32;must&#32;provide&#32;advance&#32;passenger&#32;information&#32;of&#32;scheduled&#32;arrival&#32;of&#32;nationals&#32;of&#32;Antigua&#32;and&#32;Barbuda&#32;and&#32;resident&#32;diplomats.&#32;<br/><br/><b>ARGENTINA</b>&#32;-&#32;published&#32;02.04.2020&#32;<br/>Passengers&#32;are&#32;not&#32;allowed&#32;to&#32;enter&#32;Argentina&#32;until&#32;12&#32;April&#32;2020.<br/><br/><b>ARMENIA</b>&#32;-&#32;published&#32;22.03.2020&#32;<br/>1.&#32;Nationals&#32;of&#32;China&#32;(People's&#32;Rep.)&#32;with&#32;a&#32;normal&#32;passport&#32;are&#32;no&#32;longer&#32;visa&#32;exempt.&#32;<br/>2.&#32;Nationals&#32;of&#32;Iran&#32;can&#32;no&#32;longer&#32;obtain&#32;a&#32;visa&#32;on&#32;arrival.&#32;They&#32;must&#32;obtain&#32;a&#32;visa&#32;or&#32;an&#32;e-visa&#32;prior&#32;to&#32;their&#32;arrival&#32;in&#32;Armenia.&#32;The&#32;e-visa&#32;can&#32;be&#32;obtained&#32;at&#32;<a&#32;href="https://evisa.mfa.am/">https://evisa.mfa.am/</a>&#32;<br/>3.&#32;Passengers&#32;who&#32;have&#32;been&#32;in&#32;Austria,&#32;Belgium,&#32;China&#32;(People's&#32;Rep.),&#32;Denmark,&#32;France,&#32;Germany,&#32;Iran,&#32;Italy,&#32;Japan,&#32;Korea&#32;(Rep.),&#32;Netherlands,&#32;Norway,&#32;Spain,&#32;Sweden,&#32;Switzerland&#32;or&#32;United&#32;Kingdom&#32;in&#32;the&#32;past&#32;14&#32;days&#32;are&#32;not&#32;allowed&#32;to&#32;enter&#32;Armenia.<br/>-&#32;This&#32;does&#32;not&#32;apply&#32;to&#32;nationals&#32;or&#32;residents&#32;of&#32;Armenia.<br/>-&#32;This&#32;does&#32;not&#32;apply&#32;to&#32;spouses&#32;or&#32;children&#32;of&#32;nationals&#32;of&#32;Armenia.<br/>-&#32;This&#32;does&#32;not&#32;apply&#32;to&#32;employees&#32;of&#32;foreign&#32;diplomatic&#32;missions&#32;and&#32;consular&#32;institutions.<br/>-&#32;This&#32;does&#32;not&#32;apply&#32;to&#32;representations&#32;of&#32;official&#32;international&#32;missions&#32;or&#32;organizations.<br/>4.&#32;Nationals&#32;of&#32;Armenia&#32;who&#32;have&#32;been&#32;in&#32;Austria,&#32;Belgium,&#3
2;China&#32;(People's&#32;Rep.),&#32;Denmark,&#32;France,&#32;Germany,&#32;Iran,&#32;Italy,&#32;Japan,&#32;Korea&#32;(Rep.),&#32;Netherlands,&#32;Norway,&#32;Spain,&#32;Sweden,&#32;Switzerland&#32;or&#32;United&#32;Kingdom&#32;in&#32;the&#32;past&#32;14&#32;days&#32;must&#32;undergo&#32;14-days&#32;of&#32;quarantine&#32;or&#32;self-isolation&#32;regime.

Answer 1

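These are called AMP (ampersand) characters, that is, HTML character references: &#32; is the numeric reference for code point 32, an ordinary space.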
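To see what the reference stands for, you can decode it with Python's standard html module. This is only a minimal sketch: the sample string is shortened from the question's snippet, and the variable names are illustrative.

```python
import html

# "&#32;" is a numeric character reference; decimal 32 is the ASCII space.
encoded = "Passengers&#32;are&#32;not&#32;allowed&#32;to&#32;enter&#32;Argentina"

# html.unescape converts character references back into the characters they name.
decoded = html.unescape(encoded)

print(decoded)
```

This only decodes text. It does not by itself repair a tag such as <a&#32;href=...>, because the reference sits where the parser expects real whitespace.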

Don't use html.parser. Just use a more robust parser such as lxml or html5lib:

from bs4 import BeautifulSoup
import requests

r = requests.get(
    "https://www.iatatravelcentre.com/international-travel-document-news/1580226297.htm")
soup = BeautifulSoup(r.content, 'html5lib')


print(soup.prettify())

Answer 2

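If you fetch the page with requests, you can strip the faulty part of the markup and then pass the result to BeautifulSoup.

Below, I replace &#32; with a plain space, since it is just the HTML representation of a space.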

import requests
url = 'https://www.iatatravelcentre.com/international-travel-document-news/1580226297.htm'

response = requests.get(url)
content = response.text.replace('&#32;',' ')

from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')

Now you can work with the page in BeautifulSoup as usual.
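The effect of the replacement can be checked offline on a fragment of the question's snippet. The sketch below uses the standard library's HTMLParser as a stand-in for BeautifulSoup so that it runs without extra packages. LinkCollector and extract_links are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Fragment taken from the question's snippet (shortened).
snippet = (
    'The&#32;e-visa&#32;can&#32;be&#32;obtained&#32;at&#32;'
    '<a&#32;href="https://evisa.mfa.am/">https://evisa.mfa.am/</a>&#32;<br/>'
)


def extract_links(markup):
    # Turn every "&#32;" into a real space so "<a&#32;href=...>" becomes "<a href=...>".
    cleaned = markup.replace("&#32;", " ")
    collector = LinkCollector()
    collector.feed(cleaned)
    collector.close()
    return collector.links


print(extract_links(snippet))
```

Once the markup is cleaned this way, the same string can be handed to BeautifulSoup and the link shows up as a normal a tag.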

Answer 3

You should analyze the HTML code before posting a question.

Now try this to get your URL:

from bs4 import BeautifulSoup

with open("test.html","r") as f:
    page = f.read()
    soup = BeautifulSoup(page, 'html.parser')
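    # Search for the malformed tag by the literal name it has in the raw markup.
    # find_all is the current spelling of the older findAll alias.
    url = soup.find_all("a&#32;href=\"https:")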
    print(url)
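If you want to cross-check what that lookup should return without relying on how a particular parser names the broken tag, a regular expression over the raw text is one option. This is a different technique from the answer above, shown only as a rough sketch. HREF_PATTERN and find_hrefs are hypothetical names, and the sample string is shortened from the question's snippet.

```python
import re

# Fragment of the raw page text from the question (shortened).
raw = (
    'obtained&#32;at&#32;<a&#32;href="https://evisa.mfa.am/">'
    'https://evisa.mfa.am/</a>&#32;<br/>'
)

# Accept either real whitespace or the literal "&#32;" between "<a" and "href".
HREF_PATTERN = re.compile(r'<a(?:\s|&#32;)+href="([^"]*)"')


def find_hrefs(text):
    # Return every href value found in an opening <a ...> tag.
    return HREF_PATTERN.findall(text)


print(find_hrefs(raw))
```

Regular expressions are brittle on arbitrary HTML, so treat this as a quick check for this specific page rather than a general replacement for a parser.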

How to scrape non-HTML tags using BeautifulSoup
