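For example, I am trying to scrape data from a website that contains the tag <a href="https://evisa.mfa.am/">;
please see this website.
Is there any way in BeautifulSoup to extract data that is not inside an HTML tag?
Here is a snippet of the full HTML page from the link above: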
<br/>2. Airlines must provide advance passenger information of scheduled arrival of nationals of Antigua and Barbuda and resident diplomats. <br/><br/><b>ARGENTINA</b> - published 02.04.2020 <br/>Passengers are not allowed to enter Argentina until 12 April 2020.<br/><br/><b>ARMENIA</b> - published 22.03.2020 <br/>1. Nationals of China (People's Rep.) with a normal passport are no longer visa exempt. <br/>2. Nationals of Iran can no longer obtain a visa on arrival. They must obtain a visa or an e-visa prior to their arrival in Armenia. The e-visa can be obtained at <a href="https://evisa.mfa.am/">https://evisa.mfa.am/</a> <br/>3. Passengers who have been in Austria, Belgium, China (People's Rep.), Denmark, France, Germany, Iran, Italy, Japan, Korea (Rep.), Netherlands, Norway, Spain, Sweden, Switzerland or United Kingdom in the past 14 days are not allowed to enter Armenia.<br/>- This does not apply to nationals or residents of Armenia.<br/>- This does not apply to spouses or children of nationals of Armenia.<br/>- This does not apply to employees of foreign diplomatic missions and consular institutions.<br/>- This does not apply to representations of official international missions or organizations.<br/>4. Nationals of Armenia who have been in Austria, Belgium, China (People's Rep.), Denmark, France, Germany, Iran, Italy, Japan, Korea (Rep.), Netherlands, Norway, Spain, Sweden, Switzerland or United Kingdom in the past 14 days must undergo 14-days of quarantine or self-isolation regime.
Answer 0 (score: 2)
This is called the AMP (`&amp;`) character; you can read up on what it is.
Do not use html.parser. Use a real parser instead, such as lxml or html5lib:
from bs4 import BeautifulSoup
import requests
r = requests.get(
"https://www.iatatravelcentre.com/international-travel-document-news/1580226297.htm")
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())
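Since lxml and html5lib are separate installs, one possible way to prefer them while still running on machines where they are missing is a small fallback helper. This is only a sketch and not part of the original answer: the name `make_soup` and the preference order are hypothetical, and it relies on `bs4.FeatureNotFound`, the exception BeautifulSoup raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("html5lib", "lxml", "html.parser")):
    """Build a soup with the first parser in `parsers` that is installed.

    Hypothetical helper: the preference order is an assumption, not a rule.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


soup = make_soup('The e-visa can be obtained at '
                 '<a href="https://evisa.mfa.am/">https://evisa.mfa.am/</a>')
print(soup.a["href"])
```

Because html.parser ships with Python, the loop always finds a parser in the end, although the resulting tree can differ between parsers for malformed markup.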
Answer 1 (score: 1)
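If you fetch the web page with requests and remove the problematic part of the markup, you can then pass the result to BeautifulSoup.
Below, I replace `&nbsp;`, since it is just the HTML representation of a space.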
import requests
from bs4 import BeautifulSoup

url = 'https://www.iatatravelcentre.com/international-travel-document-news/1580226297.htm'
response = requests.get(url)
# Replace the &nbsp; entity with a plain space before parsing
content = response.text.replace('&nbsp;', ' ')
soup = BeautifulSoup(content, 'html.parser')
Now you can work with the page through BeautifulSoup as usual.
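To illustrate why the `&nbsp;` entity counts as "just a space": it decodes to the non-breaking space character U+00A0. The sketch below uses only the standard library (`html.unescape`); the sample string is made up for illustration and is not taken from the page.

```python
import html

# Made-up sample text containing the entity
raw = "Passengers&nbsp;are not allowed to enter."

# html.unescape turns '&nbsp;' into the single character U+00A0
decoded = html.unescape(raw)

# Swap the non-breaking space for an ordinary space
normalized = decoded.replace("\xa0", " ")

print(repr(decoded))
print(normalized)
```

Replacing U+00A0 after decoding is an alternative to replacing the entity in the raw markup; which one is more convenient depends on whether you hold the raw text or the already parsed strings.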
Answer 2 (score: 0)
You need to analyze the HTML code before posting a question.
Now try to get your URL:
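from bs4 import BeautifulSoup

# Read a locally saved copy of the page
with open("test.html", "r") as f:
    page = f.read()
soup = BeautifulSoup(page, 'html.parser')
# Match <a> tags whose href attribute starts with "https:"
urls = soup.find_all("a", href=lambda h: h and h.startswith("https:"))
print(urls)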
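The text in the question's snippet sits between `<b>` headings and `<br/>` tags rather than inside a tag of its own, so one way to reach it is to walk the siblings that follow a heading. The following is a rough sketch under stated assumptions: `SNIPPET` is an abridged copy of the markup from the question, and `section` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Abridged copy of the snippet from the question (assumption: same layout)
SNIPPET = (
    "<b>ARGENTINA</b> - published 02.04.2020 <br/>"
    "Passengers are not allowed to enter Argentina until 12 April 2020."
    "<br/><br/>"
    "<b>ARMENIA</b> - published 22.03.2020 <br/>"
    "2. Nationals of Iran can no longer obtain a visa on arrival. "
    'The e-visa can be obtained at <a href="https://evisa.mfa.am/">'
    "https://evisa.mfa.am/</a> <br/>"
)


def section(soup, country):
    """Collect loose text and link targets that follow a <b> heading."""
    texts, links = [], []
    heading = soup.find("b", string=country)
    if heading is None:
        return texts, links
    for node in heading.next_siblings:
        if isinstance(node, Tag):
            if node.name == "b":
                break  # the next country's heading ends this section
            if node.name == "a" and node.has_attr("href"):
                links.append(node["href"])
        elif isinstance(node, NavigableString) and node.strip():
            texts.append(node.strip())
    return texts, links


soup = BeautifulSoup(SNIPPET, "html.parser")
texts, links = section(soup, "ARMENIA")
print(texts)
print(links)
```

The helper assumes every country name appears as the exact text of a `<b>` tag and that all nodes of a section are siblings of that tag, which holds for the snippet but may not hold for the whole page.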