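I am using the Python requests library to fetch the HTML of a web page, and I then need to extract some information from that HTML. For some reason I am not getting the data. How can I get it?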
<span data-testid="vuln-cvssv2-additional">
Victim must voluntarily interact with attack mechanism
<br/>
Allows unauthorized disclosure of information
<br/>
Allows unauthorized modification
<br/>
</span>
import requests
import re
link = "https://nvd.nist.gov/vuln/detail/CVE-2017-10119"
f = requests.get(link)
deneme = str(f.text)
re_base_vector = r'\<span data-testid\s*\=\s*\"vuln-cvssv2- additional"\s*\>(.*?(\n))+.*?\n\<\\span\>'
find_base_vector = re.search(re_base_vector, deneme)
print(find_base_vector)
print(find_base_vector.group(0))
Expected output:
Victim must voluntarily interact with attack mechanism.
Allows unauthorized disclosure of information.
Allows unauthorized modification
Answer 0 (score: 2)
Using regex on HTML is usually a bad idea. Use BeautifulSoup to read the markup with an HTML parser, then pick out the element with an attribute selector:
soup.select_one("span[data-testid='vuln-cvssv2-additional']")
For example:
import requests
from bs4 import BeautifulSoup
html='''
<span data-testid="vuln-cvssv2-additional">
Victim must voluntarily interact with attack mechanism
<br/>
Allows unauthorized disclosure of information
<br/>
Allows unauthorized modification
<br/>
</span>
'''
soup = BeautifulSoup(html, "lxml")
item = soup.select_one("span[data-testid='vuln-cvssv2-additional']").text
print(item)
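Note that `.text` returns the span's text as a single string, including the blank lines left behind by the `<br/>` tags. If you want one clean line per statement, as in the expected output from the question, a minimal sketch could look like the following. The helper name `extract_lines` is hypothetical, and the built-in `"html.parser"` is used here only so the sketch does not depend on lxml. The markup of the live page may differ from the snippet in the question, so the helper returns an empty list when the span is missing.

```python
from bs4 import BeautifulSoup

html = '''
<span data-testid="vuln-cvssv2-additional">
Victim must voluntarily interact with attack mechanism
<br/>
Allows unauthorized disclosure of information
<br/>
Allows unauthorized modification
<br/>
</span>
'''

def extract_lines(markup):
    """Return the text fragments inside the target span, one per list item.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    # "html.parser" ships with Python, so no extra parser needs installing.
    soup = BeautifulSoup(markup, "html.parser")
    span = soup.select_one("span[data-testid='vuln-cvssv2-additional']")
    if span is None:
        # The span is absent, e.g. because the page layout changed.
        return []
    # stripped_strings yields each text node with surrounding whitespace
    # removed and skips nodes that contain only whitespace.
    return list(span.stripped_strings)

lines = extract_lines(html)
print("\n".join(lines))
```

To run this against the real page, you would pass `requests.get(link).text` to `extract_lines` instead of the literal `html` string.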
Answer 1 (score: 0)
BeautifulSoup will help you parse and navigate the HTML much more easily. It handles the given HTML without trouble.