I have some HTML that came out of parsing with BeautifulSoup, and I want to extract the class values star0 and sa2 from the span shown below.
>>> short_comment[1]['name']
<div class="author">
<a href="/member/?id=59465221" target="_blank">唐牛</a>
<span class="star0 sa2"></span></div>
One thing I tried was a regex, star0\s[a-zA-Z0-9], but it returned nothing. Next I tried replacing < with > and splitting the string on that character:
>>> s = s.replace('<','>')
>>> s.split('>')
['', 'div class="author"', ' ', 'a href="/member/?id=59465221" target="_blank"', '唐牛', '/a', ' ', 'span class="star0 sa2"', '', '/span', '', '/div', '']
>>> s.find("star0")
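A note on the regex attempt above: the pattern as written matches only a single character after the whitespace, and the `re` functions need a plain string (for a bs4 element that would be `str(tag)`). The following is a minimal sketch, assuming the markup is held in a string `s` as in the snippet above; `rating_class` is a name chosen here purely for illustration.

```python
import re

# Markup copied from the question, held as a plain string.
s = '''<div class="author">
<a href="/member/?id=59465221" target="_blank">唐牛</a>
<span class="star0 sa2"></span></div>'''

# The trailing `+` lets the group capture the whole token after "star0",
# not just its first character.
match = re.search(r'star0\s+([a-zA-Z0-9]+)', s)
rating_class = match.group(1) if match else None
print(rating_class)  # sa2
```

Regexes on HTML tend to be brittle (attribute order and spacing can change), so this is mainly useful as a quick check.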
I also tried using BS4 to extract the class from the elements that match the "author" class.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0'}
base_url = 'https://www.nosetime.com'

def get_perfume_as_dict(url):
    print(base_url + url)
    response_unicode = requests.get(base_url + url, headers=headers)
    soup = BeautifulSoup(response_unicode.text, 'html.parser')
    perfume = {}
    # Pair each author block with its comment block, in document order.
    perfume["short_comment"] = [
        {"name": name.text,
         "rating": name.span['class'][1],
         "comment": comment.text}
        for name, comment in zip(
            soup.find_all('div', {'class': "author"}),
            soup.find_all('div', {'class': "hfshow1"}),
        )
    ]  # soup.find('li', {'id':'itemcomment'})  # soup.find_all('span ', {'class':'fav_cnt'})
    return perfume
But when I run it, it seems to get stuck in a loop:
get_perfume_as_dict("/xiangshui/350870-oulong-atelier-cologne-oolang-infini.html")
Answer 0 (score: 0)
Use BeautifulSoup to query your HTML. For example:
from bs4 import BeautifulSoup
short_comment = """<div class="author">
<a href="/member/?id=59465221" target="_blank">唐牛</a>
<span class="star0 sa2"></span></div>"""
soup = BeautifulSoup(short_comment, "html.parser")
print(soup.find("div", {'class':'author'}).span['class'])
Output:
['star0', 'sa2']
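To get just the second class for every author block, as the list comprehension in the question tries to do, the same lookup can be wrapped in a small helper that tolerates a missing span. This is only a sketch: `rating_of` is a hypothetical helper name, and the second `div` in the sample markup is made-up data added to show the case where no rating span exists.

```python
from bs4 import BeautifulSoup

# First div is from the question; the second is invented sample data
# with no <span>, to exercise the fallback path.
html = """<div class="author">
<a href="/member/?id=59465221" target="_blank">唐牛</a>
<span class="star0 sa2"></span></div>
<div class="author">
<a href="/member/?id=1" target="_blank">anon</a></div>"""

soup = BeautifulSoup(html, "html.parser")

def rating_of(author_div):
    """Return the second class of the div's first <span>, or None."""
    span = author_div.find("span")
    classes = span.get("class", []) if span else []
    return classes[1] if len(classes) > 1 else None

ratings = [
    {"name": div.a.get_text(strip=True), "rating": rating_of(div)}
    for div in soup.find_all("div", class_="author")
]
print(ratings)
```

With a guard like this, an author block without a rating yields None instead of raising an exception partway through the list comprehension.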