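Get the element that follows another element in a string

Time: 2020-11-04 13:58:57

Tags: python python-3.x text

I have some HTML that I parsed with BeautifulSoup, and I want to extract the `star0 sa2` value from the following: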

>>> short_comment[1]['name']

<div class="author">
   <a href="/member/?id=59465221" target="_blank">唐牛</a>
    <span class="star0 sa2"></span></div>

One thing I tried was a regex, star0\s[a-zA-Z0-9], but it returned nothing. Now I am trying to replace every < with > and then split the string on >:

>>> s = s.replace('<','>')
>>> s.split('>')
['', 'div class="author"', ' ', 'a href="/member/?id=59465221" target="_blank"', '唐牛', '/a', ' ', 'span class="star0 sa2"', '', '/span', '', '/div', '']
>>> s.find("star0")
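For comparison, here is a minimal sketch of how a regex with a capture group could pull out the token after star0. It assumes the fragment is held in a string named s, as in the session above, and that `re.search` (rather than `re.match`, which only matches at the start of the string) is used; the name rating_class is made up for illustration.

```python
import re

# The HTML fragment from the question, held as a plain string.
s = '''<div class="author">
   <a href="/member/?id=59465221" target="_blank">唐牛</a>
    <span class="star0 sa2"></span></div>'''

# Capture the whole class name that follows "star0" inside the class attribute.
# The "+" matters: [a-zA-Z0-9] on its own would match only a single character.
match = re.search(r'class="star0\s+([A-Za-z0-9]+)"', s)

# rating_class is a hypothetical name; it is None when the pattern is absent.
rating_class = match.group(1) if match else None
print(rating_class)  # sa2
```

A regex like this is brittle against attribute reordering or extra classes, so it is only a fallback when an HTML parser is not available.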

I also tried using BS4 to pull the class out of the elements that match the "author" class.

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0'}
base_url = 'https://www.nosetime.com'

def get_perfume_as_dict(url):
    print(base_url + url)
    response_unicode = requests.get(base_url + url, headers=headers)
    soup = BeautifulSoup(response_unicode.text, 'html.parser')
    perfume = {}
    # Pair each author block with its comment block, in document order.
    perfume["short_comment"] = [
        {"name": name.text,
         "rating": name.span['class'][1],  # second class of the <span>, e.g. 'sa2'
         "comment": comment.text}
        for name, comment in zip(
            soup.find_all('div', {'class': "author"}),
            soup.find_all('div', {'class': "hfshow1"}),
        )
    ]  # soup.find('li', {'id':'itemcomment'})  # soup.find_all('span ', {'class':'fav_cnt'})
    return perfume

But when I run it, it seems to get stuck in a loop:

get_perfume_as_dict("/xiangshui/350870-oulong-atelier-cologne-oolang-infini.html")

1 answer:

Answer 0: (score: 0)

Query your HTML with BeautifulSoup.

For example:

from bs4 import BeautifulSoup

short_comment = """<div class="author">
   <a href="/member/?id=59465221" target="_blank">唐牛</a>
    <span class="star0 sa2"></span></div>"""
   
soup = BeautifulSoup(short_comment, "html.parser")
print(soup.find("div", {'class':'author'}).span['class'])

Output:

['star0', 'sa2']
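
To get only the second class (sa2) for every author block on a page, the same idea can be wrapped in a small helper. This is a sketch under assumptions: the two-block HTML sample and the helper name rating_class are invented for illustration, and the second block deliberately has no span so that the missing-rating case is visible.

```python
from bs4 import BeautifulSoup

# Made-up sample: the first author block has a rating span, the second does not.
html = """
<div class="author">
   <a href="/member/?id=59465221" target="_blank">唐牛</a>
    <span class="star0 sa2"></span></div>
<div class="author">
   <a href="/member/?id=1" target="_blank">someone</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

def rating_class(author_div):
    """Return the class that follows 'star0' on the span, or None if absent."""
    span = author_div.find("span", class_="star0")
    if span is None:
        return None
    classes = span.get("class", [])
    return classes[1] if len(classes) > 1 else None

ratings = [rating_class(div) for div in soup.find_all("div", class_="author")]
print(ratings)  # ['sa2', None]
```

Checking for a missing span avoids the TypeError that name.span['class'] raises when an author block has no rating.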