Question

我正在尝试使用Beautiful Soup来收集rottentomatoes.com的电影报价。页面源是有趣的，因为引号直接由span类＆＃34;粗体quote_actor＆＃34;继续，但引用本身是在没有类的范围内，例如（https://www.rottentomatoes.com/m/happy_gilmore/quotes/）： screenshot of web source

我想使用Beautiful Soup的find_all来捕捉所有引号，而没有演员的名字。我尝试了许多没有成功的事情，例如：

＆＃13;

moviequotes = soup(input)
for t in web_soup.findAll('span', {'class':'bold quote_actor'}):
    for item in t.parent.next_siblings:
        if isinstance(item, Tag):
            if 'class' in item.attrs and 'name' in item.attrs['class']:
                break
            print (item)

＆＃13;

我非常感谢有关如何导航此代码以及将生成的纯文本引号定义为与Pandas等一起使用的对象的任何提示。

Answer 1

我正在使用CSS选择器查找包含引号的spans：div span + span。这会找到span内的任何div元素，并且具有span类型的直接同级元素。

这样我也会得到包含演员姓名的span，所以我会通过检查他们是否有class或style属性来过滤它们。

import bs4
import requests

url  = 'https://www.rottentomatoes.com/m/happy_gilmore/quotes/'
page = requests.get(url).text
soup = bs4.BeautifulSoup(page, 'lxml')

# CSS selector
selector = 'div span + span'

# find all the span elements which are a descendant of a div element
# and are a direct sibling of another span element 
quotes = soup.select(selector)

# now filter out the elements with actor names
data = []

for q in quotes:
    # only keep elements that don't have a class or style attribute
    if not (q.has_attr('class') or q.has_attr('style')):
        data.append(q)

for d in data:
    print(d.text)

美丽的汤 - 选择下一个跨度元素的文本，没有类

1 个答案: