Question

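I scraped the following markup from IMDB's mobile site using BeautifulSoup and Python 2.7.

I want to create a separate object holding the episode number '1', the title 'Winter Is Coming', and the IMDB score '8.9'. I can't figure out how to split the episode number from the title.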

   <a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
     <span class="text-large">
      1.
      <strong>
       Winter Is Coming
      </strong>
     </span>
     <br/>
     <span class="mobile-sprite tiny-star">
     </span>
     <strong>
      8.9
     </strong>
     17 Apr. 2011
    </a>

Answer 1

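You can use find to locate the span with the text-large class and, from there, the specific elements you need.

Once you have that span, you can use next (an alias of next_element) to grab the node that follows it, which holds the episode number, and find to locate the strong tag that contains the title.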

html = """
<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
     <span class="text-large">
      1.
      <strong>
       Winter Is Coming
      </strong>
     </span>
     <br/>
     <span class="mobile-sprite tiny-star">
     </span>
     <strong>
      8.9
     </strong>
     17 Apr. 2011
    </a>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
span = soup.find('span', class_='text-large')
ep = span.next_element.strip()  # first text node inside the span: the episode number
title = span.find('strong').text.strip()

print(ep)
print(title)

> 1. 
> Winter Is Coming
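
The question asks for a single object holding the episode number, title, and score. A minimal sketch of that last step, assuming the raw strings look like the ones in the question's markup; the `Episode` tuple and the `make_episode` helper are illustrative names, not part of BeautifulSoup:

```python
from collections import namedtuple

# Hypothetical container for one episode; the field names are illustrative.
Episode = namedtuple("Episode", ["number", "title", "score"])

def make_episode(ep_text, title_text, score_text):
    """Turn the raw scraped strings into typed fields."""
    number = int(ep_text.strip().rstrip("."))  # "1." -> 1
    return Episode(number, title_text.strip(), float(score_text.strip()))

episode = make_episode("1.", "Winter Is Coming", "8.9")
print(episode.number, episode.title, episode.score)
```

A namedtuple keeps the three values together while still allowing access by field name, which is probably enough for a record this small.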

Answer 2

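For each a with class="btn-full", you can use the span classes to get the tags you want. The strong tag is a child of the span with the text-large class, so you only need to call .strong.text on that tag. For the span with the CSS classes mobile-sprite tiny-star, you need to find the next strong tag, because it is a sibling of the span rather than a child: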

h = """<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
     <span class="text-large">
      1.
      <strong>
       Winter Is Coming
      </strong>
     </span>
     <br/>
     <span class="mobile-sprite tiny-star">
     </span>
     <strong>
      8.9
     </strong>
     17 Apr. 2011
    </a>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(h, "html.parser")
title = soup.select_one("span.text-large").strong.text.strip()
score = soup.select_one("span.mobile-sprite.tiny-star").find_next("strong").text.strip()

print(title, score)

Which gives you:

(u'Winter Is Coming', u'8.9')

If you really want the episode number as well, the simplest way is to split the text once:

soup = BeautifulSoup(h, "html.parser")
ep, title = soup.select_one("span.text-large").text.split(None, 1)
score = soup.select_one("span.mobile-sprite.tiny-star").find_next("strong").text.strip()

print(ep, title.strip(), score)
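
If the page holds one such anchor per episode (an assumption; only the single anchor from the question is used here), the same selectors can be applied to each `a.btn-full` in turn. A rough sketch, where `parse_episodes` is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

h = """<a class="btn-full" href="/title/tt1480055?ref_=m_ttep_ep_ep1">
     <span class="text-large">
      1.
      <strong>
       Winter Is Coming
      </strong>
     </span>
     <br/>
     <span class="mobile-sprite tiny-star">
     </span>
     <strong>
      8.9
     </strong>
     17 Apr. 2011
    </a>
"""

def parse_episodes(markup):
    """Return one dict per a.btn-full anchor (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    episodes = []
    for a in soup.select("a.btn-full"):
        # Split the span text once on whitespace: number first, then title.
        ep, title = a.select_one("span.text-large").text.split(None, 1)
        # The score sits in the strong tag that follows the star span.
        score = a.select_one("span.tiny-star").find_next("strong").text
        episodes.append({
            "episode": int(ep.rstrip(".")),
            "title": title.strip(),
            "score": float(score.strip()),
        })
    return episodes

print(parse_episodes(h))
```

Searching from each anchor rather than from the whole soup keeps a title paired with its own score when several anchors are present.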

Answer 3

Fetch the page HTML with requests and search it with regular expressions.

import re
import requests

frame = 'http://www.imdb.com/title/tt1480055?ref_=m_ttep_ep_ep1'
f = requests.get(frame)
helpme = f.text
result = re.findall('itemprop="name" class="">(.*?)&nbsp;', helpme)
result2 = re.findall('"ratingCount">(.*?)</span>', helpme)
result3 = re.findall('"ratingValue">(.*?)</span>', helpme)
print result[0].encode('utf-8')
print result2[0]
print result3[0]

Output:

Winter Is Coming
24,474
9.0
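
`re.findall` returns an empty list when a pattern no longer matches the page, so indexing with `[0]` would raise `IndexError`. A small defensive sketch using `re.search` instead; `first_match` is a hypothetical helper, and the sample string is a made-up stand-in for the downloaded page, not real IMDB markup:

```python
import re

def first_match(pattern, text):
    """Return the first captured group, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

# Made-up stand-in for the fetched HTML, shaped to fit the answer's patterns.
sample = '<span itemprop="ratingValue">9.0</span>'

rating = first_match('"ratingValue">(.*?)</span>', sample)
count = first_match('"ratingCount">(.*?)</span>', sample)  # absent in the sample
print(rating, count)
```

Returning None lets the caller decide how to handle a missing field instead of crashing partway through a scrape.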

Parsing IMDB with BeautifulSoup