Question

How do I extract information from the IMDb "born today" page?

I have looked at the following question, but it has no answers.

Webscraping an IMDb page using BeautifulSoup

I tried the following code:

import urllib2
from bs4 import BeautifulSoup

test_url='https://m.imdb.com/feature/bornondate'

url=urllib2.urlopen(test_url)
html_text=url.read()

soup=BeautifulSoup(html_text)

poster=soup.find('a','poster')
print poster
print type(poster)
print type(soup)
print html_text
url.close()

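I want to find at least one element before I wrap the logic in a loop.

The HTML of the page is shown below. Printing poster and type(poster) gives me None. Please help me find what is missing in my code.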

<section class="posters list">
<h1>January 18</h1>

<a href="/name/nm0000126/" class="poster "><img src="https://images-na.ssl-images-amazon.com/images/M/MV5BMTQ0MDU1OTEyNF5BMl5BanBnXkFtZTgwNjI0MTk2MDE@._V1._CR0,0,419,618_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Kevin Costner</span><div class="detail">Actor, "Dances with Wolves"</div></div></a>

Thanks, Phani.

Answer 1

This will only give you the actors on the first page. If you want all of the actors/actresses, you will have to use selenium or some other library.

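For single-digit days, check whether the page writes the date as 0d or d: strftime('%B %d') zero-pads the day, so the lookup fails if the page does not.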
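The padding check above can be sketched as follows. This is a minimal illustration, not part of the original answer: the helper names date_headings and find_heading_index are hypothetical, and the sketch assumes the date heading appears as its own entry in a list of text lines.

```python
from datetime import date


def date_headings(day):
    """Return both spellings of a date heading: zero-padded and unpadded."""
    padded = day.strftime('%B %d')                          # e.g. 'January 08'
    unpadded = '{} {}'.format(day.strftime('%B'), day.day)  # e.g. 'January 8'
    return padded, unpadded


def find_heading_index(lines, day):
    """Return the index of the first matching spelling in lines, or None."""
    for candidate in date_headings(day):
        if candidate in lines:
            return lines.index(candidate)
    return None


# Hypothetical page text in which the day is written without a leading zero.
sample_lines = ['Born Today', 'January 8', 'Some Name Actor']
print(find_heading_index(sample_lines, date(2017, 1, 8)))
```

Trying both spellings avoids the ValueError that list.index raises when the single expected spelling is absent.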

For the actors/actresses on the first page, you can try something like this with dryscrape:

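import re
import dryscrape
from bs4 import BeautifulSoup
from datetime import datetime

# Date heading to look for, e.g. 'January 18' (the day is zero-padded)
todays_date = datetime.today().strftime('%B %d')

test_url='https://m.imdb.com/feature/bornondate'
# Load the page in a dryscrape session and parse the resulting HTML
sess = dryscrape.Session()
sess.visit(test_url)
soup = BeautifulSoup(sess.body())
# Non-empty lines of the page text, stripped of surrounding whitespace
l1 = [a.strip() for a in soup.text.split('\n') if a.strip()]
# The line after the date heading holds the people, split here on commas
idx = l1.index(todays_date)
l2 = [a.strip() for a in l1[idx+1].split(',')]
# Drop everything up to the last double quote (the quoted title)
l3 = [re.sub(r'.*"', '',a) for a in l2]
# Insert a space before the role so that name and role are separated
l4 = [re.sub(r'(Actor|Actress)', r' \1', a) for a in l3]
# Keep only the entries that end with 'Actor' or 'Actress'
l5 = [a for a in l4 if a.endswith('Actor') or a.endswith('Actress')]
l5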

Output:

[u'Logan Lerman Actor',
 u'Katey Sagal Actress',
 u'Drea de Matteo Actress',
 u'Tippi Hedren Actress',
 u'Jodie Sweetin Actress',
 u'Elizabeth Tulloch Actress',
 u'Marsha Thomason Actress',
 u'Erin Sanders Actress',
 u'Mickey Sumner Actress']
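As a possible alternative to splitting the page text, the entries could be read by tag and class. The sketch below is not part of the original answer. It assumes the HTML being parsed (for example the string returned by sess.body()) contains the markup quoted in the question; the class names poster, title and detail are taken from that snippet, the function name parse_posters is hypothetical, and the image URL is shortened to a placeholder.

```python
from bs4 import BeautifulSoup

# Markup based on the snippet in the question; the img src is a placeholder.
SAMPLE_HTML = '''
<section class="posters list">
<h1>January 18</h1>
<a href="/name/nm0000126/" class="poster "><img src="poster.jpg" width="40" height="59"><div class="label"><span class="title">Kevin Costner</span><div class="detail">Actor, "Dances with Wolves"</div></div></a>
</section>
'''


def parse_posters(html):
    """Collect name, detail text and link for every <a class="poster"> entry."""
    soup = BeautifulSoup(html, 'html.parser')
    people = []
    for link in soup.find_all('a', class_='poster'):
        title = link.find('span', class_='title')
        detail = link.find('div', class_='detail')
        if title is None or detail is None:
            continue  # skip entries that do not have the expected structure
        people.append({
            'name': title.get_text(strip=True),
            'detail': detail.get_text(strip=True),
            'href': link.get('href'),
        })
    return people


people = parse_posters(SAMPLE_HTML)
print(people)
```

If this returns an empty list for the real page, the HTML that was fetched probably does not contain the poster links, which would be consistent with soup.find returning None in the question.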
