How can I extract information from the IMDb borntoday page?
I have seen this question, but it has no answer there.
Webscraping an IMDb page using BeautifulSoup
I tried the following code:
import urllib2
from bs4 import BeautifulSoup
test_url='https://m.imdb.com/feature/bornondate'
url=urllib2.urlopen(test_url)
html_text=url.read()
soup=BeautifulSoup(html_text)
poster=soup.find('a','poster')
print poster
print type(poster)
print type(soup)
print html_text
url.close()
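I want to find at least one element before I put the logic in a loop.
The HTML page content is shown below. Printing poster and type(poster) gives me None. Please help me see what I am missing in the code.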
<section class="posters list">
<h1>January 18</h1>
<a href="/name/nm0000126/" class="poster "><img src="https://images-na.ssl-images-amazon.com/images/M/MV5BMTQ0MDU1OTEyNF5BMl5BanBnXkFtZTgwNjI0MTk2MDE@._V1._CR0,0,419,618_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Kevin Costner</span><div class="detail">Actor, "Dances with Wolves"</div></div></a>
Thanks, Phani.
Answer 0 (score: 0)
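This will only give you the actors on the first page. If you want all of the actors/actresses, you will have to use selenium or some other library.
Check whether the page writes a single-digit day as 0d (zero-padded) or d, because strftime('%B %d') produces the zero-padded form.
For the actors/actresses on the first page, you can try something like this with dryscrape: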
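import re
import dryscrape
from bs4 import BeautifulSoup
from datetime import datetime
# Date header as shown on the page, e.g. 'January 18' (%d is zero-padded)
todays_date = datetime.today().strftime('%B %d')
test_url='https://m.imdb.com/feature/bornondate'
# Load the page in a dryscrape session and parse the resulting HTML
sess = dryscrape.Session()
sess.visit(test_url)
soup = BeautifulSoup(sess.body())
# Non-empty, stripped lines of the page text
l1 = [a.strip() for a in soup.text.split('\n') if a.strip()]
# Take the line right after the date header and split it on commas
idx = l1.index(todays_date)
l2 = [a.strip() for a in l1[idx+1].split(',')]
# Drop everything up to the last double quote (the quoted title)
l3 = [re.sub(r'.*"', '',a) for a in l2]
# Insert a space before 'Actor' / 'Actress'
l4 = [re.sub(r'(Actor|Actress)', r' \1', a) for a in l3]
# Keep only the entries that end with 'Actor' or 'Actress'
l5 = [a for a in l4 if a.endswith('Actor') or a.endswith('Actress')]
l5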
Output:
[u'Logan Lerman Actor',
u'Katey Sagal Actress',
u'Drea de Matteo Actress',
u'Tippi Hedren Actress',
u'Jodie Sweetin Actress',
u'Elizabeth Tulloch Actress',
u'Marsha Thomason Actress',
u'Erin Sanders Actress',
u'Mickey Sumner Actress']
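The date-format caveat can be handled by trying both spellings of the header. The following is only a minimal sketch: the helper names header_candidates and find_header_index are hypothetical and not part of the answer's code, the date in the usage note is arbitrary, and it assumes the default (English) locale so that '%B' yields names like 'January'.

```python
from datetime import date


def header_candidates(day):
    """Return the zero-padded and unpadded spellings of a date header.

    Hypothetical helper: for the 8th of January it returns
    ['January 08', 'January 8']; for two-digit days both spellings
    coincide, so a single entry is returned.
    """
    month = day.strftime('%B')  # full month name, locale-dependent
    padded = '%s %02d' % (month, day.day)
    unpadded = '%s %d' % (month, day.day)
    return [padded] if padded == unpadded else [padded, unpadded]


def find_header_index(lines, day):
    """Return the index of the first line equal to either spelling, or -1."""
    for candidate in header_candidates(day):
        if candidate in lines:
            return lines.index(candidate)
    return -1
```

With something like this, idx = find_header_index(l1, date.today()) could stand in for l1.index(todays_date), and a result of -1 would signal that the header was not found instead of raising ValueError.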