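I am scraping IMDB for cast and crew members (the IMDB API does not have comprehensive cast/crew data). The end product I want is a three-column table that takes the data from all the tables on the page and arranges it like this: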
Produced by | Gary Kurtz | producer
Produced by | George Lucas | executive producer
Music by | John Williams |
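(Using Star Wars as the example: http://www.imdb.com/title/tt0076759/fullcredits?ref_=tt_cl_sm#cast)
The code below is almost there, but it produces a lot of unnecessary whitespace, and .parent is certainly being used the wrong way. What is the best way to find the h4 value above each table?
Here is the code.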
from bs4 import BeautifulSoup

with open(fname, 'r') as f:
    soup = BeautifulSoup(f.read(),'html5lib')
    soup.prettify()
for child in soup.find_all('td',{'class':'name'}):
    print child.parent.text, child.parent.parent.parent.parent.parent.parent.text.encode('utf-8')
I am trying to get values such as "Directed by" from these h4 headings.
Answer 0 (score: 1)
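Welcome to Stack Overflow. It looks like you can find the h4 and table elements together: they appear in pairs in the HTML, so you can zip them and loop over the pairs. After that, you only need to get the text and format it. Change your code to: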
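with open(fname, 'r') as f:
    soup = BeautifulSoup(f.read(), 'html5lib')
# h4 headings and tables appear in matching order, so pair them by position
for h4,table in zip(soup.find_all('h4'),soup.find_all('table')):
    # collapse runs of whitespace in the heading
    header4 = " ".join(h4.text.strip().split())
    # one string per row; the "..." separator cell becomes "|"
    table_data = [" ".join(tr.text.strip().replace("\n", "").replace("...", "|").split()) for tr in table.find_all('tr')]
    print("%s | %s \n" % (header4,table_data))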
This will print:
Directed by | [u'George Lucas']
Writing Credits | [u'George Lucas | (written by)']
Cast (in credits order) verified as complete | ['', u'Mark Hamill | Luke Skywalker', u'Harrison Ford | Han Solo', u'Carrie Fisher | Princess Leia Organa', u'Peter Cushing | Grand Moff Tarkin',...]
Produced by | [u'Gary Kurtz | producer', u'George Lucas | executive producer', u'Rick McCallum | producer (1997 special version)']
Music by | [u'John Williams']
...
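The output above still has one list per heading, while the question asks for one row per person. A minimal sketch of flattening it into three columns follows. It assumes a simplified, hypothetical version of the page markup (the `SAMPLE_HTML` string, the `credit` class on the role cell, and the helper names `clean` and `credit_rows` are illustrative, not taken from the real page), and it uses the built-in `html.parser` so that it runs without html5lib.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the full-credits markup.
SAMPLE_HTML = """
<div id="fullcredits_content">
<h4 class="dataHeaderWithBorder">Produced by </h4>
<table>
<tr><td class="name"><a href="#"> Gary Kurtz
</a></td><td>...</td><td class="credit">producer</td></tr>
<tr><td class="name"><a href="#"> George Lucas
</a></td><td>...</td><td class="credit">executive producer</td></tr>
</table>
<h4 class="dataHeaderWithBorder">Music by </h4>
<table>
<tr><td class="name"><a href="#"> John Williams
</a></td><td colspan="2"></td></tr>
</table>
</div>
"""


def clean(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return " ".join(text.split())


def credit_rows(html):
    """Return (heading, name, credit) tuples, pairing h4s and tables by position."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for h4, table in zip(soup.find_all("h4"), soup.find_all("table")):
        heading = clean(h4.get_text())
        for tr in table.find_all("tr"):
            name = tr.find("td", class_="name")
            if name is None:
                continue  # skip spacer rows that carry no name cell
            credit = tr.find("td", class_="credit")  # assumed class name
            rows.append((heading,
                         clean(name.get_text()),
                         clean(credit.get_text()) if credit else ""))
    return rows


rows = credit_rows(SAMPLE_HTML)
for row in rows:
    print(" | ".join(row))
```

Pairing by position only works while every h4 is followed by exactly one table; if the page has an h4 without a table (or the reverse), the pairs drift out of step.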
Answer 1 (score: 0)
This avoids using the parent function entirely.
from urllib.request import urlopen
from bs4 import BeautifulSoup
# this will find all headers, e.g. "Produced by"
def get_header(url):
    bsObj = BeautifulSoup(urlopen(url), "html.parser")
    headers = bsObj.find("div", {"id":"fullcredits_content"}).findAll("h4", {"class":"dataHeaderWithBorder"})
    return headers

# this will find all names, e.g. "Gary Kurtz"
def get_table(url):
    bsObj = BeautifulSoup(urlopen(url), "html.parser")
    table = bsObj.findAll("td", {"class":"name"})
    return table

url = "http://www.imdb.com/title/tt0076759/fullcredits"
header = get_header(url)
table = get_table(url)
#title = get_title(url)
for h in header:
    for t in table:
        print(h.get_text())
        print(t.get_text())
        print("............")
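Because the two loops are nested, the code above prints every header together with every name. One way to attach each name to its own heading without chaining .parent is BeautifulSoup's `find_previous`, which returns the nearest matching element that comes earlier in the document. The sketch below assumes each h4 sits directly before its table; `SAMPLE_HTML` and `names_with_headings` are hypothetical names used only for illustration, and the markup is a simplified stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the full-credits markup.
SAMPLE_HTML = """
<div id="fullcredits_content">
<h4 class="dataHeaderWithBorder">Directed by </h4>
<table><tr><td class="name"><a href="#"> George Lucas
</a></td></tr></table>
<h4 class="dataHeaderWithBorder">Music by </h4>
<table><tr><td class="name"><a href="#"> John Williams
</a></td></tr></table>
</div>
"""


def names_with_headings(html):
    """Return (heading, name) pairs, looking backwards from each name cell."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"id": "fullcredits_content"})
    pairs = []
    for td in container.find_all("td", {"class": "name"}):
        # nearest h4 earlier in the document, i.e. the heading above this table
        h4 = td.find_previous("h4")
        heading = " ".join(h4.get_text().split()) if h4 else ""
        pairs.append((heading, " ".join(td.get_text().split())))
    return pairs


pairs = names_with_headings(SAMPLE_HTML)
for heading, name in pairs:
    print(heading + " | " + name)
```

This also parses the page once instead of fetching it separately for headers and names.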