I'm trying to find the first 30 TED videos (video name and URL) with the following BeautifulSoup script:
import urllib2
from BeautifulSoup import BeautifulSoup

total_pages = 3
page_count = 1
count = 1

url = 'http://www.ted.com/talks?page='

while page_count < total_pages:
    page = urllib2.urlopen("%s%d") %(url, page_count)
    soup = BeautifulSoup(page)
    link = soup.findAll(lambda tag: tag.name == 'a' and tag.findParent('dt', 'thumbnail'))

    outfile = open("test.html", "w")

    print >> outfile, """<html>
<head>
<title>TED Talks Index</title>
</head>
<body>
<br><br><center>
<table cellpadding=15 cellspacing=0 style='border:1px solid #000;'>"""
    print >> outfile, "<tr><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'><b>###</b></th><th style='border-bottom:2px solid #E16543; border-right:1px solid #000;'>Name</th><th style='border-bottom:2px solid #E16543;'>URL</th></tr>"
    ted_link = 'http://www.ted.com/'
    for anchor in link:
        print >> outfile, "<tr style='border-bottom:1px solid #000;'><td style='border-right:1px solid #000;'>%s</td><td style='border-right:1px solid #000;'>%s</td><td>http://www.ted.com%s</td></tr>" % (count, anchor['title'], anchor['href'])
        count = count + 1

    print >> outfile, """</table>
</body>
</html>"""
    page_count = page_count + 1
The code looks fine, except for two things:

1. The count doesn't seem to increase. It only goes through and finds the contents of the first page, i.e., the first ten videos rather than thirty. Why is that?

2. This code gives me a lot of errors. I don't know how to implement what I want here logically (using urlopen("%s%d")):
Code:

total_pages = 3
page_count = 1
count = 1

url = 'http://www.ted.com/talks?page='

while page_count < total_pages:
    page = urllib2.urlopen("%s%d") %(url, page_count)
Answer 0 (score: 1)
First, simplify the loop and get rid of a few variables that are just boilerplate here:
for pagenum in xrange(1, 4):  # The 4 is annoying; write it as 3 + 1 if you like.
    url = "http://www.ted.com/talks?page=%d" % pagenum
    # do stuff with url
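As for the error in the question: in `urllib2.urlopen("%s%d") % (url, page_count)`, the formatting happens *after* the call, so `urlopen` is asked to fetch the literal URL `"%s%d"`, and the `%` operator is then applied to the response object. Build the string first, then open it. A minimal sketch of just the string handling (no network access):

```python
url = 'http://www.ted.com/talks?page='
page_count = 1

# Wrong: urllib2.urlopen("%s%d") % (url, page_count)
#   -> urlopen() receives the literal string "%s%d", and the %
#      operator is then applied to the response object, which fails.
# Right: format the URL first, then pass the result to urlopen():
full_url = "%s%d" % (url, page_count)
print(full_url)  # http://www.ted.com/talks?page=1
```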
But let's open the file outside the loop instead of reopening it on every iteration. That's why you only saw 10 results: following your logic, the file ended up containing talks 11-20 rather than the first 10. (It would have been 21-30, except that you loop on page_count < total_pages, which only processes the first two pages.)
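The truncation behavior is easy to see in isolation. A small sketch (using a temporary file rather than the script's test.html): opening a file in "w" mode inside the loop discards everything written by earlier iterations, so only the last page's rows survive.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.html")

for page in (1, 2):
    f = open(path, "w")            # "w" truncates on every iteration
    f.write("page %d\n" % page)
    f.close()

print(open(path).read())  # page 2   -- only the last iteration survives
```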
Collect all the links first, then write the output. I've removed the HTML styling, which also makes the code easier to follow; use CSS instead, perhaps in an inline <style> element, or add the attributes back if you prefer.
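For example, the removed styling could be restored in one place with an inline <style> element in the generated header. This is only a sketch reusing the colors and widths from the original attributes; adjust selectors to taste:

```html
<style>
  table  { border: 1px solid #000; border-collapse: collapse; }
  th     { border-bottom: 2px solid #E16543; }
  th, td { padding: 15px; border-right: 1px solid #000; }
</style>
```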
import urllib2
from cgi import escape  # Important!
from BeautifulSoup import BeautifulSoup

def is_talk_anchor(tag):
    return tag.name == "a" and tag.findParent("dt", "thumbnail")

links = []
for pagenum in xrange(1, 4):
    soup = BeautifulSoup(urllib2.urlopen("http://www.ted.com/talks?page=%d" % pagenum))
    links.extend(soup.findAll(is_talk_anchor))

out = open("test.html", "w")

print >>out, """<html><head><title>TED Talks Index</title></head>
<body>
<table>
<tr><th>#</th><th>Name</th><th>URL</th></tr>"""

for x, a in enumerate(links):
    print >>out, "<tr><td>%d</td><td>%s</td><td>http://www.ted.com%s</td></tr>" % (x + 1, escape(a["title"]), escape(a["href"]))

print >>out, "</table>"

# Or, as an ordered list:
print >>out, "<ol>"
for a in links:
    print >>out, """<li><a href="http://www.ted.com%s">%s</a></li>""" % (escape(a["href"], True), escape(a["title"]))
print >>out, "</ol>"

print >>out, "</body></html>"
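A note on that escape import marked "Important!": talk titles and hrefs can contain characters with special meaning in HTML, and writing them out unescaped would corrupt the generated markup. A quick illustration with a made-up title (cgi.escape was removed in Python 3.8; html.escape is the modern stand-in, so this sketch tries both):

```python
try:
    from cgi import escape   # Python 2, as used in the answer above
except ImportError:
    from html import escape  # cgi.escape was removed in Python 3.8

# A title containing HTML metacharacters is neutralized by escaping.
title = 'Ideas <worth> spreading & sharing'
print(escape(title))  # Ideas &lt;worth&gt; spreading &amp; sharing
```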