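I'm building a scraper to extract article titles and URLs. I tried running the code below, but I get the error shown in the title. Do I need to define a dictionary? What am I doing wrong?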
def get_page(page):
    from urllib.request import urlopen
    html = urlopen(page).read()
    p = str(html, encoding='utf-8')
    return p

def get_next_target(page):
    start_link = page.find('title may-blank" href=')
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote+1:end_quote]  # Gets Article URL
    start_title = page.find(">", end_quote)
    end_title = page.find("<", start_title)
    title = page[start_title+1:end_title]  # Gets Article Title
    return title, url, end_quote

def print_all_links(page):
    while True:
        url, endpos = get_next_target(page)
        if url:
            print("%s, %s" % (title, url))
            page = page[endpos:]
        else:
            break

reddit_url = 'http://www.reddit.com/r/worldnews'
print(print_all_links(reddit_url))
Answer 0 (score: 2)
The get_next_target function returns a tuple of 3 elements, but you unpack it into only 2 variables. Unpack all three instead:
    title, url, endpos = get_next_target(page)
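To see the mismatch in isolation, here is a minimal sketch. The function three_values is a hypothetical stand-in (it is not in the question) that returns three values, just as get_next_target does:

```python
def three_values():
    # Hypothetical stand-in for get_next_target: returns a 3-element tuple.
    return "some title", "http://example.com", 42

try:
    # Two names on the left, three values on the right: this raises ValueError.
    url, endpos = three_values()
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Three names match the three returned values, so this works.
title, url, endpos = three_values()

print(error_message)
print(title, url, endpos)
```

The number of names on the left of the assignment has to match the number of elements in the returned tuple.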
Answer 1 (score: 0)
Your problem is here (as the other answer already pointed out):
def print_all_links(page):
    while True:
        url, endpos = get_next_target(page)
        if url:
            print("%s, %s" % (title, url))
            page = page[endpos:]
        else:
            break
get_next_target(page) returns 3 elements, so you need this:
    title, url, endpos = get_next_target(page)
instead of:
    url, endpos = get_next_target(page)
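For reference, here is a rough sketch of the loop with the three-way unpacking applied, run against an inline HTML string rather than the live site. Several things here are assumptions and not part of the question: the names MARKER, collect_links and sample, the early return when the marker is not found, and starting the quote search after the marker (the marker text itself contains a quote character, so searching from its start would stop too early). The loop expects the page's HTML text, such as what get_page returns, not the URL string.

```python
MARKER = 'title may-blank" href='  # same marker text as in the question

def get_next_target(page):
    start_link = page.find(MARKER)
    if start_link == -1:
        # Assumption: signal "no more links" so the caller's loop can stop.
        return None, None, 0
    # Assumption: begin after the marker, which itself contains a quote.
    start_quote = page.find('"', start_link + len(MARKER))
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    start_title = page.find('>', end_quote)
    end_title = page.find('<', start_title)
    title = page[start_title + 1:end_title]
    return title, url, end_quote

def collect_links(page):
    # Hypothetical helper: gathers (title, url) pairs instead of printing.
    links = []
    while True:
        title, url, endpos = get_next_target(page)  # three names, three values
        if not url:
            break
        links.append((title, url))
        page = page[endpos:]
    return links

# Made-up sample markup that mimics the structure the question searches for.
sample = ('<a class="title may-blank" href="http://example.com/a">First</a>'
          '<a class="title may-blank" href="http://example.com/b">Second</a>')

for title, url in collect_links(sample):
    print("%s, %s" % (title, url))
```

Returning a list instead of printing inside the loop also avoids the stray None that print(print_all_links(...)) would show, since a function without a return statement returns None.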