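Hi, this may look like a repost, but it isn't. I recently posted a similar question, and this is a separate problem related to that one. As you can see from the previous question (LXML unable to retrieve webpage with error "failed to load HTTP resource"), I can now read and print the article if the link is on the first line of the file. However, as soon as I try to do this for more than one link, it returns an error (http://tinypic.com/r/2rr2mau/8).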
import lxml.html
from urlparse import urljoin  # Python 2; on Python 3 use urllib.parse

def fetch_article_content_cna(i):
    BASE_URL = "http://channelnewsasia.com"
    f = open('cnaurl2.txt')
    line = f.readlines()
    print line[i]
    url = urljoin(BASE_URL, line[i])
    t = lxml.html.parse(url)
    #print t.find(".//title").text
    content = '\n'.join(t.xpath('.//div[@class="news_detail"]/div/p/text()'))
    return content
Contents of cnaurl2.txt:
/news/world/tripoli-fire-rages-as/1287826.html
/news/asiapacific/korea-ferry-survivors/1287508.html
Answer 0 (score: 0)
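readlines() keeps the trailing newline on each line, so the joined URL probably ends with "\n". Try stripping it before joining:
url = urljoin(BASE_URL, line[i].strip())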
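
To illustrate the suggestion, here is a minimal sketch of building every URL from the file's lines with the whitespace stripped first. It assumes Python 3 (hence the `urllib.parse` import), and `build_urls` and `sample_lines` are hypothetical names introduced only for this example. The sample list stands in for what `readlines()` would return for the two-line cnaurl2.txt shown in the question.

```python
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

BASE_URL = "http://channelnewsasia.com"


def build_urls(lines, base_url=BASE_URL):
    """Join each non-empty line onto base_url after stripping whitespace.

    Hypothetical helper for illustration; not part of the original code.
    """
    urls = []
    for raw in lines:
        path = raw.strip()  # drop the trailing "\n" that readlines() keeps
        if path:            # skip blank lines, e.g. an empty last line
            urls.append(urljoin(base_url, path))
    return urls


# Lines as readlines() would return them, newline included.
sample_lines = [
    "/news/world/tripoli-fire-rages-as/1287826.html\n",
    "/news/asiapacific/korea-ferry-survivors/1287508.html\n",
]

urls = build_urls(sample_lines)
for url in urls:
    print(repr(url))  # repr() makes any stray whitespace visible
```

Printing with `repr()` is a quick way to check whether a URL still carries a newline before it is handed to `lxml.html.parse`.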