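Hi, I tried opening the link below in a browser and it works, but it does not work from code. The link is actually a combination of the news site's address and the article's path, which is read from another file, url.txt. I tried the same code with a normal website (www.google.com) and it works perfectly.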
import sys
import MySQLdb
from mechanize import Browser
from bs4 import BeautifulSoup, SoupStrainer
from nltk import word_tokenize
from nltk.tokenize import *
import urllib2
import nltk, re, pprint
import mechanize #html form filling
import lxml.html
with open("url.txt","r") as f:
    first_line = f.readline()
#print first_line
url = "http://channelnewsasia.com/&s" + (first_line)
t = lxml.html.parse(url)
print t.find(".//title").text
This is the error I get.
http://tinypic.com/r/2cfd460/8
This is the content of url.txt:
/news/asiapacific/australia-to-send-armed/1284790.html
Answer 0 (score: 1)
This is because of the &s part of the URL, which is definitely not needed:
url = "http://channelnewsasia.com" + first_line
Also, it is better to use urljoin() to join the URL parts:
from urlparse import urljoin
import lxml.html
BASE_URL = "http://channelnewsasia.com"
with open("url.txt") as f:
    # strip() drops the trailing newline that readline() keeps
    first_line = f.readline().strip()
url = urljoin(BASE_URL, first_line)
t = lxml.html.parse(url)
print t.find(".//title").text
Prints:
Australia to send armed personnel to MH17 site - Channel NewsAsia
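To see the difference without hitting the network, here is a minimal sketch that only builds the two URLs and prints them. The path is the one from url.txt; the names broken_url and fixed_url are just illustrative, and the try/except import is an assumption added so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as in the answer above

BASE_URL = "http://channelnewsasia.com"

# Same content as url.txt, including the newline that readline() returns.
first_line = "/news/asiapacific/australia-to-send-armed/1284790.html\n"
path = first_line.strip()

# What the question's code builds: a stray "&s" segment before the real path.
broken_url = "http://channelnewsasia.com/&s" + path

# What urljoin() builds: host plus the absolute path, nothing else.
fixed_url = urljoin(BASE_URL, path)

print(broken_url)
print(fixed_url)
```

The first URL contains an extra /&s segment in front of the article path, while the second is the plain article address that opens in the browser.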