I'm learning Python and Beautiful Soup, and as an exercise I'm working with a test web page. My goal is to extract a URL from the page, then follow that URL and extract another URL from the next page.
My code is as follows:
Step 1:
import re
import urllib2
from bs4 import BeautifulSoup as bs

path = "http://python-data.dr-chuck.net/known_by_Fikret.html"
pattern = re.compile(r'"(.+)"')
page = urllib2.urlopen(path)
soup = bs(page, 'lxml')
a = soup.find_all("a")
path = re.search(pattern, str(a[2])).group(0)
path
Output:
'"http://python-data.dr-chuck.net/known_by_Montgomery.html"'
Step 2:
page = urllib2.urlopen(path)
soup = bs(page, 'lxml')
a = soup.find_all("a")
path = re.search(pattern, str(a[2])).group(0)
path
Output:
---------------------------------------------------------------------------
URLError Traceback (most recent call last)
<ipython-input-33-14ad9508aea0> in <module>()
----> 1 page = urllib2.urlopen(path)
2 soup = bs(page, 'lxml')
3 a = soup.find_all("a")
4 path = re.search(pattern, str(a[2])).group(0)
5 path
C:\users\alex\Anaconda2\lib\urllib2.pyc in urlopen(url, data, timeout, cafile, capath, cadefault, context)
152 else:
153 opener = _opener
--> 154 return opener.open(url, data, timeout)
155
156 def install_opener(opener):
C:\users\alex\Anaconda2\lib\urllib2.pyc in open(self, fullurl, data, timeout)
427 req = meth(req)
428
--> 429 response = self._open(req, data)
430
431 # post-process response
C:\users\alex\Anaconda2\lib\urllib2.pyc in _open(self, req, data)
450
451 return self._call_chain(self.handle_open, 'unknown',
--> 452 'unknown_open', req)
453
454 def error(self, proto, *args):
C:\users\alex\Anaconda2\lib\urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
405 func = getattr(handler, meth_name)
406
--> 407 result = func(*args)
408 if result is not None:
409 return result
C:\users\alex\Anaconda2\lib\urllib2.pyc in unknown_open(self, req)
1264 def unknown_open(self, req):
1265 type = req.get_type()
-> 1266 raise URLError('unknown url type: %s' % type)
1267
1268 def parse_keqv_list(l):
URLError: <urlopen error unknown url type: "http>
Why does urlopen fail to recognize the URL?
Any suggestions would be appreciated.
Answer 0 (score: 2)
I think the problem is that your path is
'"http://python-data.dr-chuck.net/known_by_Montgomery.html"'
Use strip() to remove the quotes, like so:
path = path.strip('"')
page = urllib2.urlopen(path)
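A minimal sketch of the fix, using only the string itself (no network access needed to see the effect):

```python
# The regex match kept the surrounding quote characters, so urlopen
# was handed '"http...' and could not recognize the scheme.
path = '"http://python-data.dr-chuck.net/known_by_Montgomery.html"'

# str.strip('"') removes quote characters from both ends of the string
clean = path.strip('"')
print(clean)  # http://python-data.dr-chuck.net/known_by_Montgomery.html
```

This matches the error message: `unknown url type: "http` shows the scheme urlopen actually saw started with a quote.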
You can use BeautifulSoup to extract the href from the anchor tag directly; you don't need a regular expression for this.
Example:
>>> from bs4 import BeautifulSoup as bs
>>> soup = bs('<a href="http://www.google.com">', 'lxml')
>>> soup.find_all('a')[0]['href']
'http://www.google.com'
Answer 1 (score: 1)
Use .group(1) when retrieving the result of the regular-expression match. .group(0) returns the entire matched string, including the quotes.
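The difference can be sketched with the question's own pattern applied to a sample anchor tag (no network needed):

```python
import re

pattern = re.compile(r'"(.+)"')
tag = '<a href="http://python-data.dr-chuck.net/known_by_Montgomery.html">'
m = re.search(pattern, tag)

# group(0) is the whole match, quotes included; group(1) is just the
# capture group between the quotes
print(m.group(0))  # "http://python-data.dr-chuck.net/known_by_Montgomery.html"
print(m.group(1))  # http://python-data.dr-chuck.net/known_by_Montgomery.html
```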
Answer 2 (score: 1)
path.strip('"')
Out:
'http://python-data.dr-chuck.net/known_by_Montgomery.html'
The URL is invalid as extracted; just strip the " characters from it, or adjust the regular expression.
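The "adjust the regular expression" option from this answer could look like the following sketch: using `[^"]+` keeps the match inside one pair of quotes, and reading the capture group leaves the quotes out of the result.

```python
import re

tag = '<a href="http://python-data.dr-chuck.net/known_by_Montgomery.html">Montgomery</a>'
# [^"]+ matches everything between one pair of quotes;
# group(1) returns the capture group without the quotes themselves
url = re.search(r'"([^"]+)"', tag).group(1)
print(url)  # http://python-data.dr-chuck.net/known_by_Montgomery.html
```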
Answer 3 (score: 1)
Your problem is that you have " characters in the URL. Remove them.
But BeautifulSoup has its own way to get the URL: a[2]['href']
from bs4 import BeautifulSoup as bs
import urllib2
# - first page -
path = "http://python-data.dr-chuck.net/known_by_Fikret.html"
page = urllib2.urlopen(path)
soup = bs(page, 'lxml')
all_links = soup.find_all("a")
#for link in all_links:
# print link['href']
print all_links[2]['href']
# - second page -
path = all_links[2]['href']
page = urllib2.urlopen(path)
soup = bs(page, 'lxml')
all_links = soup.find_all("a")
#for link in all_links:
# print link['href']
print all_links[2]['href']
Or, shorter:
from bs4 import BeautifulSoup as bs
import urllib2
def get_url(path):
    page = urllib2.urlopen(path)
    soup = bs(page, 'lxml')
    all_links = soup.find_all("a")
    #for link in all_links:
    #    print link['href']
    return all_links[2]['href']
# - first page -
path = "http://python-data.dr-chuck.net/known_by_Fikret.html"
path = get_url(path)
print path
# - second page -
path = get_url(path)
print path
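If bs4 is not available, the same href extraction can be sketched with the standard library alone. This is an illustration, not part of the answers above; it uses Python 3's html.parser (the answers' code is Python 2, where urllib2 and print statements still exist), and the sample HTML is made up.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, mimicking soup.find_all('a')."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == 'a':
            self.links.append(dict(attrs).get('href'))

parser = LinkCollector()
parser.feed('<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>')
print(parser.links)  # ['http://example.com/a', 'http://example.com/b']
```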