Google scraping: href values

Time: 2018-01-11 14:04:24

Tags: python python-2.7 web-scraping beautifulsoup

I'm having trouble finding href values with BeautifulSoup:

from urllib import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("https://www.google.pl/search?q=sprz%C4%99t+dla+graczy&client=ubuntu&ei=4ypXWsi_BcLZwQKGroW4Bg&start=0&sa=N&biw=741&bih=624")
bsObj = BeautifulSoup(html)
for link in bsObj.find("h3", {"class":"r"}).findAll("a"):
  if 'href' in link.attrs:
    print(link.attrs['href'])

I keep getting this error:

AttributeError: 'NoneType' object has no attribute 'findAll'

1 answer:

Answer 0 (score: 3)

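You have to change the User-Agent string to something other than urllib's default user agent. With the default one, the page you get back apparently contains no h3 element with class "r", so find() returns None and calling findAll on it raises the AttributeError.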

from urllib2 import urlopen, Request
from bs4 import BeautifulSoup

url = "https://www.google.pl/search?q=sprz%C4%99t+dla+graczy&client=ubuntu&ei=4ypXWsi_BcLZwQKGroW4Bg&start=0&sa=N&biw=741&bih=624"
html = urlopen(Request(url, headers={'User-Agent':'Mozilla/5'})).read()
bsObj = BeautifulSoup(html, 'html.parser')

for link in bsObj.find("h3", {"class":"r"}).findAll("a", href=True):
    print(link['href'])
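
The code above is for Python 2.7, as in the question. As a rough sketch only: on Python 3, urllib2 no longer exists and Request lives in urllib.request. The helper name build_request, the "Mozilla/5.0" value and the shortened URL are assumptions for illustration, and no network call is made here.

```python
from urllib.request import Request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request carrying a non-default User-Agent header.

    build_request is a hypothetical helper name, not part of urllib.
    """
    return Request(url, headers={"User-Agent": user_agent})


# Illustrative URL only; substitute the real search URL.
req = build_request("https://www.google.pl/search?q=test")

# Request stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

Passing req to urllib.request.urlopen(req).read() would then fetch the page with that header, the same way the Python 2 code does.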

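Also note that bsObj.find("h3", {"class":"r"}) returns only the first matching h3, so that loop prints only the first result's link. If you want to select all the links on the page, use the following expression: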

links = bsObj.select("h3.r a[href]")
for link in links:
    print(link['href'])
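
To see the difference without hitting Google, here is a small sketch run against a hand-written HTML string. SAMPLE_HTML and the two helper function names are made up for this illustration, and the real Google markup may differ. It also guards against find() returning None, which is what caused the original error.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a results page (an assumption, not real Google output).
SAMPLE_HTML = """
<html><body>
<h3 class="r"><a href="/url?q=first">First</a></h3>
<h3 class="r"><a href="/url?q=second">Second</a></h3>
<h3 class="r"><a>No href here</a></h3>
</body></html>
"""


def first_result_links(soup):
    """Links inside the first h3.r only; empty list if there is no such h3."""
    heading = soup.find("h3", {"class": "r"})
    if heading is None:  # avoids AttributeError on NoneType
        return []
    return [a["href"] for a in heading.find_all("a", href=True)]


def all_result_links(soup):
    """Links inside every h3.r, using a CSS selector."""
    return [a["href"] for a in soup.select("h3.r a[href]")]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(first_result_links(soup))  # ['/url?q=first']
print(all_result_links(soup))    # ['/url?q=first', '/url?q=second']

# A page without any h3.r, e.g. what a blocked request might return.
empty = BeautifulSoup("<html><body><p>blocked</p></body></html>", "html.parser")
print(first_result_links(empty))  # []
```

select() returns an empty list when nothing matches, so the second form needs no None check.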