Error when attempting web scraping

Asked: 2018-09-26 06:21:24

Tags: python python-3.x web-scraping jupyter-notebook export-to-csv

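I have been trying to scrape data and export it to a CSV file. I fetch data from multiple pages using two URLs, a main page URL and a second main page URL, with the following imports and setup: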

from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import urlparse, parse_qs
import csv

def get_page(url):
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    mainpage = response.read().decode('utf-8')
    return mainpage
# Placeholder URLs; urlopen() needs a full URL including the scheme (e.g. 'http://...').
mainpage = get_page('www.mainwebsite.com')
mainpage_parser = BeautifulSoup(mainpage,'html.parser')
secondpage = get_page('www.secondmainwebsite.com')
secondpage_parser = BeautifulSoup(secondpage,'html.parser')

The data has the same format on each page (for example, title and address), so for each class I use "find" or "find_all"; for example:

try:
    name = page_parser.find("h1", {"class": "xxx"}).find("a").get_text()
except AttributeError:
    # find() returned None because the element is missing on this page
    name = None
print(name)

This works. However, I cannot get "lat" and "lon" from the URL in this HTML element:

<img class="aaa" alt="map" data-track-id="static-map" width="97" height="142" src="https://www.website.com/aaaaaaa&map=StreetMapHD&width=194&height=284&lat=18.832687&lon=98.998473&level=15& returnImage=true">

The code I use to get the latitude and longitude is:

   for gps in secondpage_parser.find_all('img',{"class":"aaa"}, src=True):
      parsed_url = urlparse(gps['src'])
      mykeys = ['lat', 'lon']
      gpslocation = [parse_qs(parsed_url.query)[k][0] for k in mykeys]
   print(gps['src'], parse_qs(parsed_url.params))

However, the line "print(gps['src'], parse_qs(parsed_url.params))" raises a NameError: "NameError: name 'gps' is not defined".
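For reference, one way this message can arise is sketched below. It assumes the print really sits outside the loop body, as the indentation suggests, and that find_all() matched nothing on the second page. The list `matches` is a hypothetical stand-in for that empty result, so this is only a reproduction of the pattern, not of the actual page.

```python
matches = []  # hypothetical stand-in for an empty find_all() result

try:
    for gps in matches:
        pass          # body never runs, so `gps` is never assigned
    print(gps)        # runs after the loop, outside its body
except NameError as exc:
    message = str(exc)
    print("NameError:", message)

# Keeping the print inside the loop body avoids touching an unbound name.
seen = []
for gps in matches:
    seen.append(gps)
    print(gps)
print(len(seen), "matching <img> tags")
```

If the count printed at the end is 0 on the real page, the selector (tag name or class) is probably not matching anything in the downloaded HTML.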

I would like to know which part I got wrong and how to fix it. Please help.
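A separate point that may matter here: the sample src shown above contains no "?", so urlparse() would place the key=value pairs in the path rather than in the query, and parse_qs(parsed_url.query) would come back empty. The sketch below is one possible workaround under that assumption; `extract_lat_lon` and `sample_src` are hypothetical names introduced only for illustration.

```python
from urllib.parse import urlparse, parse_qs

def extract_lat_lon(src):
    """Return (lat, lon) as strings, or None if either key is missing."""
    parsed = urlparse(src)
    # Prefer the real query string; if it is empty (no '?' in the URL),
    # fall back to parsing the path, where the pairs end up instead.
    pairs = parse_qs(parsed.query) or parse_qs(parsed.path)
    try:
        return pairs['lat'][0], pairs['lon'][0]
    except KeyError:
        return None

# The src value copied from the <img> tag in the question.
sample_src = ("https://www.website.com/aaaaaaa&map=StreetMapHD&width=194"
              "&height=284&lat=18.832687&lon=98.998473&level=15& returnImage=true")

print(extract_lat_lon(sample_src))
```

Returning None instead of raising keeps a loop over many <img> tags going when one of them lacks coordinates; whether that is the right behaviour depends on how the CSV rows should look.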

0 answers:

No answers yet