Scraping multiple URLs with Selenium and writing to JSON

Time: 2018-10-21 15:12:01

Tags: python python-3.x selenium selenium-webdriver

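I'm building a scraper with Selenium.

I've written the script and it scrapes correctly, but I'm trying to scrape multiple URLs and then write the results to JSON.

The script scrapes and prints successfully, but the JSON contains only one result: the details for the second URL (both results appear when printed).

How can I get the results for both URLs?

I think I need to add another for loop for the JSON data, but I can't figure out how to add it!

Here is the code I'm using: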

# -*- coding: UTF-8 -*-
from selenium import webdriver
import time
import json

def writeToJSONFile(path, fileName, data):
    filePathNameWExt = './' + path + '/' + fileName + '.json'
    with open(filePathNameWExt, 'a') as fp:
        json.dump(data, fp, ensure_ascii=False)

browser = webdriver.Firefox(executable_path="/Users/path/geckodriver")

urls = ['https://www.tripadvisor.co.uk/Restaurant_Review-g186338-d8122594-Reviews-Humble_Grape_Battersea-London_England.html','https://www.tripadvisor.co.uk/Restaurant_Review-g186338-d5561842-Reviews-Gastronhome-London_England.html']

data = {}
for url in urls:

    browser.get(url)
    page = browser.find_element_by_class_name('non_hotels_like')
    title = page.find_element_by_class_name('heading_title').text
    street_address = page.find_element_by_class_name('street-address').text

    print(title)
    print(street_address)


data = {}
data['title'] = title
data['street_address'] = street_address

filename = 'properties'
writeToJSONFile('./', filename, data)

browser.quit()

2 Answers:

Answer 0 (score: 3)

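You are trying to add values with the same keys to a dictionary, but a Python dictionary can only contain unique keys! So instead of writing the second title, you are just overwriting the first one. The same goes for street_address.

You can try saving the data as a list of dictionaries: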
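The overwriting can be reproduced without a browser. This is a minimal sketch that uses made-up placeholder values (not real scraped data) in place of the Selenium calls:

```python
# Placeholder values standing in for the scraped results (hypothetical data).
scraped = [
    ("Restaurant A", "1 Example Street"),
    ("Restaurant B", "2 Example Road"),
]

# Reusing the same two keys: each pass replaces the previous values.
single = {}
for title, street_address in scraped:
    single["title"] = title
    single["street_address"] = street_address

# Only the values from the last pass are left in the dictionary.
print(single)
```

Because both passes write to the keys 'title' and 'street_address', the dictionary never holds more than one record.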

data = []

for url in urls:
    browser.get(url)
    page = browser.find_element_by_class_name('non_hotels_like')
    title = page.find_element_by_class_name('heading_title').text
    street_address = page.find_element_by_class_name('street-address').text

    print(title)
    print(street_address)

    data.append({'title': title, 'street_address': street_address})
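Once the loop has filled the list, it can be written out in one go. The sketch below is an assumption-laden illustration, not the asker's exact helper: `write_json_file` is a hypothetical name, the records are placeholders, and a temporary directory is used so the example leaves no files behind. It opens the file in 'w' mode; with 'a' mode every run appends another JSON document, and a file containing several concatenated documents cannot be read back with a single json.load call.

```python
import json
import os
import tempfile

def write_json_file(directory, file_name, data):
    # Hypothetical helper: 'w' replaces the file on each run,
    # so it always holds exactly one JSON document.
    file_path = os.path.join(directory, file_name + ".json")
    with open(file_path, "w", encoding="utf-8") as fp:
        json.dump(data, fp, ensure_ascii=False, indent=2)
    return file_path

# Placeholder records in the same shape the loop above appends.
data = [
    {"title": "Restaurant A", "street_address": "1 Example Street"},
    {"title": "Restaurant B", "street_address": "2 Example Road"},
]

# Write to a temporary directory and read the file back to check it.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = write_json_file(tmp_dir, "properties", data)
    with open(path, encoding="utf-8") as fp:
        loaded = json.load(fp)

print(len(loaded))
```

In the real script, the call would go after the for loop, with the target directory in place of the temporary one.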

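Answer 1 (score: 1)

You are resetting the data variable after the loop...

So... what I did was use enumerate to add the iteration index and format it into the key...

Try this, it should work: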

from selenium import webdriver
import time
import json

def writeToJSONFile(path, fileName, data):
    filePathNameWExt = './' + path + '/' + fileName + '.json'
    with open(filePathNameWExt, 'a') as fp:
        json.dump(data, fp, ensure_ascii=False)

browser = webdriver.Firefox(executable_path="/Users/path/geckodriver")

urls = ['https://www.tripadvisor.co.uk/Restaurant_Review-g186338-d8122594-Reviews-Humble_Grape_Battersea-London_England.html','https://www.tripadvisor.co.uk/Restaurant_Review-g186338-d5561842-Reviews-Gastronhome-London_England.html']

data = {}
for i, url in enumerate(urls):

    browser.get(url)
    page = browser.find_element_by_class_name('non_hotels_like')
    title = page.find_element_by_class_name('heading_title').text
    street_address = page.find_element_by_class_name('street-address').text
    # f-string formatting is supported from Python 3.6+; you can use another formatting style... (for a cleaner result use a list, see the accepted answer)
    data[f'{i}title'] = title
    data[f'{i}street_address'] = street_address

    print(title)
    print(street_address)


filename = 'properties'
writeToJSONFile('./', filename, data)

browser.quit()
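For reference, here is a sketch of what the index-prefixed keys produce. It uses placeholder values in place of the scraped text, and `scraped` is a hypothetical name:

```python
import json

# Placeholder values standing in for the scraped results (hypothetical data).
scraped = [
    ("Restaurant A", "1 Example Street"),
    ("Restaurant B", "2 Example Road"),
]

data = {}
for i, (title, street_address) in enumerate(scraped):
    # The index makes every key unique, so nothing is overwritten.
    data[f"{i}title"] = title
    data[f"{i}street_address"] = street_address

print(json.dumps(data, ensure_ascii=False))
```

All four values end up in one flat object, under the keys 0title, 0street_address, 1title and 1street_address.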

Hope this helps!