My scraper only writes the last page to the CSV, not all of them

Time: 2017-09-04 08:15:41

Tags: python python-3.x csv selenium selenium-webdriver

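My scraper seems to write only the last page of the site to the CSV. I assume this is because it loops through all the pages and only then writes to the CSV. It does scrape the elements and print them to the console. Do you have to write each page to the CSV immediately inside the loop, because the data is not kept otherwise? I have tried adjusting my code to do that, but I can't get it to work.

Thanks in advance.

I also tried a different approach, but the same thing seems to happen: https://www.pastebin.ca/3863340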

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import csv
import requests
import time
from random import shuffle

driver = webdriver.Chrome()
driver.set_window_size(1024, 600)
driver.maximize_window()

driver.get('https://www.bookmaker.com.au/sports/soccer/')

SCROLL_PAUSE_TIME = 0.5


last_height = driver.execute_script("return document.body.scrollHeight")

while True:

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")


    time.sleep(SCROLL_PAUSE_TIME)


    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

time.sleep(1)

elements = driver.find_elements_by_css_selector(".market-match:nth-child(2) .market-group a , .market-match:nth-child(1) .market-group a")
elem_href1 = [element.get_attribute("href") for element in elements]
print(elem_href1)
print(len(elem_href1))
shuffle(elem_href1)
for link in elem_href1:
    driver.get(link)
    ...
    time.sleep(2)

    # link
    elems = driver.find_elements_by_css_selector("h3 a[Href*='/sports/soccer']")
    elem_href = []
    for elem in elems:
        print(elem.get_attribute("href"))
        elem_href.append(elem.get_attribute("href"))

    # TEAM
    langs = driver.find_elements_by_css_selector(".row:nth-child(1) td:nth-child(1)")
    langs_text = []

    for lang in langs:
        print(lang.text)
        langs_text.append(lang.text)

    time.sleep(0)

    # odds
    langs1 = driver.find_elements_by_css_selector("a.odds.quickbet")
    langs1_text = []

    for lang in langs1:
        print(lang.text)
        langs1_text.append(lang.text)

    time.sleep(0)

    with open('vtg12.csv', 'a', newline='') as outfile:
        writer = csv.writer(outfile)
        for row in zip(langs1_text, langs_text, elem_href):
            writer.writerow(row)

2 Answers:

Answer 0: (score: 2)

The problem is that you are overwriting the CSV on every iteration, so only the last record is left when the script finishes.

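Change

with open('vtg12.csv', 'a', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in zip(langs1_text, langs_text, elem_href):
        writer.writerow(row)

to

with open('vtg12.csv', 'a+', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in zip(langs1_text, langs_text, elem_href):
        writer.writerow(row)

a+ opens the file in append mode (and also allows reading), so rows from earlier iterations are kept. Plain 'a' appends as well; it is 'w' that truncates the file each time it is opened.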
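To make the difference between the modes concrete, here is a minimal sketch that needs no browser. The `pages` list is made-up data standing in for the rows scraped from three pages, and `write_pages` and `count_rows` are hypothetical helpers, not part of the question's script. It reopens the file once per page, as the question's loop does, first with 'w' and then with 'a'.

```python
import csv
import os
import tempfile

# Made-up stand-in for the rows scraped from three pages (an assumption).
pages = [
    [("1.50", "Team A"), ("2.60", "Team B")],
    [("1.90", "Team C")],
    [("3.10", "Team D"), ("1.25", "Team E")],
]


def write_pages(path, mode):
    """Reopen the file once per page, the way the question's loop does."""
    for rows in pages:
        with open(path, mode, newline="") as outfile:
            csv.writer(outfile).writerows(rows)


def count_rows(path):
    """Count the CSV rows that ended up in the file."""
    with open(path, newline="") as infile:
        return sum(1 for _ in csv.reader(infile))


tmp_dir = tempfile.mkdtemp()
w_path = os.path.join(tmp_dir, "mode_w.csv")
a_path = os.path.join(tmp_dir, "mode_a.csv")

write_pages(w_path, "w")  # truncates the file on every open
write_pages(a_path, "a")  # keeps what earlier pages wrote

rows_with_w = count_rows(w_path)  # only the last page survives
rows_with_a = count_rows(a_path)  # all pages are present
print(rows_with_w, rows_with_a)
```

With 'w' only the rows of the last page remain, which matches the symptom in the question. With 'a' every page's rows are in the file.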

Answer 1: (score: 2)

At the very top:

def append_to_csv(csv_list, output_filename):
    with open(output_filename, 'a', newline='') as fp:
        a = csv.writer(fp)
        data = [csv_list]
        a.writerows(data)

Then replace

with open('vtg12.csv', 'a', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in zip(langs1_text, langs_text, elem_href):
        writer.writerow(row)

with:

for row in zip(langs1_text, langs_text, elem_href):
    append_to_csv(row, 'vtg12.csv')
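A variation on the same idea, offered only as a sketch: open the output once, before the page loop, and write each page's rows as soon as they are scraped. `scrape_page` below is a hypothetical placeholder for the Selenium lookups in the question and returns made-up data; `scrape_all` is likewise an illustrative name. An in-memory buffer is used so the example runs without a browser or a real file.

```python
import csv
import io


def scrape_page(link):
    """Hypothetical placeholder for the Selenium lookups in the question.

    Returns three parallel lists: odds, team names, and match links.
    """
    return ["1.50", "2.60"], ["Team A", "Team B"], [link, link]


def scrape_all(links, outfile):
    """Write each page's rows to an already open file as they are scraped."""
    writer = csv.writer(outfile)
    written = 0
    for link in links:
        odds, teams, hrefs = scrape_page(link)
        for row in zip(odds, teams, hrefs):
            writer.writerow(row)
            written += 1
        outfile.flush()  # push this page's rows out before moving on
    return written


# In-memory buffer so the sketch runs without touching the disk.
buffer = io.StringIO()
total = scrape_all(["page-1", "page-2", "page-3"], buffer)
print(total)
```

With a real file this would be called as `with open('vtg12.csv', 'w', newline='') as outfile: scrape_all(elem_href1, outfile)`. Because the file is opened only once, the choice between 'w' and 'a' then only decides whether rows from a previous run are kept.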