Scraping different dates with a loop

Time: 2019-05-05 07:00:30

Tags: python selenium web-scraping beautifulsoup

I have Python code that scrapes a single page from a football results and odds website:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs
import pandas as pd
import copy
import numpy as np

results = []

d = webdriver.Chrome(executable_path = r'C:\chromedriver_win32\chromedriver.exe')

u = 'https://1x2.lucksport.com/result_en.shtml?dt=' + '2019-05-02' + '&cid=156'

d.get(u)
WebDriverWait(d, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#odds_tb tr[class]")))
soup = bs(d.page_source, 'lxml')
rows = soup.select('#odds_tb tr[class]')

headers = ['Comp', 'Time', 'Match' ,'Odds', 'H', 'A', 'Res']
i = 1
for row in rows[1:]:    
    cols = [td.text for td in row.select('td')]

    if (i % 2 == 1):
        record = {'Comp' : cols[0],
                  'Time' : cols[1],
                  'Match' : ' v '.join([cols[2], cols[6]]),
                  'Odds' : 'op',
                  'H' : cols[3], 
                  'A' : cols[5],
                  'Res' : cols[7]}
    else:
        record['Odds'] = 'cl'
        record['H'] = cols[0] 
        record['A'] = cols[2]
    results.append(copy.deepcopy(record))
    i+=1

df = pd.DataFrame(results, columns = headers)
d.quit()

I want to loop over and scrape all previous dates in a given range (for example, the last month), so I built a list of dates to use in the loop:

D =  datetime.datetime.now().date()
date_list = [D - datetime.timedelta(days=x) for x in range(0, 30)]
dates = []
for i in date_list:
    date = str(i)
    dates.append(date)

Then I tried to write a loop that I hoped would return a data frame with the data for all of those previous dates:

results = []

for date in dates:
    d = webdriver.Chrome(executable_path = r'C:\chromedriver_win32\chromedriver.exe')

    u = 'https://1x2.lucksport.com/result_en.shtml?dt=' + date + '&cid=156'
    i = 1
    d.get(u)
    WebDriverWait(d, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#odds_tb tr[class]")))
    soup = bs(d.page_source, 'lxml')
    rows = soup.select('#odds_tb tr[class]')

    headers = ['Comp', 'Time', 'Match' ,'Odds', 'H', 'A', 'Res', 'Date']

    for row in rows[1:]:    
        cols = [td.text for td in row.select('td')]

        if (i % 2 == 1):
            record = {'Comp' : cols[0],
                      'Time' : cols[1],
                      'Match' : ' v '.join([cols[2], cols[6]]),
                      'Odds' : 'op',
                      'H' : cols[3], 
                      'A' : cols[5],
                      'Res' : cols[7],
                     'Date' : date}
        else:
            record['Odds'] = 'cl'
            record['H'] = cols[0] 
            record['A'] = cols[2]
        results.append(copy.deepcopy(record))
        i+=1
        d.quit()

df = pd.DataFrame(results, columns = headers)

But it returns an error:

TypeError                                 Traceback (most recent call last)
<ipython-input-6-0668d7389fc6> in <module>
     33         cols = [td.text for td in row.select('td')]
     34 
---> 35         if (i % 2 == 1):
     36             record = {'Comp' : cols[0],
     37                       'Time' : cols[1],

TypeError: unsupported operand type(s) for %: 'datetime.date' and 'int'

1 answer:

Answer 0 (score: 2)

i is of type datetime.date because of this code:

D =  datetime.datetime.now().date()
date_list = [D - datetime.timedelta(days=x) for x in range(0, 30)]
dates = []
for i in date_list:

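date_list is a list of datetime.date objects, so i is an element of that type from the list. After the loop finishes, i still holds the last date.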

Later you try to treat it as an int, hence your error:

if (i % 2 == 1):
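The following is a minimal sketch that reproduces the error outside the scraper. The single hard-coded date is only an illustration, not taken from the question. It shows that a Python for loop leaves its variable bound after the loop ends, so a later i % 2 operates on a date.

```python
import datetime

i = 1  # intended as an integer row counter

# Reusing the name i as the loop variable rebinds it to each date in turn.
for i in [datetime.date(2019, 5, 2)]:
    pass

# After the loop, i is still the last datetime.date, not an int.
try:
    i % 2
except TypeError as exc:
    message = str(exc)

print(type(i).__name__)
print(message)
```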

To fix it, use a different loop counter variable, or rename the variable used to iterate over date_list, for example:

import datetime
D =  datetime.datetime.now().date()
date_list = [D - datetime.timedelta(days=x) for x in range(0, 30)]
dates = []
i = 1
for iDate in date_list:
    if (i % 2 == 1):
        print(i)
    i+=1
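As an alternative to maintaining the counter by hand, enumerate can supply it. This is only a sketch of the same idea: the names today, day and odd_positions are illustrative and do not come from the question.

```python
import datetime

today = datetime.date.today()
date_list = [today - datetime.timedelta(days=x) for x in range(30)]

# isoformat() yields 'YYYY-MM-DD', the same text str() gives for a date.
dates = [day.isoformat() for day in date_list]

odd_positions = []
# enumerate provides an integer counter, so the date variable never
# has to share a name with it.
for i, day in enumerate(dates, start=1):
    if i % 2 == 1:
        odd_positions.append(i)

print(dates[0], dates[-1])
print(odd_positions)
```

Because i comes from enumerate and day holds the date, neither can overwrite the other.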

Side note:

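Your d.quit() is inside the for loop and can go after it, while d = webdriver.Chrome(executable_path = r'C:\chromedriver_win32\chromedriver.exe') can go before the loop. That way you use a single browser instance from start to finish.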
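Putting both points together, the loop might be restructured roughly as follows. This is a sketch under assumptions, not a tested scraper: rows_to_records and scrape are hypothetical helper names, the URL and selectors are copied from the question, and the executable_path argument mirrors the question's call (newer Selenium releases may expect the driver path to be passed differently).

```python
URL_TEMPLATE = 'https://1x2.lucksport.com/result_en.shtml?dt={}&cid=156'


def rows_to_records(rows, date):
    """Pair opening/closing odds rows (hypothetical helper).

    rows is a list of lists of cell texts, alternating between a full
    match row and a shorter closing-odds row, as in the question.
    """
    records = []
    record = None
    for i, cols in enumerate(rows, start=1):
        if i % 2 == 1:
            # Odd rows carry the match details and the opening odds.
            record = {'Comp': cols[0], 'Time': cols[1],
                      'Match': ' v '.join([cols[2], cols[6]]),
                      'Odds': 'op', 'H': cols[3], 'A': cols[5],
                      'Res': cols[7], 'Date': date}
        else:
            # Even rows only replace the odds; dict() makes a new copy.
            record = dict(record, Odds='cl', H=cols[0], A=cols[2])
        records.append(record)
    return records


def scrape(dates, driver_path):
    """Scrape every date with one browser instance (hypothetical helper)."""
    # Imported here so rows_to_records can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from bs4 import BeautifulSoup as bs

    d = webdriver.Chrome(executable_path=driver_path)  # created once
    results = []
    try:
        for date in dates:
            d.get(URL_TEMPLATE.format(date))
            WebDriverWait(d, 20).until(EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, '#odds_tb tr[class]')))
            soup = bs(d.page_source, 'lxml')
            rows = soup.select('#odds_tb tr[class]')[1:]
            cells = [[td.text for td in row.select('td')] for row in rows]
            results.extend(rows_to_records(cells, date))
    finally:
        d.quit()  # closed once, after all dates, even if a page fails
    return results
```

The result of scrape(dates, r'C:\chromedriver_win32\chromedriver.exe') could then be passed to pd.DataFrame as in the question. Note that WebDriverWait raises a TimeoutException when no matching rows appear within 20 seconds, so a date without results would need separate handling.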