Dynamic numpy reshape of a web-scraped table in Python

Time: 2020-05-05 19:35:24

Tags: python pandas numpy

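I can't get a usable data frame when scraping a website. I know I need to turn the list into a list of lists, which is easy with a static data frame. The trouble is that the data I scrape changes every day, and I want to build the data frame automatically. First, I scrape the data: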

### Libraries/packages
import pandas as pd
import numpy as np
import re
import requests
import datetime
from datetime import datetime
import urllib
from selenium import webdriver
from selenium.webdriver.chrome.options import Options 
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup


### Function 1
def strava_page():

    urllist = ['https://www.strava.com/login',
               'https://www.strava.com/clubs/roosevelt-island-dc-parkrun']

    return urllist

### Function 2
def strava_login(urllist):

    # navigate to page
    driver = webdriver.Chrome(executable_path = r"/Users/user/Documents/chromedriver")
    driver.get(urllist[1])

    # last week's leaderboard
    last_week = driver.find_element_by_css_selector('body > div.view > div.page.container > div:nth-child(4) > div.spans11 > div > div:nth-child(2) > ul > li:nth-child(1) > span')
    last_week.click()

    # getting rows from leaderboard
    table_rows = []
    myrow = []
    totalrows = len(driver.find_elements_by_xpath("//div[@class='leaderboard']/table/tbody//tr"))
    print("[Number of Rows in Leaderboard]:", totalrows)

    # gets individual rows, and puts each one into its own list
    for i in range(totalrows):
        myrow.clear()
        for items in driver.find_elements_by_xpath("//div[@class='leaderboard']/table/tbody//tr["+str(i+1)+"]/td"):
            myrow.append(items.text)
        table_rows.append(myrow)
        print(myrow)

    driver.close()

    # myrow variable is a list
    print(type(myrow))

    # column names
    my_columns = ['Rank', 'Athlete', 'Distance', 'Runs', 'Longest', 'Avg. Pace', 'Elev. Gain']


    # PROBLEM AREA *************
    new_table = pd.DataFrame(np.array(myrow).reshape(1, 7), columns = my_columns)

    return new_table

### Calling functions
one = strava_page()
two = strava_login(one)
two

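I keep getting the "cannot reshape array" size error. I believe numpy reshape is the right approach, but I can't get the output of myrow into a complete frame; that is, it only returns the last row of the frame: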

[Image: screenshot of the resulting data frame]

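What I want is every row of the table on the Strava page. How can I dynamically put every row into the table (the number of rows changes daily) without manually setting .reshape() each time I run the script?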

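For reference, here is a screenshot of the table. There are 7 columns, and the number of rows in the frame should match the number of rows in the table, even though that number changes from day to day: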

[Image: screenshot of the leaderboard table]

1 Answer:

Answer 0 (score: 0)

A relatively simple fix. All it took was stepping away from the work for a bit and then playing around with numpy outside the loop:

new_table = np.array(myrow).reshape(-1, 7)
previous_week = pd.DataFrame(new_table, columns = my_columns)

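I got rid of myrow.clear() so that myrow accumulates every cell from every row, and I returned previous_week instead of new_table. Once I discovered that np.reshape() accepts -1 for a dimension (numpy then infers the number of rows from the data), it worked like a charm.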
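To check this without a live scrape, here is a minimal sketch. The sample rows are made up and only stand in for the cell text Selenium would return, and cells_to_frame is a hypothetical helper name, not something from pandas or numpy. The second half reproduces the pitfall from the question and shows a list-of-lists alternative that needs no reshape at all.

```python
import numpy as np
import pandas as pd

MY_COLUMNS = ['Rank', 'Athlete', 'Distance', 'Runs', 'Longest', 'Avg. Pace', 'Elev. Gain']


def cells_to_frame(flat_cells, columns=MY_COLUMNS):
    """Turn a flat list of cell strings into a frame with len(columns) cells per row.

    Hypothetical helper for illustration only.
    """
    arr = np.array(flat_cells, dtype=object)
    if arr.size % len(columns) != 0:
        raise ValueError(
            f"{arr.size} cells cannot be split into rows of {len(columns)}"
        )
    # -1 tells numpy to infer the number of rows from the data.
    return pd.DataFrame(arr.reshape(-1, len(columns)), columns=columns)


# Made-up sample data (assumption): three leaderboard rows of seven cells each.
sample_rows = [
    ['1', 'Athlete A', '30.1 km', '4', '10.0 km', '5:30 /km', '120 m'],
    ['2', 'Athlete B', '21.4 km', '3', '8.2 km', '6:05 /km', '95 m'],
    ['3', 'Athlete C', '12.0 km', '2', '7.0 km', '5:50 /km', '40 m'],
]

# What myrow looks like once myrow.clear() is removed: one flat list of cells.
flat_cells = [cell for row in sample_rows for cell in row]
frame = cells_to_frame(flat_cells)
print(frame.shape)

# The pitfall in the question: one list object is cleared and re-appended,
# so every entry of the outer list refers to the same (last) row.
shared = []
aliased = []
for row in sample_rows:
    shared.clear()
    shared.extend(row)
    aliased.append(shared)

# Alternative: create a new list per row and skip numpy entirely.
fresh_rows = [list(row) for row in sample_rows]
frame_from_rows = pd.DataFrame(fresh_rows, columns=MY_COLUMNS)
```

With -1, the row count follows the data, so the call only fails when the number of cells is not a multiple of 7. In the question's loop, myrow.clear() meant myrow only ever held the last row, and table_rows held repeated references to that same list, which is why only one row showed up.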