Creating a Pandas DataFrame from web scraping results

Time: 2017-10-05 20:20:01

Tags: python pandas dataframe web-scraping beautifulsoup

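I am trying to scrape a table from ESPN and send the data to a pandas DataFrame so that I can export it to Excel. I have most of the scraping done, but I still can't figure out how to send each scraped tag to a unique DataFrame cell inside the for loop. (Code below.) Any ideas? Thanks!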

import requests
import urllib.request
from bs4 import BeautifulSoup
import re
import os
import csv
import pandas as pd

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup("http://www.espn.com/nba/statistics/player/_/stat/scoring-per-game/sort/avgPoints/qualified/false")

regex = re.compile("^[e-o]")

for record in soup.findAll('tr', {"class":regex}):
    for data in record.findAll('td'):
        print(data)

1 answer:

Answer 0 (score: 0)

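I was actually scraping sports websites recently for a daily fantasy sports algorithm for a class. Here is the script I wrote. Maybe this approach will work for you: build a dictionary, then convert it to a DataFrame.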
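For reference, pandas reads a dictionary of dictionaries as {column: {row: value}}, which is the shape the script below builds. A minimal sketch with made-up values (the player names, numbers, and header labels are placeholders, not scraped data):

```python
import pandas as pd

# Made-up values, only to show the shape: outer keys become columns,
# inner keys become the row index.
data = {
    0: {0: "Player A", 1: "Player B"},
    1: {0: "30.1", 1: "28.4"},
}

# Hypothetical header labels keyed by column position.
headers = {0: "PLAYER", 1: "PTS"}

# Build the frame, then swap the numeric column keys for readable names.
df = pd.DataFrame(data=data).rename(columns=headers)
print(df)
```

The rename step is optional; without it the columns are simply labelled 0, 1, and so on.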

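    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # {0} and {1} are placeholders for the yr and mode query values;
    # fill them in with url.format(...) before requesting
    url = "http://www.footballdb.com/stats/stats.html?lg=NFL&yr={0}&type=reg&mode={1}&limit=all"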

    result = requests.get(url)
    c = result.content

    # Parse the response as a Beautiful Soup object (explicit parser avoids a warning)
    soup = BeautifulSoup(c, "html.parser")

    # Go to the section of interest
    tables = soup.find("table",{'class':'statistics'})

    data = {}
    headers = {}
    for i, header in enumerate(tables.findAll('th')):
        data[i] = {}
        headers[i] = str(header.get_text())

    table = tables.find('tbody')
    for r, row in enumerate(table.select('tr')):
        for i, cell in enumerate(row.select('td')):
            try:
                data[i][r] = str(cell.get_text())
            except UnicodeEncodeError:
                # Python 2 fallback; strip_non_ascii is a helper not shown in this answer
                stat = strip_non_ascii(cell.get_text())
                data[i][r] = stat

    for i, name in enumerate(tables.select('tbody .left .hidden-xs a')):
        data[0][i] = str(name.get_text())

    df = pd.DataFrame(data=data)
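Applied to the loop in the question, a row-wise variant might look like the sketch below. It runs on a small inline table rather than the live page: SAMPLE_HTML, its class names, and its values are assumptions standing in for the ESPN markup, which may differ.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real ESPN markup may differ.
SAMPLE_HTML = """
<table>
  <tr class="colhead"><td>RK</td><td>PLAYER</td><td>PTS</td></tr>
  <tr class="oddrow"><td>1</td><td>Player A</td><td>30.1</td></tr>
  <tr class="evenrow"><td>2</td><td>Player B</td><td>28.4</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same filter as in the question: class names starting with a letter from e to o,
# which keeps "oddrow" and "evenrow" and skips "colhead".
row_class = re.compile("^[e-o]")

# Collect one list of cell texts per table row instead of printing each cell.
rows = []
for record in soup.find_all("tr", {"class": row_class}):
    rows.append([cell.get_text(strip=True) for cell in record.find_all("td")])

# Use the header row (assumed here to carry the class "colhead") for column names.
header_row = soup.find("tr", {"class": "colhead"})
columns = [cell.get_text(strip=True) for cell in header_row.find_all("td")]

df = pd.DataFrame(rows, columns=columns)
print(df)
```

From there, something like df.to_excel("stats.xlsx", index=False) should write the table to Excel; the file name is only an example, and pandas needs an Excel engine such as openpyxl installed for that call.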