Scraping specific elements from Understat.com

Time: 2019-02-14 21:33:33

Tags: python web-scraping

I want to retrieve a specific statistic (PPDA) for multiple matches on this site:

https://understat.com/match/xxxx

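I wrote the code below to parse the HTML and loop over each match in Python, but I am struggling to extract the specific statistic and load it into a CSV file and a chart. I am a beginner, so any help would be much appreciated!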

Code:

import pandas as pd
import re
import random
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import datetime
import csv

for i in range(9577, 9807):
    ppda_url = 'https://understat.com/match/' + str(i)
    # First attempt: fetch the page with requests
    ppda_data = requests.get(ppda_url)
    ppda_html = ppda_data.content
    soup = BeautifulSoup(ppda_html, 'lxml')
    # Second attempt: load the same page in a browser with Selenium
    options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(options=options)
    driver.get(ppda_url)
    soup = BeautifulSoup(driver.page_source, 'lxml')

1 Answer:

Answer 0 (score: 0)

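To extract the data with BeautifulSoup and write it to a CSV file, first find the div element whose text is PPDA. Then find the next div element with the class progress-value, and the next div with that class after it; these two divs hold the home and away values. Write them to a CSV file like this: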

import requests
from bs4 import BeautifulSoup
import csv

with open('ppda.csv', 'w', newline='') as csvfile:
    # Create the writer once, outside the loop
    writer = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for i in range(9577, 9807):
        ppda_url = 'https://understat.com/match/' + str(i)
        ppda_data = requests.get(ppda_url)
        soup = BeautifulSoup(ppda_data.content, 'lxml')
        # Locate the div labelled "PPDA"; skip pages that do not have it
        ppda = soup.find('div', string='PPDA')
        if ppda is None:
            continue
        # The next two "progress-value" divs hold the home and away figures
        home = ppda.find_next('div', {'class': 'progress-value'})
        away = home.find_next('div', {'class': 'progress-value'})
        print(home.text, away.text)
        writer.writerow([home.text, away.text])
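
The lookup described in this answer can be tried offline before hitting the site. The following is only a minimal sketch: the HTML fragment, the class names other than progress-value, the numbers in it, and the helper name extract_ppda are all made up for illustration, and the real Understat markup may differ. It uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the layout the answer relies on.
# The values and the wrapper class names are placeholders, not real data.
SAMPLE_HTML = """
<div class="progress-bar">
  <div class="progress-title">PPDA</div>
  <div class="progress-home"><div class="progress-value">8.25</div></div>
  <div class="progress-away"><div class="progress-value">12.40</div></div>
</div>
"""


def extract_ppda(html):
    """Return (home, away) as floats, or None if the expected divs are missing."""
    soup = BeautifulSoup(html, "html.parser")
    # Step 1: the div whose only text is exactly "PPDA"
    label = soup.find("div", string="PPDA")
    if label is None:
        return None
    # Step 2: the first progress-value div after the label (home side)
    home = label.find_next("div", class_="progress-value")
    if home is None:
        return None
    # Step 3: the next progress-value div after that one (away side)
    away = home.find_next("div", class_="progress-value")
    if away is None:
        return None
    return float(home.get_text(strip=True)), float(away.get_text(strip=True))


print(extract_ppda(SAMPLE_HTML))
print(extract_ppda("<div>no stats here</div>"))
```

Returning None instead of raising lets a loop over many match ids skip pages where the statistic is absent.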

To plot a chart, matplotlib is a good place to start.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt

rows = []
for i in range(9577, 9807):
    ppda_url = 'https://understat.com/match/' + str(i)
    ppda_data = requests.get(ppda_url)
    soup = BeautifulSoup(ppda_data.content, 'lxml')
    # Locate the div labelled "PPDA"; skip pages that do not have it
    ppda = soup.find('div', string='PPDA')
    if ppda is None:
        continue
    home = ppda.find_next('div', {'class': 'progress-value'})
    away = home.find_next('div', {'class': 'progress-value'})
    print(home.text, away.text)
    rows.append({'HOME': float(home.text), 'AWAY': float(away.text)})
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
df = pd.DataFrame(rows, columns=['HOME', 'AWAY'])
df.to_csv("ppda2.csv", encoding='utf-8', index=False)
df.plot.bar()
plt.show()
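
The script above opens an interactive window through plt.show(). If it has to run somewhere without a display, one option is to save the chart to a file instead. This is a rough sketch under assumptions: the two rows are placeholder numbers rather than scraped values, and the output name ppda_example.png is arbitrary.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values for illustration only; these are not real Understat data.
rows = [
    {"HOME": 8.0, "AWAY": 12.5},
    {"HOME": 10.0, "AWAY": 9.5},
]
df = pd.DataFrame(rows, columns=["HOME", "AWAY"])

# One group of bars per row (match), one bar per column (HOME / AWAY)
ax = df.plot.bar()
ax.set_xlabel("Match (row index)")
ax.set_ylabel("PPDA")

# Write the figure to disk instead of showing it, then release it
ax.figure.savefig("ppda_example.png")
plt.close(ax.figure)
```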

Output: a CSV file and a chart
