Struggling to format scraped college football scores

Time: 2019-11-17 22:28:36

Tags: python pandas beautifulsoup

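Python newbie here. I'm trying to format scraped college football scores (from Massey Ratings) so that they can be imported into Excel. I need to create some headers ["Date", "Winner", "Score", "Loser", "Score"] and add some space between the columns for readability. From what I gather, a Pandas DataFrame could do this. Any help would be greatly appreciated.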

Here is my code so far:

import pandas as pd
from bs4 import BeautifulSoup
import urllib.request

address = 'https://www.masseyratings.com/scores.php?s=308075&sub=11604&dt=20191119'
response = urllib.request.urlopen(address)
html = response.read()

soup = BeautifulSoup(html,"html.parser")


table = soup.find("pre").get_text(strip=True)



print(table)

The output I get:

2019-11-16Southern Miss36 @UT San Antonio17           
2019-11-16 @Washington St49Stanford22           
2019-11-16TCU33 @Texas Tech31           
2019-11-16 @Temple29Tulane21           
2019-11-16Troy63 @Texas St27           
2019-11-16 @UAB37UTEP10           
2019-11-16 @Utah49UCLA3           
2019-11-16 @Utah St26Wyoming21           
2019-11-16 @Clemson52Wake Forest3           
2019-11-16 @Florida St49Alabama St12           
2019-11-16Virginia Tech45 @Georgia Tech0           
2019-11-16Ohio St56 @Rutgers21           
2019-11-16 @Iowa St23Texas21           
2019-11-16 @BYU42Idaho St10           
2019-11-19Ohio0 @Bowling Green0 Sch       
2019-11-19E Michigan0 @N Illinois0 Sch       

1 Answer:

Answer 0: (score: 0)

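Splitting the string might be a good idea, but for this particular page you can use regex patterns to extract the 4 team and score columns (the date is matched separately):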

import re, csv, requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.masseyratings.com/scores.php?s=308075&sub=11604&dt=20191119')
soup = bs(r.content, 'lxml')  # the 'lxml' parser requires the lxml package
# Team name: a run without digits or hyphens, followed by 3+ whitespace characters
p = re.compile(r'([^0-9-]+)\s{3,}')
# Score: a number with whitespace on both sides
p2 = re.compile(r'\s(\d+)\s')

# utf-8-sig writes a BOM so that Excel detects the encoding
with open("data.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter = ",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['Date','Winner','Score1','Loser','Score2'])

    # .text keeps the column padding; [:-4] skips the last 4 entries (page-specific)
    for line in soup.select_one('pre').text.split('\n')[:-4]:
        matches1 = p.findall(line)   # team names
        matches2 = p2.findall(line)  # scores
        row = [re.search(r'(\d{4}-\d{2}-\d{2})',line).group(0), matches1[0].strip(), matches2[0], matches1[1].strip(), matches2[1]]
        w.writerow(row)
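
As a variation on the same idea, here is a minimal sketch that pulls all five fields out of each line with one regex and skips any line that does not look like a game, instead of relying on a fixed slice such as [:-4]. It runs on a small hand-made sample rather than the live page. The sample layout, the footer line, and the names HEADERS, LINE_RE, parse_scores and rows_to_csv are all assumptions made for illustration, so the pattern may need adjusting against the real page text.

```python
import csv
import io
import re

# Column names taken from the answer above.
HEADERS = ["Date", "Winner", "Score1", "Loser", "Score2"]

# One pattern for the whole line: date, team, score, team, score.
# \s* (rather than \s+) also tolerates the collapsed spacing that
# get_text(strip=True) produces in the question's output.
LINE_RE = re.compile(
    r"^\s*(?P<date>\d{4}-\d{2}-\d{2})\s*"
    r"(?P<winner>.+?)\s*(?P<score1>\d+)\s*"
    r"(?P<loser>.+?)\s*(?P<score2>\d+)\b"
)


def parse_scores(text):
    """Return one [date, winner, score1, loser, score2] row per game line."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # blank line, footer, or anything that is not a game
        rows.append([m["date"], m["winner"], int(m["score1"]),
                     m["loser"], int(m["score2"])])
    return rows


def rows_to_csv(rows):
    """Serialize the rows, with a header line, to a CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(HEADERS)
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample modeled on the output shown in the question.
SAMPLE = """\
2019-11-16  Southern Miss           36 @UT San Antonio         17
2019-11-16 @Washington St           49  Stanford                22
2019-11-19  Ohio                     0 @Bowling Green            0 Sch

(made-up footer line that is not a game)
"""

if __name__ == "__main__":
    print(rows_to_csv(parse_scores(SAMPLE)))
```

Because the rows are plain lists, they should also fit the pandas route the question asks about: pandas.DataFrame(rows, columns=HEADERS) builds the table, and DataFrame.to_csv or DataFrame.to_excel writes it out (to_excel needs an Excel writer package such as openpyxl installed). The leading "@" is left on the team names here, as in the answer's code.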