Scraping HTML into a CSV file

Asked: 2016-02-13 20:13:15

Tags: python html csv web-scraping beautifulsoup

The code below scrapes data from the following page: http://www.gbgb.org.uk/resultsMeeting.aspx?id=136005

It scrapes all the relevant fields and prints them to the screen. However, I would like to write the data to a CSV file in tabular form, for export into a spreadsheet or database.

In the site's source HTML, the track, date, datetime (race time), grade, distance and prizes come from the div class "resultsBlockHeader", and they form the top section of each race card on the web page.

The body of each race in the source HTML comes from the div class "resultsBlock", which includes the finishing position (Fin), Greyhound, Trap, SP, Time/Sec and Time Distance.

The end result would look like this:

track,date,datetime,grade,distance,prize,fin,greyhound,trap,SP,timeSec,time distance

Is this possible, or do I have to print it to the screen in tabular form first in order to export it to CSV?

 from urllib import urlopen
 from bs4 import BeautifulSoup

 html = urlopen("http://www.gbgb.org.uk/resultsMeeting.aspx?id=136005")
 bsObj = BeautifulSoup(html, 'lxml')

 nameList = bsObj.findAll("div", {"class": "track"})
 for name in nameList:
     print(name.get_text())
 nameList = bsObj.findAll("div", {"class": "distance"})
 for name in nameList:
     print(name.get_text())
 nameList = bsObj.findAll("div", {"class": "prizes"})
 for name in nameList:
     print(name.get_text())
 nameList = bsObj.findAll("li", {"class": "first essential fin"})
 for name in nameList:
     print(name.get_text())
 nameList = bsObj.findAll("li", {"class": "essential greyhound"})
 for name in nameList:
     print(name.get_text())
 nameList = bsObj.findAll("li", {"class": "trap"})
 for name in nameList:
     print(name.get_text())
 nameList = bsObj.findAll("li", {"class": "sp"})
 for name in nameList:
     print(name.get_text())
 nameList = bsObj.findAll("li", {"class": "timeSec"})
 for name in nameList:
     print(name.get_text())
 nameList = bsObj.findAll("li", {"class": "timeDistance"})
 for name in nameList:
     print(name.get_text())

 nameList = bsObj.findAll("li", {"class": "essential trainer"})
 for name in nameList:
     print(name.get_text())

 nameList = bsObj.findAll("li", {"class": "first essential comment"})
 for name in nameList:
     print(name.get_text())

 nameList = bsObj.findAll("div", {"class": "resultsBlockFooter"})
 for name in nameList:
     print(name.get_text())

 nameList = bsObj.findAll("li", {"class": "first essential"})
 for name in nameList:
     print(name.get_text())

1 Answer:

Answer 0 (score: 1):

Not sure why you didn't follow the code suggested in this answer to your previous question; it actually solves the problem of grouping the fields.

Here is follow-up code that dumps track, date and greyhound to CSV:

import csv

from bs4 import BeautifulSoup
import requests


html = requests.get("http://www.gbgb.org.uk/resultsMeeting.aspx?id=135754").text
soup = BeautifulSoup(html, 'lxml')

rows = []
for header in soup.find_all("div", class_="resultsBlockHeader"):
    track = header.find("div", class_="track").get_text(strip=True).encode('ascii', 'ignore').strip("|")
    date = header.find("div", class_="date").get_text(strip=True).encode('ascii', 'ignore').strip("|")

    results = header.find_next_sibling("div", class_="resultsBlock").find_all("ul", class_="line1")
    for result in results:
        greyhound = result.find("li", class_="greyhound").get_text(strip=True)

        rows.append({
            "track": track,
            "date": date,
            "greyhound": greyhound
        })


with open("results.csv", "w") as f:
    writer = csv.DictWriter(f, ["track", "date", "greyhound"])

    for row in rows:
        writer.writerow(row)

The contents of results.csv after running the code:

Sheffield,02/02/16,Miss Eastwood
Sheffield,02/02/16,Sapphire Man
Sheffield,02/02/16,Swift Millican
...
Sheffield,02/02/16,Geelo Storm
Sheffield,02/02/16,Reflected Light
Sheffield,02/02/16,Boozed Flame

Note that I used requests here, but you could use urllib2 if you prefer.
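One caveat: the answer above targets Python 2, where `.encode('ascii', 'ignore').strip("|")` and a plain `open(..., "w")` work as shown. Under Python 3 the CSV-writing step might look like the sketch below (the sample rows are made up; `newline=''` avoids blank lines on Windows, and `writeheader()` is added so the file carries column names, which the original omits):

```python
import csv

# Hypothetical rows standing in for the scraped results.
rows = [
    {"track": "Sheffield", "date": "02/02/16", "greyhound": "Miss Eastwood"},
    {"track": "Sheffield", "date": "02/02/16", "greyhound": "Sapphire Man"},
]

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, ["track", "date", "greyhound"])
    writer.writeheader()       # emit the column names first
    writer.writerows(rows)     # then one line per race
```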