使用python BS4将抓取的数据写入CSV

时间:2018-10-01 12:36:19

标签: python csv web-scraping beautifulsoup html-parsing

我正在尝试从此Website抓取数据,在为每种类型的数据按列格式化内容时遇到了麻烦。例如,我有夏令时间,它是逐行写的,我希望日光成为列的标题,然后让其成为各个列的值,以此类推。 CSV输出如下:

"Dawn:"
"06:42"
"Sunrise:"
"07:16"
"Moonrise:"
"18:03"
""
"Dusk:"
"20:10"
"Sunset:�"
"19:36"
"Moonset:"
"01:55"
"Daylight:"
"13:28"
"Length:"
"12:20"
"Phase:"
"Waxing Gibbous"
"Temperature and Humidity "
"Temperature"
"7.9��C"
"Dew�Point "
"7.1��C"
"Windchill"
"7.4��C"
"Humidity"
"95%"
"Heat Index"
"7.9��C"
"Apparent Temperature"
"5.8��C"
"Solar Radiation"
"0�W/m�"
"Evapotranspiration Today"
"0.10�mm"
"Rainfall"
"Rainfall�Today"
"0.2�mm"
"Rainfall�Rate"
"0.0�mm/hr"
"Rainfall�This�Month"
"33.4�mm"
"Rainfall�This�Year"
"749.8�mm"
"Rainfall�Last Hour"
"0.2�mm"
"Last rainfall"
"2018-09-20 21:52"
"Wind"
"Wind�Speed�(gust)"
"12.2�kts"
"Wind�Speed�(avg)"
"4.1�kts"
"Wind Bearing"
"329� NNW"
"Beaufort�F2"
"Light breeze"
"Pressure"
"Barometer�"
"1000.14�mb"
"Rising quickly"
"1.28�mb/hr"
":now::gauges::today::yesterday::this�month::this�year::records::monthly�records::trends::forum::webcam:"

我的源代码是:

import bs4
import requests
from bs4 import BeautifulSoup
import uuid
import csv
import re
class corkHrb():
    def __init__(self):
        global homePage
        global downloadDir
        global filname
        downloadDir = "C:\\Users\\user\\PycharmProjects\\digitalOcean\\venv\\testDara\\"
        uFileName = str(uuid.uuid4())
        filname = downloadDir + uFileName + ".csv"
        homePage = requests.get("http://86.43.106.118/weather/cumulus/")


    def pageHtml(self):
        soup = BeautifulSoup(homePage.content, 'html.parser')
        uFileName = str(uuid.uuid4())
        filname = downloadDir + uFileName + ".csv"
        riverEstuaryTable = []
        data = []
        for table in soup.find_all('table'):
            for tableRecords in table.find_all('table'):
                for tableCells in tableRecords.find_all('td'):
                    data.append(tableCells.text.strip())
                print(data)
        for remTable in soup.find_all('table'):
            test = remTable
            secondData = []
        for t in test.find_all('tr'):
            for tCells in t.find_all('td'):
                secondData.append(tCells.text.strip('\t'))
        print(secondData)

        with open(filname, 'w', newline='' ) as f:
            writer = csv.writer(f,quoting=csv.QUOTE_ALL, escapechar=',', lineterminator='\n')
            for r in data:
                writer.writerow([r])
            for tre in secondData:
                writer.writerow([tre])




if __name__ == '__main__':
    objCall = corkHrb()
    objCall.pageHtml()

任何帮助将不胜感激,谢谢

1 个答案:

答案 0 :(得分:1)

您可以遍历td元素:

import requests
from bs4 import BeautifulSoup as soup
d = soup(requests.get('http://86.43.106.118/weather/cumulus/').text, 'html.parser')
new_data = [[[c.text for c in b.find_all('td')] for b in i.find_all('tr')] for i in d.find_all('table')]
_, *result = new_data
*new_results, footer = [list(filter(None, i)) for b in result for i in b]
grouped = [{c[i]:c[i+1] for i in range(0, len(c), 2)} for c in new_results if len(c) > 1]

输出:

[{'Dawn:': '06:42', 'Sunrise:': '07:16', 'Moonrise:': '18:03'}, {'Dusk:': '20:10', 'Sunset:\xa0': '19:36', 'Moonset:': '01:55'}, {'Daylight:': '13:28', 'Length:': '12:20', 'Phase:': 'Waxing Gibbous'}, {'Temperature': '7.9\xa0°C', 'Dew\xa0Point ': '7.1\xa0°C'}, {'Windchill': '7.4\xa0°C', 'Humidity': '95%'}, {'Heat Index': '7.9\xa0°C', 'Apparent Temperature': '5.8\xa0°C'}, {'Solar Radiation': '0\xa0W/m²', 'Evapotranspiration Today': '0.10\xa0mm'}, {'Rainfall\xa0Today': '0.2\xa0mm', 'Rainfall\xa0Rate': '0.0\xa0mm/hr'}, {'Rainfall\xa0This\xa0Month': '33.4\xa0mm', 'Rainfall\xa0This\xa0Year': '749.8\xa0mm'}, {'Rainfall\xa0Last Hour': '0.2\xa0mm', 'Last rainfall': '2018-09-20 21:52'}, {'Wind\xa0Speed\xa0(gust)': '12.2\xa0kts', 'Wind\xa0Speed\xa0(avg)': '4.1\xa0kts'}, {'Wind Bearing': '329° NNW', 'Beaufort\xa0F2': 'Light breeze'}, {'Barometer\xa0': '1000.14\xa0mb', 'Rising quickly': '1.28\xa0mb/hr'}]

然后,写到csv

import csv
headers = set([i for b in grouped for i in b])
with open('cork_weather.csv', 'w') as f:
  write = csv.writer(f)
  write.writerows([list(headers), *[[c.get(i, '') for i in headers] for c in grouped]])

输出:

Rainfall Today,Dusk:,Sunrise:,Dawn:,Humidity,Last rainfall,Rainfall Rate,Sunset: ,Heat Index,Phase:,Wind Speed (avg),Rising quickly,Temperature,Windchill,Rainfall Last Hour,Barometer ,Dew Point ,Rainfall This Year,Apparent Temperature,Daylight:,Beaufort F2,Moonrise:,Rainfall This Month,Length:,Evapotranspiration Today,Solar Radiation,Wind Speed (gust),Moonset:,Wind Bearing
,,07:16,06:42,,,,,,,,,,,,,,,,,,18:03,,,,,,,
,20:10,,,,,,19:36,,,,,,,,,,,,,,,,,,,,01:55,
,,,,,,,,,Waxing Gibbous,,,,,,,,,,13:28,,,,12:20,,,,,
,,,,,,,,,,,,7.9 °C,,,,7.1 °C,,,,,,,,,,,,
,,,,95%,,,,,,,,,7.4 °C,,,,,,,,,,,,,,,
,,,,,,,,7.9 °C,,,,,,,,,,5.8 °C,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,0.10 mm,0 W/m²,,,
0.2 mm,,,,,,0.0 mm/hr,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,749.8 mm,,,,,33.4 mm,,,,,,
,,,,,2018-09-20 21:52,,,,,,,,,0.2 mm,,,,,,,,,,,,,,
,,,,,,,,,,4.1 kts,,,,,,,,,,,,,,,,12.2 kts,,
,,,,,,,,,,,,,,,,,,,,Light breeze,,,,,,,,329° NNW
,,,,,,,,,,,1.28 mb/hr,,,,1000.14 mb,,,,,,,,,,,,,