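Extracting data with BeautifulSoup and writing it to CSV

Time: 2016-07-26 07:09:56

Tags: python csv

As mentioned in my previous question, I am using Beautiful Soup with Python to retrieve weather data from a website.

Here is what the website looks like: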

<channel>
<title>2 Hour Forecast</title>
<source>Meteorological Services Singapore</source>
<description>2 Hour Forecast</description>
<item>
<title>Nowcast Table</title>
<category>Singapore Weather Conditions</category>
<forecastIssue date="18-07-2016" time="03:30 PM"/>
<validTime>3.30 pm to 5.30 pm</validTime>
<weatherForecast>
<area forecast="TL" lat="1.37500000" lon="103.83900000" name="Ang Mo Kio"/>
<area forecast="SH" lat="1.32100000" lon="103.92400000" name="Bedok"/>
<area forecast="TL" lat="1.35077200" lon="103.83900000" name="Bishan"/>
<area forecast="CL" lat="1.30400000" lon="103.70100000" name="Boon Lay"/>
<area forecast="CL" lat="1.35300000" lon="103.75400000" name="Bukit Batok"/>
<area forecast="CL" lat="1.27700000" lon="103.81900000" name="Bukit Merah"/>
</channel>

I managed to retrieve the information I need with this code:

import requests
from bs4 import BeautifulSoup
import urllib3

#getting the ValidTime

r = requests.get('http://www.nea.gov.sg/api/WebAPI/?dataset=2hr_nowcast&keyref=781CF461BB6606AD907750DFD1D07667C6E7C5141804F45D')
soup = BeautifulSoup(r.content, "xml")
time = soup.find('validTime').string
print "validTime: " + time

#getting the date

for currentdate in soup.find_all('item'):
    element = currentdate.find('forecastIssue')
    print "date: " + element['date']

#getting the time

for currentdate in soup.find_all('item'):
    element = currentdate.find('forecastIssue')
    print "time: " + element['time'] 

# collect the attribute dict of every <area> element once
area_attrs_li = [area.attrs for area in soup.find('weatherForecast').find_all('area')]
print area_attrs_li

Here are my results:

{'lat': u'1.34039000', 'lon': u'103.70500000', 'name': u'Jurong West',   
'forecast': u'LR'}, {'lat': u'1.31200000', 'lon': u'103.86200000', 'name':  
 u'Kallang', 'forecast': u'LR'},
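  1. How do I remove the u' from the results? I tried the methods I found by googling, but they do not seem to work.
  2. I am not strong in Python and have been stuck on this for quite a while.

    Edit: I tried this: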

    import csv

    f = open("C:\\scripts\\nea.csv", 'wt')

    try:
        writer = csv.writer(f)
        for area in area_attrs_li:
            # the whole list is written on every row, hence the repeated records
            writer.writerow( (time, element['date'], element['time'], area_attrs_li) )
    finally:
        f.close()

    print open("C:/scripts/nea.csv", 'rt').read()
    

    However, I would like the areas to be separated, because the records in the CSV are repeated:

    records in the CSV

    Thanks.

1 Answer:

Answer 0: (score: 1)

Edit 1 - Topic:

You are missing the escape characters:

C:\scripts>python neaweather.py
File "neaweather.py", line 30
writer.writerow( ('time', 'element['date']', 'element['time']', 'area_attrs_li') )
                                   ^
SyntaxError: invalid syntax

With the inner quotes escaped, the line parses:

writer.writerow( ('time', 'element[\'date\']', 'element[\'time\']', 'area_attrs_li') )

Edit 2:

If you want to insert the values:

writer.writerow( (time, element['date'], element['time'], area_attrs_li) )

Edit 3:

To split the results into separate rows:

for area in area_attrs_li:
    writer.writerow( (time, element['date'], element['time'], area) )
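
As a rough sketch of the same idea with one column per attribute, the loop could hand each area dict to csv.DictWriter. The helper name write_forecast_rows, its parameters, and the column order are made up for this illustration and are not part of the original script. The sample values are copied from the feed in the question, and Python 3 is assumed.

```python
import csv
import io

# Hypothetical helper: the name and parameters are illustrative only.
def write_forecast_rows(out, valid_time, issue_date, issue_time, areas):
    """Write one CSV row per area dict to the file-like object `out`."""
    fields = ['validTime', 'date', 'time', 'name', 'forecast', 'lat', 'lon']
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator='\n')
    writer.writeheader()
    for area in areas:
        row = {'validTime': valid_time, 'date': issue_date, 'time': issue_time}
        # copy only the attributes that should become columns
        for key in ('name', 'forecast', 'lat', 'lon'):
            row[key] = area[key]
        writer.writerow(row)

# Sample values copied from the feed shown in the question.
sample_areas = [
    {'forecast': 'TL', 'lat': '1.37500000', 'lon': '103.83900000', 'name': 'Ang Mo Kio'},
    {'forecast': 'SH', 'lat': '1.32100000', 'lon': '103.92400000', 'name': 'Bedok'},
]

# An in-memory buffer stands in for the real file in this sketch.
buf = io.StringIO()
write_forecast_rows(buf, '3.30 pm to 5.30 pm', '18-07-2016', '03:30 PM', sample_areas)
csv_text = buf.getvalue()
print(csv_text)
```

In the real script, the buffer would presumably be replaced by the opened CSV file and sample_areas by area_attrs_li.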

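Edit 4: This splitting is by no means exact, but it should give a better idea of how to parse and split the data so you can adapt it to your needs. To split the area element further, as shown in your image, you can parse it: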

for area in area_attrs_li:
    # area is a dict, so convert it to its text form before editing it
    area = str(area)

    # cut off the characters you don't need
    area = area.replace('[','')
    area = area.replace(']','')
    area = area.replace('{','')
    area = area.replace('}','')

    # replace the u' prefixes and the remaining single quotes with double quotes
    area = area.replace("u'","\"").replace("'","\"")

    # split the string into a list
    areaList = area.split(",")

    # create your own csv-separator
    ownRowElement = ';'.join(areaList)

    writer.writerow( (time, element['date'], element['time'], ownRowElement) )
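
The u' prefix is only how Python 2 displays a unicode string when a whole dict or list is turned into text. The individual values are ordinary strings, so it may be simpler to read them from the dict by key instead of cleaning up the printed form. A minimal sketch, assuming Python 3 and a hypothetical helper area_to_field whose key order is chosen arbitrarily for the illustration:

```python
# Hypothetical helper; the key order is an assumption made for this sketch.
def area_to_field(area, keys=('name', 'forecast', 'lat', 'lon')):
    """Join the selected values of one area dict with ';'."""
    return ';'.join(str(area[key]) for key in keys)

# Sample dict copied from the results shown in the question.
area = {'lat': '1.34039000', 'lon': '103.70500000', 'name': 'Jurong West', 'forecast': 'LR'}
field = area_to_field(area)
print(field)
```

The resulting string could take the place of ownRowElement in the writerow call, with no bracket or quote stripping needed.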

Off topic: this works for me:

import csv
import json

x="""[ 
    {'lat': u'1.34039000', 'lon': u'103.70500000', 'name': u'Jurong West','forecast': u'LR'}
]"""

jsontxt = json.loads(x.replace("u'","\"").replace("'","\""))

f = csv.writer(open("test.csv", "w+"))

# Write the CSV header; remove this line if you don't need it
f.writerow(['lat', 'lon', 'name', 'forecast'])

for jsontext in jsontxt:
    f.writerow([jsontext["lat"], 
                jsontext["lon"], 
                jsontext["name"], 
                jsontext["forecast"],
                ])
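
If the data really does arrive as the printed text of a Python list, replacing every single quote would also alter any apostrophe inside a value. One possible alternative, sketched here under the assumption of Python 3.3 or later (where the u prefix is accepted in string literals), is ast.literal_eval from the standard library, which parses the text as Python literals without any replacing:

```python
import ast

# The same sample text as in the snippet above.
text = """[
    {'lat': u'1.34039000', 'lon': u'103.70500000', 'name': u'Jurong West', 'forecast': u'LR'}
]"""

# literal_eval only evaluates literals (strings, numbers, lists, dicts, ...)
records = ast.literal_eval(text)
first = records[0]
print(first['name'], first['forecast'])
```

The resulting list of dicts can be passed to the same csv writing loop as jsontxt.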