I'm trying to clean a text file from a URL using pandas. My idea is to split it into separate columns, add 3 more columns, and export the result to CSV.
I've already tried cleaning the file (I believe it is delimited by double spaces), and so far nothing has worked.
# script to check and clean the text file for the 'aberporth' station
import io
import pandas as pd
import requests

# api-endpoint for the historical weather data
URLH = "https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/aberporthdata.txt"

with requests.Session() as s:
    # sending GET for the history text file
    r = s.get(URLH)

df1 = pd.read_csv(io.StringIO(r.text), sep=" ", skiprows=5, error_bad_lines=False)
df2 = pd.read_csv(io.StringIO(r.text), nrows=1)
# df1['location'] = df2.columns.values[0]
# _, lat, _, lon = df2.index[0][1].split()
# df1['lat'], df1['lon'] = lat, lon
df1 = df1.dropna(how='all')  # dropna returns a new frame, so assign the result
df1.to_csv('Aberporth.txt', sep='|', index=True)
To make matters worse, the file itself has uneven columns, and somewhere past line 944 it gains one more column, which I can skip over to reduce the number of bad-line errors. At this point I'm lost as to how I should proceed, and whether I should be looking at something other than pandas.
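For what it's worth, pandas can cope with the uneven spacing if runs of whitespace are treated as a single separator via `sep=r"\s+"`. A minimal sketch on inline sample data (the sample rows here are an assumption modeled on the file's layout, standing in for the downloaded text):

```python
import io
import pandas as pd

# hypothetical inline sample mimicking the station file's layout
# (whitespace-separated, with '---' marking missing values)
sample = """\
yyyy mm tmax tmin af rain sun
1942 2 4.2 -0.6 --- 13.8 80.3
1942 3 9.7 3.7 --- 58.0 117.9
"""

# sep=r"\s+" treats any run of whitespace as one delimiter,
# and na_values turns the '---' missing-data marker into NaN
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", na_values=["---"])
print(df.shape)  # (2, 7)
```

The same `sep=r"\s+"` call applied to `io.StringIO(r.text)` (with the appropriate `skiprows`) would sidestep the double-space guessing entirely.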
Answer 0 (score: 1)
You don't actually need pandas. The built-in csv module works just fine.
The data is in fixed-width format (which is not the same as "delimited format"):
Aberporth
Location: 224100E 252100N, Lat 52.139 Lon -4.570, 133 metres amsl
Estimated data is marked with a * after the value.
Missing data (more than 2 days missing in month) is marked by ---.
Sunshine data taken from an automatic Kipp & Zonen sensor marked with a #,
otherwise sunshine data taken from a Campbell Stokes recorder.
   yyyy  mm   tmax    tmin      af    rain     sun
              degC    degC    days      mm   hours
   1942   2    4.2    -0.6     ---    13.8    80.3
   1942   3    9.7     3.7     ---    58.0   117.9
   ...
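Since the layout really is fixed-width, pandas' `read_fwf` (rather than `read_csv`) can also parse it directly, if you do want to stay with pandas. A sketch on an inline sample mimicking the file's aligned columns (the sample is an assumption; column boundaries are inferred from the whitespace):

```python
import io
import pandas as pd

# inline sample laid out in fixed-width columns like the station file
sample = (
    "   yyyy  mm   tmax    tmin      af    rain     sun\n"
    "   1942   2    4.2    -0.6     ---    13.8    80.3\n"
    "   1942   3    9.7     3.7     ---    58.0   117.9\n"
)

# read_fwf infers column boundaries from the whitespace layout by default;
# na_values turns the '---' missing-data marker into NaN
df = pd.read_fwf(io.StringIO(sample), na_values=["---"])
print(list(df.columns))  # ['yyyy', 'mm', 'tmax', 'tmin', 'af', 'rain', 'sun']
```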
So we can either split it at predefined indexes (which we would have to count out and hard-code, and which might change), or we can use a regular expression to split on "multiple spaces", in which case it doesn't matter where exactly the column boundaries are:
import requests
import re
import csv

def get_values(url):
    resp = requests.get(url)
    for line in resp.text.splitlines():
        values = re.split(r"\s+", line.strip())
        # skip all lines that do not have a year as first item
        if not re.match(r"^\d{4}$", values[0]):
            continue
        # replace all '---' by None
        values = [None if v == '---' else v for v in values]
        yield values

url = "https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/aberporthdata.txt"
with open('out.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(get_values(url))
If you want a header row, you can call writer.writerow(['yyyy','mm','tmax','tmin','af','rain','sun']) before the writerows call.