A typical portion of the file looks like this:
#ICXUAE05424 1909 04 22 99 0935 6 erac-hud 657000 -180000
30 -9999 -9999 100 -9999 -9999 -9999 120 49
30 -9999 -9999 350 -9999 -9999 -9999 119 110
30 -9999 -9999 750-9999 -9999 -9999 149 97
30 -9999 -9999 1250-9999 -9999 -9999 136 123
30 -9999 -9999 1750-9999 -9999 -9999 104 121
30 -9999 -9999 2250-9999 -9999 -9999 117 171
#ICXUAE05424 1909 04 22 99 1820 3 erac-hud 657000 -180000
30 -9999 -9999 100 -9999 -9999 -9999 120 53
30 -9999 -9999 350A -9999 -9999 -9999 111 69
30 -9999 -9999 750B-9999 -9999 -9999 102 55
#ICXUAE05424 1909 04 23 99 0845 5 erac-hud 657000 -180000
30 -9999 -9999 100 -9999 -9999 -9999 31 9
30 -9999 -9999 350 -9999 -9999 -9999 102 62
30 -9999 -9999 750 -9999 -9999 -9999 103 132
30 -9999 -9999 1250 -9999 -9999 -9999 98 120
30 -9999 -9999 1750 -9999 -9999 -9999 101 100
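I need to preprocess the data by appending some (or all) of the header attributes to the data rows that belong to that header, and then convert the result to a CSV file. How can I do this with sed in Linux bash or, failing that, with Python pandas? The output CSV file should look like this: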
lvl12,etime,press,gph,temp,rh,dpdp,wdir,wspd,hour,lattitude,longitude
21,-9999,96900A,234,270A,742,-9999,-9999,-9999,12,316333,748667
20,-9999,95000,-9999,290A,484,-9999,-9999,-9999,12,316333,748667
20,-9999,88700,-9999,290A,454,-9999,-9999,-9999,12,316333,748667
10,-9999,85000,1384A,260A,446,-9999,-9999,-9999,12,316333,748667
10,-9999,70000,3055A,130A,506,-9999,-9999,-9999,12,316333,748667
20,-9999,58400,-9999,0A,690,-9999,-9999,-9999,12,316333,748667
20,-9999,55900,-9999,0A,312,-9999,-9999,-9999,12,316333,748667
10,-9999,50000,5772A,-65A-9999,-9999,320,850,,12,316333,748667
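Additional information about the dataset:
Lines prefixed with # are headers; the lines that follow them are data.
The header attributes are:
station code, year, month, day, etc
The seventh space-separated attribute is npv, which gives the number of data rows that follow.
The data columns are: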
lvl12, etime, press, gph, temp, rh, dpdp, wdir, wspd
Answer 0 (score: 1)
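You need to parse the file manually, line by line, keeping track of where the headers are.
I am assuming that data such as 750-9999 actually contains a space, i.e. 750 -9999? If that is not the case, you would need to take a fixed-width approach instead.
This can be done with Python's csv library as follows: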
import csv

# Output column names: the nine data fields plus three header attributes.
columns = ["lvl12", "etime", "press", "gph", "temp", "rh", "dpdp", "wdir", "wspd", "hour", "lattitude", "longitude"]

with open('weather.txt', newline='') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input, delimiter=' ', skipinitialspace=True)
    csv_output = csv.writer(f_output)
    csv_output.writerow(columns)

    for row in csv_input:
        if not row:
            continue  # skip blank lines
        if row[0].startswith('#'):
            header = row  # remember the most recent header line
        else:
            # Append the hour (header[5]) and the last two header fields to the data row.
            csv_output.writerow(row + [header[5]] + header[-2:])
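If values such as 750-9999 really do appear without a space, the reader above will treat them as a single field. Short of a full fixed-width parser (which needs the exact column positions, and those are not given here), one possible workaround is to insert the missing space before splitting. The sketch below assumes that the only glued value is the -9999 marker; `split_fields` is a hypothetical helper name, not part of any library.

```python
import re

# Assumption: the only value that gets glued to the previous field is -9999.
# Match -9999 when it directly follows a non-space character and is not
# the start of a longer number.
_GLUED = re.compile(r'(?<=\S)-9999(?!\d)')


def split_fields(line):
    """Split a data line on whitespace after separating glued -9999 values."""
    return _GLUED.sub(' -9999', line).split()


print(split_fields('30 -9999 -9999 750-9999 -9999 -9999 149 97'))
print(split_fields('30 -9999 -9999 750B-9999 -9999 -9999 102 55'))
```

Each data line then yields nine fields, so `split_fields(line)` could stand in for the rows produced by `csv.reader` in the loop.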
Or, if you would also like to use pandas:
import csv
import pandas as pd

data = []

with open('weather.txt', newline='') as f_input:
    csv_input = csv.reader(f_input, delimiter=' ', skipinitialspace=True)

    for row in csv_input:
        if not row:
            continue  # skip blank lines
        if row[0].startswith('#'):
            header = row  # remember the most recent header line
        else:
            # Append the hour (header[5]) and the last two header fields to the data row.
            data.append(row + [header[5]] + header[-2:])

columns = ["lvl12", "etime", "press", "gph", "temp", "rh", "dpdp", "wdir", "wspd", "hour", "lattitude", "longitude"]
df = pd.DataFrame(data, columns=columns)
print(df)
Giving you:
lvl12 etime press gph temp rh dpdp wdir wspd hour lattitude longitude
0 30 -9999 -9999 100 -9999 -9999 -9999 120 49 0935 657000 -180000
1 30 -9999 -9999 350 -9999 -9999 -9999 119 110 0935 657000 -180000
2 30 -9999 -9999 750 -9999 -9999 -9999 149 97 0935 657000 -180000
3 30 -9999 -9999 1250 -9999 -9999 -9999 136 123 0935 657000 -180000
4 30 -9999 -9999 1750 -9999 -9999 -9999 104 121 0935 657000 -180000
.. ... ... ... ... ... ... ... ... ... ... ... ...
9 30 -9999 -9999 100 -9999 -9999 -9999 31 9 0845 657000 -180000
10 30 -9999 -9999 350 -9999 -9999 -9999 102 62 0845 657000 -180000
11 30 -9999 -9999 750 -9999 -9999 -9999 103 132 0845 657000 -180000
12 30 -9999 -9999 1250 -9999 -9999 -9999 98 120 0845 657000 -180000
13 30 -9999 -9999 1750 -9999 -9999 -9999 101 100 0845 657000 -180000
[14 rows x 12 columns]
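The pandas version only prints the DataFrame. Since the question asks for a CSV file, the frame would still need to be written out, which `DataFrame.to_csv` can do. The sketch below rebuilds a two-row frame from the output above so that it runs on its own; the file name `output.csv` in the final comment is an assumption carried over from the csv example.

```python
import pandas as pd

columns = ["lvl12", "etime", "press", "gph", "temp", "rh", "dpdp",
           "wdir", "wspd", "hour", "lattitude", "longitude"]

# Two rows copied from the printed output, kept as strings so that
# values such as the hour "0935" keep their leading zero.
data = [
    ["30", "-9999", "-9999", "100", "-9999", "-9999", "-9999", "120", "49", "0935", "657000", "-180000"],
    ["30", "-9999", "-9999", "350", "-9999", "-9999", "-9999", "119", "110", "0935", "657000", "-180000"],
]
df = pd.DataFrame(data, columns=columns)

# With no path argument, to_csv returns the CSV text as a string.
# index=False leaves out the 0, 1, 2, ... row labels.
csv_text = df.to_csv(index=False)
print(csv_text)

# To write a file instead: df.to_csv('output.csv', index=False)
```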
Tested with Python 3.x. If you are using Python 2.x, change the following:
with open('weather.txt', 'rb') as f_input, open('output.csv', 'wb') as f_output:
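As a sanity check on the parsing, the npv field described in the question (the seventh space-separated header attribute) can be compared with the number of data rows actually found under each header. This is only a sketch under that reading of the format; `check_npv` is a hypothetical helper name.

```python
def check_npv(lines):
    """Return (header, declared, actual) for every block whose row count differs from npv."""
    mismatches = []
    header, declared, actual = None, 0, 0

    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith('#'):
            # Close the previous block before starting a new one.
            if header is not None and declared != actual:
                mismatches.append((header, declared, actual))
            header = line.strip()
            declared = int(line.split()[6])  # seventh field is npv
            actual = 0
        else:
            actual += 1

    # Close the final block.
    if header is not None and declared != actual:
        mismatches.append((header, declared, actual))
    return mismatches


sample = [
    '#ICXUAE05424 1909 04 22 99 1820 3 erac-hud 657000 -180000',
    '30 -9999 -9999 100 -9999 -9999 -9999 120 53',
    '30 -9999 -9999 350A -9999 -9999 -9999 111 69',
    '30 -9999 -9999 750B-9999 -9999 -9999 102 55',
]
print(check_npv(sample))      # the block declares 3 rows and has 3
print(check_npv(sample[:3]))  # one row removed, so the block is reported
```

An empty result means every header's declared count matched; for a real file, `check_npv(open('weather.txt'))` would run the same check.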