我有一个很大的.csv文件(2.3G),必须将其读入pandas数据框。
start_date,wind_90.0_0.0,wind_90.0_5.0,wind_87.5_2.5
1948-01-01,15030.64,15040.64,16526.35
1948-01-02,15050.14,15049.28,16526.28
1948-01-03,15076.71,15075.0,16525.28
我想将以上数据处理为以下结构:
start_date lat lon wind
0 1948-01-01 90.0 0.0 15030.64
1 1948-01-01 90.0 5.0 15040.64
2 1948-01-01 87.5 2.5 16526.35
3 1948-01-02 90.0 0.0 15050.14
4 1948-01-02 90.0 5.0 15049.28
5 1948-01-02 87.5 2.5 16526.28
6 1948-01-03 90.0 0.0 15076.71
7 1948-01-03 90.0 5.0 15075.0
8 1948-01-03 87.5 2.5 16525.28
到目前为止,我所拥有的代码可以满足我的要求,但是速度太慢并且占用大量内存。
def load_data_as_pandas(fileName, featureName):
df = pd.read_csv(fileName)
df = pd.melt(df, id_vars = df.columns[0])
df['lat'] = df['variable'].str.split('_').str[-2]
df['lon'] = df['variable'].str.split('_').str[-1]
df = df.drop('variable', axis=1)
df.columns = ['start_date', featureName,'lat','lon']
df = df.groupby(['start_date','lat','lon']).first()
df = df.reset_index()
df['start_date'] = pd.to_datetime(df['start_date'], format='%Y-%m-%d', errors='coerce')
return df
答案 0 :(得分:0)
这应该占用您的代码:
我们可以使用melt
从整个范围到整个范围取消透视。然后,在列名(值)上使用str.split
,并使用expand=True
为每个拆分获取一个新列。最后,我们将这些新创建的列返回到原始数据框:
melt = df.melt(id_vars='start_date').sort_values('start_date').reset_index(drop=True)
newcols = melt['variable'].str.split('_', expand=True).iloc[:, 1:].rename(columns={1:'lat', 2:'lon'})
final = melt.drop(columns='variable').join(newcols)
输出
start_date value lat lon
0 1948-01-01 15030.64 90.0 0.0
1 1948-01-01 15040.64 90.0 5.0
2 1948-01-01 16526.35 87.5 2.5
3 1948-01-02 15050.14 90.0 0.0
4 1948-01-02 15049.28 90.0 5.0
5 1948-01-02 16526.28 87.5 2.5
6 1948-01-03 15076.71 90.0 0.0
7 1948-01-03 15075.00 90.0 5.0
8 1948-01-03 16525.28 87.5 2.5
对80万行进行Timeit 测试:
3.55 s ± 347 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)