I have a large dataset (see the sample format below) that I need to process.
Here is a sample input file named data:
564991.15 7371277.89 0 1 1530 1 1 16.0225
564991.15 7371277.89 0 1 8250 1 1 14.4405
564991.15 7371277.89 0 2 1530 1 1 14.8637
564991.15 7371277.89 0 2 8250 1 1 14.8918
564991.17 7371277.89 0 3 1530 1 1 16.0002
564991.17 7371277.89 0 3 8250 1 1 15.4333
564991.04 7371276.76 0 4 1530 1 1 14.73
564991.04 7371276.76 0 4 8250 1 1 15.6138
564991.04 7371276.76 0 5 1530 1 1 16.2453
564991.04 7371276.76 0 5 8250 1 1 15.6138
Here is the code I have so far (for now I finish the rest in Calc):
import os
import numpy as np
import pandas as pd
datadirectory = '/media/data'
os.chdir = 'datadirectory'
df = pd.read_csv('/media/data/data.dat')
sorted_data = df.groupby(["X.1","X.2","X.5"])["X.8"].mean().reset_index()
tuple_data = [tuple(x) for x in sorted_data.values]
datas = np.asarray(tuple_data)
np.savetxt('sorted_data_rounded.dat', datas, fmt='%s', delimiter='\t')
But it only gives me 4 columns, and the data are not rounded...
Answer 0 (score: 2)
Add one half and cast to int with astype:
df = pd.read_csv('data.dat', header=None, sep=r'\s+')  # raw string avoids an invalid-escape warning
In [2]: df
Out[2]:
           0           1  2  3     4  5  6        7
0  564991.15  7371277.89  0  1  1530  1  1  16.0225
1  564991.15  7371277.89  0  1  8250  1  1  14.4405
2  564991.15  7371277.89  0  2  1530  1  1  14.8637
3  564991.15  7371277.89  0  2  8250  1  1  14.8918
4  564991.17  7371277.89  0  3  1530  1  1  16.0002
5  564991.17  7371277.89  0  3  8250  1  1  15.4333
6  564991.04  7371276.76  0  4  1530  1  1  14.7300
7  564991.04  7371276.76  0  4  8250  1  1  15.6138
8  564991.04  7371276.76  0  5  1530  1  1  16.2453
9  564991.04  7371276.76  0  5  8250  1  1  15.6138
df1 = df.groupby([0, 1, 4])[7].mean().reset_index()
df1['ints'] = (df1[7] + 0.5).astype(int)
In [5]: df1
Out[5]:
           0           1     4         7  ints
0  564991.04  7371276.76  1530  15.48765    15
1  564991.04  7371276.76  8250  15.61380    16
2  564991.15  7371277.89  1530  15.44310    15
3  564991.15  7371277.89  8250  14.66615    15
4  564991.17  7371277.89  1530  16.00020    16
5  564991.17  7371277.89  8250  15.43330    15
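One caveat with adding one half and casting: astype(int) truncates toward zero, so the trick only rounds correctly for non-negative values. The sketch below compares it with two alternatives. The sample values and variable names are made up for illustration and are not taken from the data above.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to show the edge cases (halves and a negative)
s = pd.Series([14.4, 14.5, 15.5, -14.6])

# Add one half, then cast: the cast truncates toward zero
half_up_cast = (s + 0.5).astype(int)

# Add one half, then floor: also behaves for negative values
half_up_floor = np.floor(s + 0.5).astype(int)

# Series.round rounds exact halves to the nearest even value
half_even = s.round().astype(int)

print(half_up_cast.tolist())
print(half_up_floor.tolist())
print(half_even.tolist())
```

For data like the sample, where every mean is positive, the three variants differ only on exact halves.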
Note: you can use the DataFrame method to_csv to save the DataFrame.
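A minimal sketch of saving with to_csv, assuming a tab-separated file without header or index is wanted (as in the np.savetxt call from the question). The column labels and the two-decimal float_format are assumptions for illustration; the two rows are copied from the grouped output above.

```python
import io
import pandas as pd

# Two rows from the grouped result; the column labels are hypothetical
df1 = pd.DataFrame({
    'x': [564991.04, 564991.15],
    'y': [7371276.76, 7371277.89],
    'code': [1530, 8250],
    'mean': [15.48765, 14.66615],
})

# An in-memory buffer keeps the example self-contained;
# pass a file name such as 'sorted_data.dat' to write to disk instead
buf = io.StringIO()
df1.to_csv(buf, sep='\t', header=False, index=False, float_format='%.2f')
text = buf.getvalue()
print(text)
```

float_format only affects floating-point columns, so the integer column is written unchanged.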
Answer 1 (score: 1)
Use the round() function:
x = round(number, ndigits)  # number: the value to round; ndigits: decimal places to keep
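As a rough illustration, the built-in round() handles a single number, while a pandas Series has its own round method that works element-wise. The values below are means taken from the grouped output in Answer 0; the variable names are arbitrary.

```python
import pandas as pd

# Built-in round on a single float
single = round(15.48765, 2)

# Element-wise rounding of a whole column of means
means = pd.Series([15.48765, 15.6138, 14.66615])
rounded = means.round(2)

print(single)
print(rounded.tolist())
```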
Answer 2 (score: 0)
This code does exactly what I want:
import os
import numpy as np
import pandas as pd
datadirectory = '/media/DATA'
os.chdir(datadirectory)
# header=None gives integer column labels, so name the columns X.1 ... X.8 explicitly
names = ['X.%d' % i for i in range(1, 9)]
df = pd.read_csv('/media/DATA/data.dat', sep="\\s+", header=None, names=names)
df1 = df.groupby(["X.1","X.2","X.5"])["X.8"].mean().reset_index()
df1['X.3'] = df["X.3"] #add extra columns (aligned on the row index, not on the group)
df1['X.4'] = df["X.4"]
df1['X.6'] = df["X.6"]
df1['X.7'] = df["X.7"]
sorted_data = df1.reindex(sorted(df1.columns), axis=1) #rearrange data - method 1 (reindex_axis was removed in pandas 1.0)
tuple_data = [tuple(x) for x in sorted_data.values]
datas = np.asarray(tuple_data)
dfround = df.copy() #work on a copy so that df keeps the original coordinates
dfround['X.1'] = df["X.1"].astype(int) #astype(int) truncates the decimals
dfround['X.2'] = df["X.2"].astype(int)
df2 = dfround.groupby(["X.1","X.2","X.5"])["X.8"].mean().reset_index()
df2['X.3'] = df["X.3"] #add extra columns
df2['X.4'] = df["X.4"]
df2['X.6'] = df["X.6"]
df2['X.7'] = df["X.7"]
sorted_data2 = df2.sort_index(axis=1) #rearrange data - method 2
tuple_data2 = [tuple(x) for x in sorted_data2.values]
datas2 = np.asarray(tuple_data2)
np.savetxt('sorted_data.dat', datas, fmt='%s', delimiter='\t') #Save the data
np.savetxt('sorted_rounded_data.dat', datas2, fmt='%s', delimiter='\t') #Save the data
print('DONE')
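A more compact variant might keep all eight columns in a single groupby call. This is only a sketch under one assumption: that taking the first value of the remaining columns within each group is acceptable. It reads the first four sample rows from an in-memory string instead of data.dat, and the X.1 to X.8 labels are assigned by hand.

```python
import io
import pandas as pd

# First four rows of the sample input, inlined to keep the example self-contained
sample = """\
564991.15 7371277.89 0 1 1530 1 1 16.0225
564991.15 7371277.89 0 1 8250 1 1 14.4405
564991.15 7371277.89 0 2 1530 1 1 14.8637
564991.15 7371277.89 0 2 8250 1 1 14.8918
"""
cols = [f"X.{i}" for i in range(1, 9)]
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, names=cols)

# Average X.8 per (X.1, X.2, X.5) group; keep the first value of the other columns (assumption)
result = df.groupby(["X.1", "X.2", "X.5"], as_index=False).agg(
    {"X.3": "first", "X.4": "first", "X.6": "first", "X.7": "first", "X.8": "mean"}
)
result = result[cols]                    # restore the original column order
result["X.8"] = result["X.8"].round(2)   # round the means to two decimals
print(result)
```

Unlike assigning df["X.3"] to the grouped frame, which aligns on the row index, the extra columns here come from rows that belong to the same group.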