I'm confused by bathymetry data collected from an echo-sounder. It looks like this:
ID No Time Lat Lon Alt East North Count Fix
LL 0 589105179.00 24.156741 -110.321346 -31.50 4898039.453 -3406895.053 9 2
ED 0 1.12 0.00
ED 0 1.53 0.00
ED 0 1.60 0.00
ED 0 1.08 0.00
ED 0 1.51 0.00
ED 0 1.06 0.00
LL 0 589105180.00 24.156741 -110.321346 -31.50 4898039.836 -3406894.045 9 2
ED 0 1.06 0.00
ED 0 1.12 0.00
ED 0 0.98 0.00
ED 0 0.96 0.00
ED 0 0.91 0.00
ED 0 0.90 0.00
LL 0 589105181.00 24.156741 -110.321346 -31.50 4898039.433 -3406894.003 9 2
ED 0 1.04 0.00
ED 0 1.04 0.00
ED 0 0.93 0.00
ED 0 0.99 0.00
ED 0 0.99 0.00
ED 0 1.01 0.00
LL 0 589105182.00 24.156741 -110.321346 -31.51 4898038.460 -3406894.841 9 2
ED 0 0.99 0.00
ED 0 0.96 0.00
ED 0 0.96 0.00
ED 0 0.96 0.00
ED 0 0.98 0.00
ED 0 0.98 0.00
LL 0 589105183.00 24.156741 -110.321346 -31.51 4898039.804 -3406894.107 9 2
ED 0 1.01 0.00
ED 0 1.01 0.00
ED 0 0.91 0.00
ED 0 1.04 0.00
ED 0 1.04 0.00
ED 0 0.96 0.00
Each LL line gives the time (seconds since 2000), coordinates, heading, etc. for the ED depth measurements that follow it.
We need to compute the mean of each block of ED measurements and assign it to its LL line. The problem is that in the full file there are not always 6 ED values per block; sometimes there are 5 or 4.
So far I have done:
data = pd.read_csv('Echosounder.txt', sep='\t')
LLs = data[data['ID'] == 'LL']
EDs = data[data['ID'] == 'ED']
I like this because it respects the index order. I only noticed the varying number of ED measurements because, after doing:
EDs.groupby(np.arange(len(EDs))//6).mean()
and attaching the results to the LLs, the last LL rows ended up with no bathymetry value.
Please help.
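For illustration, here is a tiny synthetic example (made-up depths, not from the real file) of why fixed-size grouping misaligns once one ping has fewer values:

```python
import numpy as np
import pandas as pd

# Three pings with 3, 2 and 3 depth values respectively (variable block sizes)
depths = pd.Series([1.0, 1.2, 1.1, 0.9, 1.0, 0.8, 0.7, 0.9])

# Fixed-size grouping assumes every ping has exactly 3 values,
# so the second "group" wrongly pulls 0.8 in from the third ping:
fixed = depths.groupby(np.arange(len(depths)) // 3).mean()
print([round(v, 2) for v in fixed])  # [1.1, 0.9, 0.8] -- correct means would be [1.1, 0.95, 0.8]
```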
Answer 0 (score: 0)
It looks like the time in each LL row is unique, so you can use it as a grouping key. First, create a new grouping column equal to Time for all LL rows:
data.loc[data['ID']=='LL', 'key'] = data['Time']
Propagate the latest key value down to the ED rows:
data['key'] = data['key'].ffill()
Group by the new key and join the result with the LLs DataFrame:
LLs.set_index('Time')\
.join(data[data['ID']=='ED']\
.groupby('key').mean()[['No','Time','Lat']], rsuffix='_mean')
# ID No Lat ... No_mean Time Lat_mean
#Time ...
#589105179.0 LL 0 24.156741 ... 0 1.316667 0.0
#589105180.0 LL 0 24.156741 ... 0 0.988333 0.0
#589105181.0 LL 0 24.156741 ... 0 1.000000 0.0
#589105182.0 LL 0 24.156741 ... 0 0.971667 0.0
#589105183.0 LL 0 24.156741 ... 0 0.995000 0.0
The last three columns are the means.
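A self-contained sketch of this approach on synthetic rows (only the `ID` and `Time` columns, with the ED depth landing in `Time` exactly as in the parsed file, and deliberately unequal block sizes):

```python
import pandas as pd

# Synthetic frame in the same shape as the parsed file: for ED rows the
# depth value sits in the 'Time' column; block sizes vary (2 vs 3).
data = pd.DataFrame({
    'ID':   ['LL', 'ED', 'ED', 'LL', 'ED', 'ED', 'ED'],
    'Time': [100.0, 1.0, 2.0, 101.0, 0.5, 1.5, 1.0],
})

data.loc[data['ID'] == 'LL', 'key'] = data['Time']  # key only on LL rows
data['key'] = data['key'].ffill()                   # propagate down to ED rows

means = data[data['ID'] == 'ED'].groupby('key')['Time'].mean().rename('ED_mean')
result = data[data['ID'] == 'LL'].set_index('Time').join(means)
print(result['ED_mean'].tolist())  # [1.5, 1.0]
```

Note that this handles the 2-value and 3-value blocks symmetrically; no fixed block size is assumed anywhere.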
Answer 1 (score: 0)
DYZ has a nice answer. Alternatively, if you don't want to assume the timestamps are unique, you can use the index as the key in a similar way.
data['dummy'] = np.nan
data.loc[data['ID']=='LL', 'dummy'] = data.loc[data['ID']=='LL', 'dummy'].index
data['dummy'] = data['dummy'].ffill()
LLs.set_index('dummy')\
.join(data[data['ID']=='ED']\
.groupby('dummy').mean()[['No','Time','Lat']], rsuffix='_mean')
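A minimal sketch of the index-as-key variant on synthetic rows where the two LL timestamps collide, so `Time` alone could not serve as the grouping key:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'ID':   ['LL', 'ED', 'ED', 'LL', 'ED'],
    'Time': [100.0, 1.0, 2.0, 100.0, 3.0],  # duplicated LL timestamp
})
data['dummy'] = np.nan
data.loc[data['ID'] == 'LL', 'dummy'] = data.loc[data['ID'] == 'LL'].index
data['dummy'] = data['dummy'].ffill()  # each ED row now carries its LL row's index

means = data[data['ID'] == 'ED'].groupby('dummy')['Time'].mean()
print(means.tolist())  # [1.5, 3.0]
```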
Answer 2 (score: 0)
Another approach is to iterate over the rows and propagate each LL timestamp onto the ED rows that follow it.
import pandas as pd
df = pd.read_csv('data.csv', sep='\t', index_col=False)
df.head()  # note: the timestamp and ed_value columns in the output below are filled in by the loop further down
ID No Time Lat Lon Alt East North Count Fix timestamp ed_value
0 LL 0 5.891052e+08 24.156741 -110.321346 -31.5 4898039.453 -3406895.053 9.0 2.0 589105179.0 NaN
1 ED 0 1.120000e+00 0.000000 NaN NaN NaN NaN NaN NaN 589105179.0 1.12
2 ED 0 1.530000e+00 0.000000 NaN NaN NaN NaN NaN NaN 589105179.0 1.53
3 ED 0 1.600000e+00 0.000000 NaN NaN NaN NaN NaN NaN 589105179.0 1.60
4 ED 0 1.080000e+00 0.000000 NaN NaN NaN NaN NaN NaN 589105179.0 1.08
LLs = df[df['ID'] == 'LL']
EDs = df[df['ID'] == 'ED']
for x in df.iterrows():
    if x[1]['ID'] == 'LL':
        timestamp = x[1]['Time']
    elif x[1]['ID'] == 'ED':
        df.loc[x[0], 'ed_value'] = x[1]['Time']
        df.loc[x[0], 'timestamp'] = timestamp
df.groupby('timestamp').mean()
No Time Lat Lon Alt East North Count Fix ed_value
timestamp
589105179.0 0 8.415788e+07 3.450963 -110.321346 -31.50 4898039.453 -3406895.053 9.0 2.0 1.316667
589105180.0 0 8.415788e+07 3.450963 -110.321346 -31.50 4898039.836 -3406894.045 9.0 2.0 0.988333
589105181.0 0 8.415788e+07 3.450963 -110.321346 -31.50 4898039.433 -3406894.003 9.0 2.0 1.000000
589105182.0 0 8.415788e+07 3.450963 -110.321346 -31.51 4898038.460 -3406894.841 9.0 2.0 0.971667
589105183.0 0 8.415788e+07 3.450963 -110.321346 -31.51 4898039.804 -3406894.107 9.0 2.0 0.995000
Answer 3 (score: 0)
Two answers. The first assumes that every LL row has ED rows associated with it.
import numpy as np
IDcount = data.ID.values
b, c = np.unique(IDcount, return_inverse=True)
g = np.cumsum(c)
data['grps'] = g
mean_vals = \
data[data.ID == 'ED'][['ID', 'grps', 'Time']].groupby(['ID', 'grps']).mean().Time.values
df2 = data[data.ID == 'LL'].copy()
df2['ED_mean'] = mean_vals
df2:
ID No Time Lat Lon Alt East North Count Fix grps ED_mean
0 LL 0 589105179.0 24.156741 -110.321346 -31.50 4898039.453 -3406895.053 9.0 2.0 1 1.316667
7 LL 0 589105180.0 24.156741 -110.321346 -31.50 4898039.836 -3406894.045 9.0 2.0 2 0.988333
14 LL 0 589105181.0 24.156741 -110.321346 -31.50 4898039.433 -3406894.003 9.0 2.0 3 1.000000
21 LL 0 589105182.0 24.156741 -110.321346 -31.51 4898038.460 -3406894.841 9.0 2.0 4 0.971667
28 LL 0 589105183.0 24.156741 -110.321346 -31.51 4898039.804 -3406894.107 9.0 2.0 5 0.995000
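The `np.unique`/`np.cumsum` trick above produces a group number that jumps by one at every LL row: since 'ED' sorts before 'LL', `return_inverse` maps ED to 0 and LL to 1. A minimal demonstration:

```python
import numpy as np

ids = np.array(['LL', 'ED', 'ED', 'LL', 'ED'])
b, c = np.unique(ids, return_inverse=True)  # uniques ['ED','LL']; c = [1, 0, 0, 1, 0]
g = np.cumsum(c)                            # increments only at LL rows
print(g.tolist())  # [1, 1, 1, 2, 2]
```

Each ED row thus inherits the group number of the LL row above it, regardless of block size.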
Here is a similar answer that correctly handles the case where an LL row has no ED rows. It is not as fast as the first answer.
remove_idx = [15, 16, 17, 18, 19, 20, 29, 30, 31, 32, 33, 34]
data2 = data.loc[~data.index.isin(remove_idx)].copy()
IDcount = data2.ID.values
b, c = np.unique(IDcount, return_inverse=True)
g = np.cumsum(c)
data2['grps'] = g
grouping_df = data2[data2.ID == 'ED'][['ID', 'grps', 'Time']].copy()
grouped = grouping_df.groupby(['ID', 'grps']).mean()
grouped.reset_index(drop=False, inplace=True)
mean_df = grouped[['grps', 'Time']].copy()
mean_df.rename(columns={'Time': 'ED_mean'}, inplace=True)
LLs = data2[data2.ID == 'LL'].copy()
result_df = pd.merge(LLs, mean_df, on='grps', how='outer').set_index(LLs.index)
result_df:
ID No Time Lat Lon Alt East North Count Fix grps ED_mean
0 LL 0 589105179.0 24.156741 -110.321346 -31.50 4898039.453 -3406895.053 9.0 2.0 1 1.316667
7 LL 0 589105180.0 24.156741 -110.321346 -31.50 4898039.836 -3406894.045 9.0 2.0 2 0.988333
14 LL 0 589105181.0 24.156741 -110.321346 -31.50 4898039.433 -3406894.003 9.0 2.0 3 NaN
21 LL 0 589105182.0 24.156741 -110.321346 -31.51 4898038.460 -3406894.841 9.0 2.0 4 0.971667
28 LL 0 589105183.0 24.156741 -110.321346 -31.51 4898039.804 -3406894.107 9.0 2.0 5 NaN
Answer 4 (score: 0)
from itertools import count
from collections import defaultdict
from io import StringIO as sio
import pandas as pd
c = count()
text = dict(LL=[], ED=defaultdict(list))
with open('file.txt', 'r') as fh:
    cols = fh.readline()
    for line in fh.readlines():
        k, t = line.split(None, 1)
        if k == 'LL':
            i = next(c)
            text[k].append(line)
        else:
            text[k][i].append(t)
Build the DataFrames:
ll = pd.read_csv(sio('\n'.join([cols, *text['LL']])), delim_whitespace=True)
ed = pd.concat({
    i: pd.read_csv(sio('\n'.join(v)), delim_whitespace=True, header=None)
    for i, v in text['ED'].items()
}).mean(level=0).add_prefix('ed_')
ll.join(ed)
ID No Time Lat Lon Alt East North Count Fix ed_0 ed_1 ed_2
0 LL 0 589105179.0 24.156741 -110.321346 -31.50 4898039.453 -3406895.053 9 2 0 1.316667 0.0
1 LL 0 589105180.0 24.156741 -110.321346 -31.50 4898039.836 -3406894.045 9 2 0 0.988333 0.0
2 LL 0 589105181.0 24.156741 -110.321346 -31.50 4898039.433 -3406894.003 9 2 0 1.000000 0.0
3 LL 0 589105182.0 24.156741 -110.321346 -31.51 4898038.460 -3406894.841 9 2 0 0.971667 0.0
4 LL 0 589105183.0 24.156741 -110.321346 -31.51 4898039.804 -3406894.107 9 2 0 0.995000 0.0
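One caveat worth noting: `.mean(level=0)` was deprecated and later removed in modern pandas; the equivalent is `groupby(level=0).mean()`. A tiny demonstration on a frame with the same MultiIndex shape as `ed` above (made-up values):

```python
import pandas as pd

# Concatenating a dict of frames yields a MultiIndex whose outer level
# identifies each block, just like text['ED'] above.
ed = pd.concat({
    0: pd.DataFrame({0: [1.0, 2.0]}),
    1: pd.DataFrame({0: [3.0, 5.0]}),
})
means = ed.groupby(level=0).mean().add_prefix('ed_')  # replaces .mean(level=0)
print(means['ed_0'].tolist())  # [1.5, 4.0]
```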
Answer 5 (score: 0)
This solution does not require unique timestamps. A mean column is added, and the original ED values are kept in separate columns. If you don't need the individual ED measurement values, remove everything that refers to ColID.
LLs.index = np.arange(1, LLs.shape[0] + 1)   # 1-based, to line up with MyID below
EDs = EDs[['Time']]                          # for ED rows the depth value sits in 'Time'
EDs['MyID'] = np.nan
EDs['ColID'] = np.nan
last_row, new_id, col_id = -1, 0, 1
for row in EDs.iterrows():
    current_row = row[0]
    if current_row == last_row + 1:          # consecutive index -> same ED block
        EDs.loc[current_row, 'MyID'] = new_id
        EDs.loc[current_row, 'ColID'] = col_id
        col_id += 1
    else:                                    # gap in the index -> a new block begins
        col_id = 1
        new_id += 1
        EDs.loc[current_row, 'MyID'] = new_id
        EDs.loc[current_row, 'ColID'] = col_id
        col_id += 1
    last_row = current_row
data_mean = pd.DataFrame(EDs.groupby('MyID')['Time'].mean())
data_mean.rename(columns={'Time': 'Mean'}, inplace=True)
EDs = pd.pivot_table(EDs, values='Time', index='MyID', columns='ColID', aggfunc='sum')
LLs = LLs.merge(EDs, left_index=True, right_index=True)
LLs = LLs.merge(data_mean, left_index=True, right_index=True)
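A self-contained demonstration of the pivot step on synthetic values (`MyID`/`ColID` as built by a loop like the one above; the value column name `val` is made up for the sketch):

```python
import pandas as pd

eds = pd.DataFrame({
    'MyID':  [1, 1, 2, 2, 2],          # which LL block each ED value belongs to
    'ColID': [1, 2, 1, 2, 3],          # position of the value inside its block
    'val':   [1.0, 2.0, 0.5, 1.5, 1.0],
})
wide = pd.pivot_table(eds, values='val', index='MyID', columns='ColID', aggfunc='sum')
print(wide.shape)            # (2, 3) -- block 1 gets NaN in column 3
print(wide.loc[2].tolist())  # [0.5, 1.5, 1.0]
```

Blocks shorter than the widest one simply end up with NaN in the trailing columns, so variable block sizes need no special handling here.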