I have an Nx2 matrix, for example:
M = [[10, 1000],
[11, 200],
[15, 800],
[20, 5000],
[28, 100],
[32, 3000],
[35, 3500],
[38, 100],
[50, 5000],
[51, 100],
[55, 2000],
[58, 3000],
[66, 4000],
[90, 5000]]
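I need to create an Nx3 matrix that reflects the relationship between the rows of the first matrix, as follows:
Use the right column to identify candidates for range boundaries; the condition is value >= 1000.
Applying this condition to the matrix gives: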
[[10, 1000],
[20, 5000],
[32, 3000],
[35, 3500],
[50, 5000],
[55, 2000],
[58, 3000],
[66, 4000],
[90, 5000]]
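So far I have come up with M[M[:,1] >= 1000], which works. On this new matrix, I now want to find the points in the first column whose distance to the next point is <= 10 and use them as range boundaries.
What I have so far is np.diff(M[:,0]) <= 10 (applied to the filtered matrix), which returns: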
[True, False, True, False, True, True, True, False]
This is where I am stuck. I want to use this condition to define the lower and upper boundaries of the ranges. For example:
[[10, 1000], #<- Range 1 start
[20, 5000], #<- Range 1 end (as 32 would be 12 points away)
[32, 3000], #<- Range 2 start
[35, 3500], #<- Range 2 end
[50, 5000], #<- Range 3 start
[55, 2000], #<- Range 3 cont (as 55 is only 5 points away)
[58, 3000], #<- Range 3 cont
[66, 4000], #<- Range 3 end
[90, 5000]] #<- Range 4 start and end (as there is no point +-10)
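Finally, going back to the first matrix, I want to add up the right-column values of every row that falls inside each range (boundaries included).
So I have four ranges, each defined by a start and an end boundary.
The resulting matrix looks like this, where column 0 is the start boundary, column 1 is the end boundary, and column 2 is the sum of the right-column values of matrix M between start and end.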
[[10, 20, 7000], # 7000 = 1000+200+800+5000
[32, 35, 6500], # 6500 = 3000+3500
[50, 66, 14100], # 14100 = 5000+100+2000+3000+4000
[90, 90, 5000]] # 5000 = just 5000 as upper=lower boundary
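I am stuck at the second step, after obtaining the True/False values for the range boundaries. It is not clear to me how to build the ranges from the boolean values and then add up the values within those ranges. Any suggestions would be appreciated. I am also unsure about my approach; perhaps there is a better way to get from the first matrix to the last one, possibly skipping a step?
Update: I got a bit further with the intermediate step, and I can now return the start and end values of the ranges: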
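# M here is the filtered matrix (rows with M[:,1] >= 1000)
# A row starts a range if it is the first row or follows a gap > 10
start_diffs = np.diff(M[:,0]) > 10
start_indexes = np.insert(start_diffs, 0, True)
# A row ends a range if it is the last row or precedes a gap > 10.
# np.append adds True at the end; np.insert(..., -1, True) would insert it before the last element.
end_diffs = np.diff(M[:,0]) > 10
end_indexes = np.append(end_diffs, True)
start_values = M[:,0][start_indexes]
end_values = M[:,0][end_indexes]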
print(np.array([start_values, end_values]).T)
This returns:
[[10 20]
[32 35]
[50 66]
[90 90]]
What is still missing is a way to use these ranges to compute the sums of the right column of matrix M.
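For the missing step, one possible NumPy-only sketch is shown here. It is not part of the original post. It assumes the first column of M is sorted in ascending order, and the names cand, gaps, lo, hi and csum are illustrative. The idea is to locate each start and end value in the full matrix with np.searchsorted and then take differences of a running total of the right column.

```python
import numpy as np

M = np.array([[10, 1000], [11, 200], [15, 800], [20, 5000], [28, 100],
              [32, 3000], [35, 3500], [38, 100], [50, 5000], [51, 100],
              [55, 2000], [58, 3000], [66, 4000], [90, 5000]])

# Candidate boundary rows: right column >= 1000
cand = M[M[:, 1] >= 1000]

# True where the distance to the next candidate is larger than 10
gaps = np.diff(cand[:, 0]) > 10
starts = cand[:, 0][np.insert(gaps, 0, True)]   # first row, or row after a gap
ends = cand[:, 0][np.append(gaps, True)]        # last row, or row before a gap

# Positions of the boundaries in the full matrix (assumes M[:, 0] is sorted)
lo = np.searchsorted(M[:, 0], starts, side="left")
hi = np.searchsorted(M[:, 0], ends, side="right")

# Running total with a leading 0, so csum[hi] - csum[lo] sums rows lo..hi-1
csum = np.concatenate(([0], np.cumsum(M[:, 1])))
sums = csum[hi] - csum[lo]

result = np.column_stack((starts, ends, sums))
print(result)
```

Because the sums come from a single cumulative sum, no explicit Python loop over the ranges is needed. If the first column were not sorted, M would have to be sorted first.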
Answer 0 (score: 1)
If you are willing to use pandas, the solution below works, although in retrospect it seems a bit like overkill:
import numpy as np
import pandas as pd

# Initial array
M = np.array([[10, 1000],
[11, 200],
[15, 800],
[20, 5000],
[28, 100],
[32, 3000],
[35, 3500],
[38, 100],
[50, 5000],
[51, 100],
[55, 2000],
[58, 3000],
[66, 4000],
[90, 5000]])
# Build a DataFrame with default integer index and column labels
df = pd.DataFrame(M)
# Get a subset of rows that represent potential interval edges
subset = df[df[1] >= 1000].copy()
# If a row is the first row in a new range, flag it with 1.
# Then cumulatively sum these 1s. This labels each row with a
# unique integer, one per range
subset[2] = (subset[0].diff() > 10).astype(int).cumsum()
# Get the start and end values of each range
edges = subset.groupby(2).agg({0: ['first', 'last']})
edges
0
first last
2
0 10 20
1 32 35
2 50 66
3 90 90
# Build a pandas IntervalIndex out of these interval edges
tups = list(edges.itertuples(index=False, name=None))
idx = pd.IntervalIndex.from_tuples(tups, closed='both')
# Build a Series that maps each interval to a unique range number
mapping = pd.Series(range(len(idx)), index=idx)
# Apply this mapping to create a new column of the original df.
# In pandas >= 0.25, idx.contains(i) returns one boolean per interval, hence .any()
df[2] = [mapping.loc[i] if idx.contains(i).any() else None for i in df[0]]
df
0 1 2
0 10 1000 0.0
1 11 200 0.0
2 15 800 0.0
3 20 5000 0.0
4 28 100 NaN
5 32 3000 1.0
6 35 3500 1.0
7 38 100 NaN
8 50 5000 2.0
9 51 100 2.0
10 55 2000 2.0
11 58 3000 2.0
12 66 4000 2.0
13 90 5000 3.0
# Group by this new column, get edges of each interval,
# sum values, and get the underlying numpy array
df.groupby(2).agg({0: ['first', 'last'], 1: 'sum'}).values
array([[ 10, 20, 7000],
[ 32, 35, 6500],
[ 50, 66, 14100],
[ 90, 90, 5000]])
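A somewhat shorter variant of the same idea, offered only as a sketch: instead of the list comprehension, IntervalIndex.get_indexer can label every row with the position of the interval that contains it, returning -1 for rows outside all intervals. The column names x, value, range and range_id are illustrative and not part of the original answer, and the sketch assumes a pandas version that supports named aggregation (0.25 or later).

```python
import numpy as np
import pandas as pd

M = np.array([[10, 1000], [11, 200], [15, 800], [20, 5000], [28, 100],
              [32, 3000], [35, 3500], [38, 100], [50, 5000], [51, 100],
              [55, 2000], [58, 3000], [66, 4000], [90, 5000]])
df = pd.DataFrame(M, columns=["x", "value"])  # illustrative column names

# Candidate boundary rows, labelled with one integer per range
cand = df[df["value"] >= 1000]
range_id = (cand["x"].diff() > 10).cumsum().rename("range_id")
edges = cand.groupby(range_id)["x"].agg(["first", "last"])

# Closed intervals [first, last]; a single-point range such as [90, 90] is allowed
idx = pd.IntervalIndex.from_arrays(edges["first"].to_numpy(),
                                   edges["last"].to_numpy(),
                                   closed="both")

# Position of the containing interval for each row, -1 if there is none
df["range"] = idx.get_indexer(df["x"].to_numpy())

# Drop rows outside every interval, then aggregate per range
inside = df[df["range"] >= 0]
result = inside.groupby("range").agg(
    start=("x", "first"), end=("x", "last"), total=("value", "sum")
).to_numpy()
print(result)
```

Using -1 as the "no interval" marker keeps the label column as integers, so no NaN handling is needed before grouping.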