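I need to work with a very large numpy.ndarray and I want to learn how to navigate the data. Below is just a small excerpt.
I tried to solve the problem with several for loops and slicing, but I somehow got confused.
Could you help me with the task at the end?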
Columns:
group;subgroup;value
1;1;356
1;2;403
1;3;370
2;2;488
2;3;568
2;4;562
2;5;478
3;1;415
3;2;418
3;3;388
3;4;414
Task: within each group, divide every value by the value that corresponds to the smallest subgroup. So I have to
Answer 0 (score: 1)
There may be a simpler way, but using Pandas is definitely a way to avoid time-consuming loops.
Step 1: Put the numpy array into a pandas DataFrame
import numpy as np
import pandas as pd
x = [[1,1,365], [1,2,403], [1,3,370], [2,2,488],[2,3,568],[2,4,562], [3,1,415], [3,2,418], [3,3,388], [3,4,414]]
df = pd.DataFrame(x, columns = ["group", "subgroup", "value"])
print(df)
   group  subgroup  value
0      1         1    365
1      1         2    403
2      1         3    370
3      2         2    488
4      2         3    568
5      2         4    562
6      3         1    415
7      3         2    418
8      3         3    388
9      3         4    414
Step 2: Run the groupby method to find the value that corresponds to the smallest subgroup in each group
# idxmin returns, per group, the index label of the row with the smallest subgroup
min_df = df.loc[df.groupby("group")["subgroup"].idxmin()]
min_df = min_df.drop(["subgroup"], axis =1) # Remove subgroup from this new table.
min_df.columns = ["group", "value_to_divide"] # Name columns correctly
print(min_df)
   group  value_to_divide
0      1              365
3      2              488
6      3              415
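One way to look up the row of the smallest subgroup per group is `idxmin`, which returns index labels that can be passed straight to `df.loc`. The following is a minimal sketch on made-up toy data (the numbers and the name `labels` are illustrative only, not taken from the question), deliberately unsorted to show that the lookup does not depend on row order:

```python
import pandas as pd

# Toy data: the smallest subgroup is NOT the first row of its group.
df = pd.DataFrame({
    "group":    [1, 1, 2, 2],
    "subgroup": [2, 1, 5, 3],
    "value":    [10, 20, 30, 40],
})

# For each group, the index label of the row holding the smallest subgroup.
labels = df.groupby("group")["subgroup"].idxmin()
print(labels.tolist())                   # [1, 3]

# Use those labels to fetch the matching values.
print(df.loc[labels, "value"].tolist())  # [20, 40]
```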
Step 3: Merge with the original DataFrame
df = pd.merge(df, min_df, how="left", on="group")
print(df)
   group  subgroup  value  value_to_divide
0      1         1    365              365
1      1         2    403              365
2      1         3    370              365
3      2         2    488              488
4      2         3    568              488
5      2         4    562              488
6      3         1    415              415
7      3         2    418              415
8      3         3    388              415
9      3         4    414              415
Step 4: Perform the division and convert back to a numpy array (if needed)
df["new_value"] = df.value/df.value_to_divide
print(df)
   group  subgroup  value  value_to_divide  new_value
0      1         1    365              365   1.000000
1      1         2    403              365   1.104110
2      1         3    370              365   1.013699
3      2         2    488              488   1.000000
4      2         3    568              488   1.163934
5      2         4    562              488   1.151639
6      3         1    415              415   1.000000
7      3         2    418              415   1.007229
8      3         3    388              415   0.934940
9      3         4    414              415   0.997590
required = np.array(df[["group", "subgroup", "new_value"]])
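The steps of this answer can probably be collapsed into a single expression. What follows is a minimal sketch rather than part of the original answer: it assumes that sorting the rows by group and subgroup is acceptable, and it uses `groupby(...).transform("first")` to repeat each group's first value (after sorting, the one belonging to the smallest subgroup) on every row of that group. It uses the full sample from the question, and the name `base` is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question: group, subgroup, value
x = [[1, 1, 356], [1, 2, 403], [1, 3, 370],
     [2, 2, 488], [2, 3, 568], [2, 4, 562], [2, 5, 478],
     [3, 1, 415], [3, 2, 418], [3, 3, 388], [3, 4, 414]]
df = pd.DataFrame(x, columns=["group", "subgroup", "value"])

# Sort so that the smallest subgroup comes first inside every group.
df = df.sort_values(["group", "subgroup"]).reset_index(drop=True)

# "first" takes the first value per group; transform repeats it for each row.
base = df.groupby("group")["value"].transform("first")
df["new_value"] = df["value"] / base

# Back to a numpy array, if needed.
required = df[["group", "subgroup", "new_value"]].to_numpy()
```

This avoids the intermediate table and the merge, at the cost of reordering the rows if they were not already sorted.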
Answer 1 (score: 1)
Pandas is a great tool, and it is numpy underneath, so you can convert your numpy.ndarray into a pandas DataFrame by calling pandas.DataFrame(). Here is an example you can run in a terminal:
import numpy as np
import pandas as pd
# Dict to Turn into Dataframe
data = {
    "group": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "subgroup": [1, 2, 3, 2, 3, 4, 5, 1, 2, 3, 4],
    "value": [356, 403, 370, 488, 568, 562, 478, 415, 418, 388, 414]
}
# Convert to DataFrame
df = pd.DataFrame(data)
normed = []
print("There are %s unique groups in the data" % len(df["group"].unique()))
# Group DataFrame by 'group' column
for i, group in df.groupby("group"):
    # Unique Subgroups in Group
    print("Group %d has %d unique subgroups" % (i, len(group["subgroup"].unique())))
    # Minimum value for a subgroup in group
    print("The minimum value for a subgroup in group %d is %0.1f" % (i, min(group["value"])))
    # Apply normalization / divide by min
    gnormed = group["value"] / min(group["value"])
    normed.extend(gnormed)
df["normed"] = normed
# See what the DataFrame looks like
print(df)
This outputs:
There are 3 unique groups in the data
Group 1 has 3 unique subgroups
The minimum value for a subgroup in group 1 is 356.0
Group 2 has 4 unique subgroups
The minimum value for a subgroup in group 2 is 478.0
Group 3 has 4 unique subgroups
The minimum value for a subgroup in group 3 is 388.0
    group  subgroup  value    normed
0       1         1    356  1.000000
1       1         2    403  1.132022
2       1         3    370  1.039326
3       2         2    488  1.020921
4       2         3    568  1.188285
5       2         4    562  1.175732
6       2         5    478  1.000000
7       3         1    415  1.069588
8       3         2    418  1.077320
9       3         3    388  1.000000
10      3         4    414  1.067010
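If the explicit loop is not needed for the printed diagnostics, the same normalization can likely be written without it. This is a minimal sketch, not the original author's code: it uses `groupby(...).transform("min")` to repeat each group's minimum on every row. Note that, like the loop, it divides by the smallest value in each group, which is not necessarily the value of the smallest subgroup.

```python
import pandas as pd

data = {
    "group": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "subgroup": [1, 2, 3, 2, 3, 4, 5, 1, 2, 3, 4],
    "value": [356, 403, 370, 488, 568, 562, 478, 415, 418, 388, 414],
}
df = pd.DataFrame(data)

# transform("min") returns a Series aligned with df, holding each row's group minimum.
group_min = df.groupby("group")["value"].transform("min")
df["normed"] = df["value"] / group_min
```

Because `transform` keeps the original index, the result stays aligned with `df` even if the rows are not sorted by group.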
Of course, you did ask for this in pure numpy. Here is one way you could do it:
data = {
    "group": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "subgroup": [1, 2, 3, 2, 3, 4, 5, 1, 2, 3, 4],
    "value": [356, 403, 370, 488, 568, 562, 478, 415, 418, 388, 414]
}
df = pd.DataFrame(data)
# Get Numpy Array from Pandas Object
array = df.values
# If you're not using Pandas, the relevant code starts here
# First, Get Unique Groups (with 0 as the index of the group column)
stacks = []
uniqueGroups = np.unique(array[:,0])
for groupIndex in uniqueGroups:
    # Get Group Data
    group = array[np.where(array[:,0] == groupIndex)]
    # Get Unique Subgroups (with 1 as the index of the subgroup column)
    uniqueSubgroup = np.unique(group[:,1])
    # Get Min Group Value (with 2 as the index of the values column)
    minVal = np.min(group[:,2])
    # Compute normed values
    normed = np.expand_dims(np.divide(group[:,2], minVal), 1)
    # Concatenate the normed values with the group array
    stacks.append(np.hstack((group, normed)))
# Concatenate groups back together with normed data and overwrite original numpy array
array = np.vstack(stacks)
# Print the example array
print(array)
This will output:
[[ 1. 1. 356. 1. ]
[ 1. 2. 403. 1.13202247]
[ 1. 3. 370. 1.03932584]
[ 2. 2. 488. 1.0209205 ]
[ 2. 3. 568. 1.18828452]
[ 2. 4. 562. 1.17573222]
[ 2. 5. 478. 1. ]
[ 3. 1. 415. 1.06958763]
[ 3. 2. 418. 1.07731959]
[ 3. 3. 388. 1. ]
[ 3. 4. 414. 1.06701031]]
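For a very large array, the Python-level loop over groups may become the bottleneck. Below is a hedged sketch of a loop-free variant in pure numpy. It is not the method from this answer: it follows the question's wording and divides by the value of the smallest subgroup in each group rather than by the group minimum. It assumes columns 0, 1 and 2 hold group, subgroup and value, and the names `order`, `first_idx`, `inverse` and `base` are illustrative.

```python
import numpy as np

# Sample data from the question: group, subgroup, value
array = np.array([
    [1, 1, 356], [1, 2, 403], [1, 3, 370],
    [2, 2, 488], [2, 3, 568], [2, 4, 562], [2, 5, 478],
    [3, 1, 415], [3, 2, 418], [3, 3, 388], [3, 4, 414],
], dtype=float)

# Sort rows by group, then by subgroup (lexsort treats the LAST key as primary).
order = np.lexsort((array[:, 1], array[:, 0]))
array = array[order]

# first_idx: row of the first (smallest-subgroup) entry of each group.
# inverse:   for every row, the position of its group among the unique groups.
_, first_idx, inverse = np.unique(array[:, 0], return_index=True, return_inverse=True)

# Broadcast each group's reference value back to all of its rows.
base = array[first_idx, 2][inverse]

# Append the normalized values as a fourth column.
result = np.column_stack((array, array[:, 2] / base))
```

To reproduce the group-minimum behaviour of the loop instead, `base` could be replaced by `np.minimum.reduceat(array[:, 2], first_idx)[inverse]`, which relies on the rows being sorted by group.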