Navigating a large numpy.ndarray

Date: 2017-07-19 18:20:21

Tags: python arrays python-3.x numpy

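I need to process a very large numpy.ndarray and want to learn how to navigate the data. The rows below are only a small excerpt. I tried to solve the problem with several for loops and slicing, but I got confused along the way. Can you help me with the task described at the end?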

Columns:

group;subgroup;value
1;1;356
1;2;403
1;3;370
2;2;488
2;3;568
2;4;562
2;5;478
3;1;415
3;2;418
3;3;388
3;4;414

Task: within each group, divide every value by the value that belongs to the smallest subgroup. So I have to

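  1. Find out how many groups the array contains (column 0). Here: 3
  2. Find out how many subgroups each group has and where each group's smallest subgroup is. Here, for example, group 1 has 3 subgroups, and 1 is the smallest.
  3. Divide every value in a group by the value of its smallest subgroup and insert the result into the array. This gives 1; 1; (356/356), then 1; 2; (403/356), and so on.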
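
The three steps above can be sketched in plain NumPy roughly as follows. This is only a minimal sketch under some assumptions: the array has exactly the three columns shown (group, subgroup, value), it fits in memory, and the names `order`, `first`, `counts`, `divisor` and `result` are made up for illustration.

```python
import numpy as np

# Sample rows from the question: group, subgroup, value
data = np.array([
    [1, 1, 356], [1, 2, 403], [1, 3, 370],
    [2, 2, 488], [2, 3, 568], [2, 4, 562], [2, 5, 478],
    [3, 1, 415], [3, 2, 418], [3, 3, 388], [3, 4, 414],
], dtype=float)

# Sort rows by group, then by subgroup within each group
# (np.lexsort treats the LAST key as the primary key).
order = np.lexsort((data[:, 1], data[:, 0]))
data = data[order]

# Step 1: the unique groups. `first` is the index of each group's first row,
# which after sorting is the row holding its smallest subgroup.
groups, first, counts = np.unique(
    data[:, 0], return_index=True, return_counts=True
)

# Step 2: value of the smallest subgroup, repeated once per row of its group
divisor = np.repeat(data[first, 2], counts)

# Step 3: divide and append the ratio as a fourth column
result = np.column_stack((data, data[:, 2] / divisor))

print(len(groups))
print(result)
```

Sorting first means no Python-level loop over groups is needed, which matters when the array is large.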

2 answers:

Answer 0: (score: 1)

There may be a simpler way, but using Pandas is definitely a way to avoid time-consuming loops.

Step 1: Load the numpy array into a pandas DataFrame

import numpy as np
import pandas as pd
x = [[1,1,365], [1,2,403], [1,3,370], [2,2,488],[2,3,568],[2,4,562], [3,1,415], [3,2,418], [3,3,388], [3,4,414]]
df = pd.DataFrame(x, columns = ["group", "subgroup", "value"])
print(df)

   group  subgroup  value
0      1         1    365
1      1         2    403
2      1         3    370
3      2         2    488
4      2         3    568
5      2         4    562
6      3         1    415
7      3         2    418
8      3         3    388
9      3         4    414

Step 2: Use groupby to find the value that corresponds to the smallest subgroup in each group

# idxmin() returns the row label of the smallest subgroup within each group
min_df = df.loc[df.groupby("group")["subgroup"].idxmin()]
min_df = min_df.drop(["subgroup"], axis=1)  # Remove subgroup from this new table.
min_df.columns = ["group", "value_to_divide"]  # Name columns correctly
print(min_df)

   group  value_to_divide
0      1              365
3      2              488
6      3              415

Step 3: Merge with the original DataFrame

df = pd.merge(df, min_df, how="left", on="group")
print(df)

   group  subgroup  value  value_to_divide
0      1         1    365              365
1      1         2    403              365
2      1         3    370              365
3      2         2    488              488
4      2         3    568              488
5      2         4    562              488
6      3         1    415              415
7      3         2    418              415
8      3         3    388              415
9      3         4    414              415

Step 4: Perform the division and convert back to a numpy array (if needed)

df["new_value"] = df.value/df.value_to_divide
print(df)

   group  subgroup  value  value_to_divide  new_value
0      1         1    365              365   1.000000
1      1         2    403              365   1.104110
2      1         3    370              365   1.013699
3      2         2    488              488   1.000000
4      2         3    568              488   1.163934
5      2         4    562              488   1.151639
6      3         1    415              415   1.000000
7      3         2    418              415   1.007229
8      3         3    388              415   0.934940
9      3         4    414              415   0.997590

required = np.array(df[["group", "subgroup", "new_value"]])
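
As a possible shortcut, steps 2 to 4 can probably be collapsed by sorting on the subgroup and letting `groupby(...).transform("first")` broadcast the first value of each group back to its rows. The snippet below is a sketch of that idea on the same sample rows, not a replacement for the step-by-step version; the name `divisor` is made up for illustration.

```python
import pandas as pd

# Same sample rows as in step 1 of this answer
x = [[1, 1, 365], [1, 2, 403], [1, 3, 370],
     [2, 2, 488], [2, 3, 568], [2, 4, 562],
     [3, 1, 415], [3, 2, 418], [3, 3, 388], [3, 4, 414]]
df = pd.DataFrame(x, columns=["group", "subgroup", "value"])

# Order rows so the smallest subgroup comes first within each group
df = df.sort_values(["group", "subgroup"]).reset_index(drop=True)

# transform("first") returns, for every row, the first value of its group
divisor = df.groupby("group")["value"].transform("first")
df["new_value"] = df["value"] / divisor

# Convert back to a numpy array if needed
required = df[["group", "subgroup", "new_value"]].to_numpy()
print(required)
```

This avoids the intermediate `min_df` table and the merge, at the cost of sorting the frame once.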

Answer 1: (score: 1)

Pandas is a great tool, and it is all numpy underneath, so you can turn your numpy.ndarray into a pandas DataFrame by passing it to pandas.DataFrame(). Here is an example you can run in a terminal:

import numpy as np
import pandas as pd

# Dict to Turn into Dataframe
data = {
    "group": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "subgroup": [1, 2, 3, 2, 3, 4, 5, 1, 2, 3, 4],
    "value": [356, 403, 370, 488, 568, 562, 478, 415, 418, 388, 414]
}

# Convert to DataFrame
df = pd.DataFrame(data)
normed = []
print("There are %s unique groups in the data" % len(df["group"].unique()))
# Group DataFrame by 'group' column
for i, group in df.groupby("group"):
    # Unique Subgroups in Group
    print("Group %d has %d unique subgroups" % (i, len(group["subgroup"].unique())))
    # Minimum value for a subgroup in group
    print("The minimum value for a subgroup in group %d is %0.1f" % (i, min(group["value"])))
    # Apply normalization / divide by min
    gnormed = group["value"] / min(group["value"])
    normed.extend(gnormed)
df["normed"] = normed
# See what the DataFrame looks like
print(df)


Which outputs:

There are 3 unique groups in the data
Group 1 has 3 unique subgroups
The minimum value for a subgroup in group 1 is 356.0
Group 2 has 4 unique subgroups
The minimum value for a subgroup in group 2 is 478.0
Group 3 has 4 unique subgroups
The minimum value for a subgroup in group 3 is 388.0

    group  subgroup  value    normed
0       1         1    356  1.000000
1       1         2    403  1.132022
2       1         3    370  1.039326
3       2         2    488  1.020921
4       2         3    568  1.188285
5       2         4    562  1.175732
6       2         5    478  1.000000
7       3         1    415  1.069588
8       3         2    418  1.077320
9       3         3    388  1.000000
10      3         4    414  1.067010
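
The explicit loop can likely be replaced by a single `groupby(...).transform("min")` call. The sketch below reproduces the `normed` column that way and, as an assumption about what the question intends, adds a second column (hypothetically named `by_subgroup`) that divides by the value found in the row of the smallest subgroup rather than by the smallest value.

```python
import pandas as pd

data = {
    "group": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "subgroup": [1, 2, 3, 2, 3, 4, 5, 1, 2, 3, 4],
    "value": [356, 403, 370, 488, 568, 562, 478, 415, 418, 388, 414],
}
df = pd.DataFrame(data)

# Same normalization as the loop: divide by the smallest value in each group
df["normed"] = df["value"] / df.groupby("group")["value"].transform("min")

# Variant following the question's wording: divide by the value that sits
# in the row of the smallest subgroup of each group
idx = df.groupby("group")["subgroup"].idxmin()   # row label per group
base = df.loc[idx].set_index("group")["value"]   # group -> base value
df["by_subgroup"] = df["value"] / df["group"].map(base)

print(df)
```

The two columns differ whenever the smallest value of a group does not belong to its smallest subgroup, as in groups 2 and 3 here.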

Of course, you did ask for this in pure numpy. Here is one way you could do it:

data = {
    "group": [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "subgroup": [1, 2, 3, 2, 3, 4, 5, 1, 2, 3, 4],
    "value": [356, 403, 370, 488, 568, 562, 478, 415, 418, 388, 414]
}
df = pd.DataFrame(data)
# Get Numpy Array from Pandas Object
array = df.values

# If you're not using Pandas, the relevant code starts here
# First, Get Unique Groups (with 0 as the index of the group column)
stacks = []
uniqueGroups = np.unique(array[:,0])
for groupIndex in uniqueGroups:
    # Get Group Data
    group = array[np.where(array[:,0] == groupIndex)]
    # Get Unique Subgroups (with 1 as the index of the subgroup column)
    uniqueSubgroup = np.unique(group[:,1])
    # Get Min Group Value (with 2 as the index of the values column)
    minVal = np.min(group[:,2])
    # Compute normed values
    normed = np.expand_dims(np.divide(group[:,2], minVal), 1)
    # Concatenate the normed values with the group array
    stacks.append(np.hstack((group, normed)))

# Concatenate groups back together with normed data and overwrite original numpy array
array = np.vstack(stacks)
# Print the example array
print(array)

Which outputs:

[[   1.            1.          356.            1.        ]
 [   1.            2.          403.            1.13202247]
 [   1.            3.          370.            1.03932584]
 [   2.            2.          488.            1.0209205 ]
 [   2.            3.          568.            1.18828452]
 [   2.            4.          562.            1.17573222]
 [   2.            5.          478.            1.        ]
 [   3.            1.          415.            1.06958763]
 [   3.            2.          418.            1.07731959]
 [   3.            3.          388.            1.        ]
 [   3.            4.          414.            1.06701031]]
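
For a very large array, the per-group Python loop may become the bottleneck. One loop-free alternative, sketched here under the assumption that the group labels sit in column 0 and the values in column 2, uses `np.unique(..., return_inverse=True)` together with `np.minimum.at`. The names `inverse`, `mins` and `result` are illustrative only.

```python
import numpy as np

array = np.array([
    [1, 1, 356], [1, 2, 403], [1, 3, 370],
    [2, 2, 488], [2, 3, 568], [2, 4, 562], [2, 5, 478],
    [3, 1, 415], [3, 2, 418], [3, 3, 388], [3, 4, 414],
], dtype=float)

# Map every row to the position of its group within `groups`
groups, inverse = np.unique(array[:, 0], return_inverse=True)

# Per-group minimum of the value column, computed without a Python loop
mins = np.full(groups.shape, np.inf)
np.minimum.at(mins, inverse, array[:, 2])

# Broadcast each group's minimum back to its rows and append the ratio
normed = array[:, 2] / mins[inverse]
result = np.column_stack((array, normed))
print(result)
```

Unlike the loop version, this keeps the rows in their original order and does not require the groups to be contiguous.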