How can I perform this NumPy computation more efficiently?

Question

My text file looks like this:

Label the columns as:

s1 a s2 r t

I also have another array, filled with dummy values for simplicity:

>>> V = np.array([10.,20.])

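I want to run a calculation on these numbers with good performance. For each s1, I want the maximum over a of the sum of t*(r+V[s1]). For example:

For s1 = 0, a = 0, the sum is 2*(1+10) + 1*(3+10) = 35
For s1 = 0, a = 1, the sum is 1*(4+10) + 3*(2+10) = 50

So the maximum is 50, which is the value I want as the output for s1=0.
Note that the 10 in the calculation above is V[s1].

If the last three rows were not in the file, then for s1=1 I would simply return 3*(5+20)=75, where 20 is V[s1]. The final desired result would then be [50,75].

So I thought it would be best to load the data into numpy as follows (considering only the values for s1 = 0, for simplicity):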

>>> c1=[[   [ [0,1,2],[1,3,1] ],[ [0,4,1],[1,2,3] ]   ]]
>>> import numpy as np
>>> c1arr = np.array(c1)
>>> c1arr  #when I actually load from file, its not loading as this (check Q2 below)
array([[[[0, 1, 2],
         [1, 3, 1]],
        [[0, 4, 1],
         [1, 2, 3]]]])

>>> np.sum(c1arr[0,0][:,2]*(c1arr[0,0][:,1]+V))  #sum over t*(r+V)
45.0

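Q1. I can't figure out how to modify the code above to get the numpy array [45.0,80.0], so that I can take numpy.max of it.

Q2. When I actually load the file, I can't get it to load as c1arr, as noted in the comment above. Instead, this is what I get: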

>>> type(a) #a is populated by parsing file
<class 'list'>

>>> print(a)
[[[[0, -0.9, 0.3], [1, 0.9, 0.6]], [[0, -0.2, 0.6], [1, 0.7, 0.3]]], [[[1, 0.2, 1.0]], [[0, -0.8, 1.0]]]]

>>> np.array(a) #note that this is not same as c1arr above
<string>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
array([[list([[0, -0.9, 0.3], [1, 0.9, 0.6]]),
        list([[0, -0.2, 0.6], [1, 0.7, 0.3]])],
       [list([[1, 0.2, 1.0]]),
        list([[0, -0.8, 1.0]])]], dtype=object)

How can I fix this?

Q3. Is there an overall better approach, for example laying out the numpy array differently? (I'm not allowed to use pandas, only numpy.)

Answer 1

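I think the most intuitive and maintainable approach is to use Pandas, where you can assign names to columns. Another important factor is that grouping is much easier in Pandas.

Since your input sample contains only integers, I defined V as an integer array as well:

import numpy as np
import pandas as pd

V = np.array([10, 20])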

I read your input file as follows:

df = pd.read_csv('Input.txt', sep=' ', names=['s1', 'a', 's2', 'r', 't'])

(Print df to see what has been read.)

Then, to get the result for each combination of s1 and a, you can run:

result = df.groupby(['s1', 'a']).apply(lambda grp:
    (grp.t * (grp.r + V[grp.s1])).sum())

Note that this code is easy to read because it refers to named columns.

The result is:

s1  a
0   0     35
    1     50
1   0    138
    1    146
dtype: int64

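Each result is an integer because V is an array of int type. If you define V as in your post (an array of float), the results will be of float type as well (your choice).

To get the maximum result for each s1, run the following (the older spelling result.max(level=0) was removed in pandas 2.0):

result.groupby(level=0).max()

This time the result is: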

s1
0     50
1    146
dtype: int64
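
The same grouping can also be written without `apply`, by computing the per-row term first and then aggregating. The following is only a sketch: the five sample rows are an assumption, reconstructed from the worked sums in the question (the full input file is not reproduced here, and the s2 value in the last row is a placeholder), so its output matches the question's [50,75] case rather than the table above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample (columns: s1 a s2 r t), reconstructed from the worked
# sums in the question; the s2 value of the last row is a placeholder.
txt = """0 0 0 1 2
0 0 1 3 1
0 1 0 4 1
0 1 1 2 3
1 0 0 5 3"""

V = np.array([10, 20])
df = pd.read_csv(io.StringIO(txt), sep=' ', names=['s1', 'a', 's2', 'r', 't'])

# Per-row term t * (r + V[s1]); V is indexed with the whole s1 column at once.
df['term'] = df['t'] * (df['r'] + V[df['s1'].to_numpy()])

# Sum within each (s1, a) group, then take the maximum within each s1.
sums = df.groupby(['s1', 'a'])['term'].sum()
best = sums.groupby(level='s1').max()
print(best)
```

Built-in aggregations such as `sum` and `max` usually run faster than a Python lambda passed to `apply`, which may matter for large files.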

NumPy version

If you really are restricted to NumPy, there is a solution too, although it is harder to read and to update.

Read your input file:
```
data = np.genfromtxt('Input.txt')
```
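Initially I tried int type, as in the pandas solution, but one of your comments states that the 2 rightmost columns are float. Since a plain NumPy array must have a single type, the whole array has to be of float type.

Then run the following code: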
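
As a small illustration of that point, the sketch below loads two rows taken from the nested list printed in the question (s1 = 0, a = 0) from an in-memory string instead of Input.txt. Loading the flat five-column table also avoids the ragged nested list from Q2. Every column comes back as float, so the s1 column has to be cast to int before it can index V.

```python
import io

import numpy as np

# Two rows for s1 = 0, a = 0, taken from the nested list printed in the
# question (columns: s1 a s2 r t).
txt = "0 0 0 -0.9 0.3\n0 0 1 0.9 0.6\n"

data = np.genfromtxt(io.StringIO(txt))  # default dtype is float for all columns
print(data.dtype, data.shape)

# Float values cannot be used as indices, so cast the s1 column first.
s1 = data[:, 0].astype(int)
V = np.array([10., 20.])
offsets = V[s1]  # one V value per row
```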

res = []
# First level grouping - by "s1" (column 0)
for s1 in np.unique(data[:,0]).astype(int):
    dat1 = data[np.where(data[:,0] == s1)]
    res2 = []
    # Second level grouping - by "a" (column 1)
    for a in np.unique(dat1[:,1]):
        dat2 = dat1[np.where(dat1[:,1] == a)]
        # t - column 4, r - column 3
        res2.append((dat2[:,4] * (dat2[:,3] + V[s1])).sum())
    res.append([s1, max(res2)])
result = np.array(res)

The result (a NumPy array) is:

array([[  0.,  50.],
       [  1., 146.]])

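The left column contains the s1 values; the right column contains the maximum over the second-level groups.

NumPy version with a structured array

Actually, you can also use a NumPy structured array. The code is then at least more readable, because you refer to column names rather than column numbers.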
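
Since the question asks about performance, here is a loop-free sketch of the same two-level grouping. It assumes that s1 and a are small non-negative integers, so that each (s1, a) pair can be packed into a single integer key. The sample rows are again an assumption reconstructed from the worked sums in the question, not the full input file.

```python
import numpy as np

# Hypothetical sample (columns: s1 a s2 r t), reconstructed from the worked
# sums in the question; the s2 value of the last row is a placeholder.
data = np.array([
    [0, 0, 0, 1, 2],
    [0, 0, 1, 3, 1],
    [0, 1, 0, 4, 1],
    [0, 1, 1, 2, 3],
    [1, 0, 0, 5, 3],
], dtype=float)
V = np.array([10., 20.])

s1 = data[:, 0].astype(int)
a = data[:, 1].astype(int)
term = data[:, 4] * (data[:, 3] + V[s1])  # t * (r + V[s1]) for every row

# Pack each (s1, a) pair into one integer key and label the groups.
n_a = a.max() + 1
keys, group_id = np.unique(s1 * n_a + a, return_inverse=True)

# Sum of the terms within each (s1, a) group.
sums = np.bincount(group_id, weights=term)

# Maximum of those sums within each s1.
s1_vals, s1_id = np.unique(keys // n_a, return_inverse=True)
best = np.full(s1_vals.size, -np.inf)
np.maximum.at(best, s1_id, sums)

result = np.column_stack([s1_vals, best])
print(result)
```

Here `np.bincount` with `weights` does the grouped sum and `np.maximum.at` does the grouped maximum, so no Python-level loop runs over the groups.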

Read the array, passing a dtype that gives the column names and types:

data = np.genfromtxt('Input.txt', dtype=[('s1', '<i4'),
    ('a', '<i4'), ('s2', '<i4'), ('r', '<f8'), ('t', '<f8')])

Then run:

res = []
# First level grouping - by "s1"
for s1 in np.unique(data['s1']):
    dat1 = data[np.where(data['s1'] == s1)]
    res2 = []
    # Second level grouping - by "a"
    for a in np.unique(dat1['a']):
        dat2 = dat1[np.where(dat1['a'] == a)]
        res2.append((dat2['t'] * (dat2['r'] + V[s1])).sum())
    res.append([s1, max(res2)])
result = np.array(res)
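
For a quick check without the input file, the structured-array approach can be tried on an in-memory string. This is a sketch only: `max_per_s1` is a hypothetical helper name, boolean masks replace `np.where`, and the sample rows are an assumption reconstructed from the worked sums in the question.

```python
import io

import numpy as np

# Hypothetical sample (columns: s1 a s2 r t), reconstructed from the worked
# sums in the question; the s2 value of the last row is a placeholder.
txt = """0 0 0 1 2
0 0 1 3 1
0 1 0 4 1
0 1 1 2 3
1 0 0 5 3"""

dtype = [('s1', '<i4'), ('a', '<i4'), ('s2', '<i4'), ('r', '<f8'), ('t', '<f8')]
data = np.genfromtxt(io.StringIO(txt), dtype=dtype)
V = np.array([10., 20.])


def max_per_s1(data, V):
    """Hypothetical helper: map each s1 to its largest per-a sum of t*(r+V[s1])."""
    best = {}
    for s1 in np.unique(data['s1']):
        rows = data[data['s1'] == s1]  # first-level group
        sums = [
            (grp['t'] * (grp['r'] + V[s1])).sum()
            for grp in (rows[rows['a'] == a] for a in np.unique(rows['a']))
        ]
        best[int(s1)] = float(max(sums))
    return best


best = max_per_s1(data, V)
print(best)
```

Returning a dict keeps the s1 keys as integers, whereas stacking them into one array with the float maxima converts them to float.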
