Question

I have a NumPy 2D array as shown below:

data.dat

X1  X2  X3  X4
1   1   1   1
2   2   4   2
3   3   9   3
4   4   16  4
5   5   25  5
6   6   36  6
7   nan 49  7
8   nan 64  8
9   nan 81  nan
10  nan nan nan

Now how do I output the last element of each column ignoring nan in the array. I tried without success the code:

A[~np.isnan(A)][-1]

Code used

import numpy as np
with open('data.dat', "r") as data:
    while True:
        line = data.readline()
        if not line.startswith('#'):
            break
    data_header = [i for i in line.strip().split('\t') if i]
A = np.genfromtxt('data.dat', names = data_header, dtype = float, delimiter = '\t')

Answer 1

If A were a plain NumPy array of dtype 'float' (instead of a structured array) then you could use

import numpy as np
nan = np.nan
A = np.array([[  1.,   1.,   1.,   1.],
              [  2.,   2.,   4.,   2.],
              [  3.,   3.,   9.,   3.],
              [  4.,   4.,  16.,   4.],
              [  5.,   5.,  25.,   5.],
              [  6.,   6.,  36.,   6.],
              [  7.,  nan,  49.,   7.],
              [  8.,  nan,  64.,   8.],
              [  9.,  nan,  81.,  nan],
              [ 10.,  nan,  nan,  nan]])

print(A[(~np.isnan(A)).cumsum(axis=0).argmax(axis=0), np.arange(A.shape[1])])

which yields

array([ 10.,   6.,  81.,   8.])

Given the structured array, such as the one generated by

import numpy as np

with open('data.dat', "r") as data:
    # per Padraic Cunningham's suggestion
    A = np.genfromtxt("data.dat", names=True, delimiter = '\t')

I think the easiest way to obtain the desired result is to view the structured array as a plain NumPy array of dtype 'float':

B = A.view('float').reshape(A.shape[0], -1)

and then proceed as before:

print(B[(~np.isnan(B)).cumsum(axis=0).argmax(axis=0), np.arange(B.shape[1])])

How it works:

Given a plain NumPy array of dtype 'float', such as

In [357]: B
Out[357]: 
array([[ nan,  nan,  nan,  nan],
       [  1.,   1.,   1.,   1.],
       [  2.,   2.,   4.,   2.],
       [  3.,   3.,   9.,   3.],
       [  4.,   4.,  16.,   4.],
       [  5.,   5.,  25.,   5.],
       [  6.,   6.,  36.,   6.],
       [  7.,  nan,  49.,   7.],
       [  8.,  nan,  64.,   8.],
       [  9.,  nan,  81.,  nan],
       [ 10.,  nan,  nan,  nan]])

we can use np.isnan to find where the non-nan values are:

In [358]: ~np.isnan(B)
Out[358]: 
array([[False, False, False, False],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True, False,  True,  True],
       [ True, False,  True,  True],
       [ True, False,  True, False],
       [ True, False, False, False]], dtype=bool)

Now we can use cumsum to compute a cumulative sum for each column (False is treated as 0, True as 1):

In [359]: (~np.isnan(B)).cumsum(axis=0)
Out[359]: 
array([[ 0,  0,  0,  0],
       [ 1,  1,  1,  1],
       [ 2,  2,  2,  2],
       [ 3,  3,  3,  3],
       [ 4,  4,  4,  4],
       [ 5,  5,  5,  5],
       [ 6,  6,  6,  6],
       [ 7,  6,  7,  7],
       [ 8,  6,  8,  8],
       [ 9,  6,  9,  8],
       [10,  6,  9,  8]])

Notice that the maximum value in each column is due to the value achieved by the last True in each column.

Therefore, we can find the index corresponding the the first occurrance of the maximum value in each column by using np.argmax:

In [360]: (~np.isnan(B)).cumsum(axis=0).argmax(axis=0)
Out[360]: array([10,  6,  9,  8])

This gives the row index number for each column. To find the corresponding value in the array we could then use:

In [361]: B[(~np.isnan(B)).cumsum(axis=0).argmax(axis=0), np.arange(B.shape[1])]
Out[361]: array([ 10.,   6.,  81.,   8.])

Answer 2

You could try:

import numpy as np

ss = """1   1   1   1
        2   2   4   2
        3   3   9   3
        4   4   16  4
        5   5   25  5
        6   6   36  6
        7   nan 49  7
        8   nan 64  8
        9   nan 81  nan
        10  nan nan nan"""

# build Matrix from string:
AA = np.array([[float(c) for c in rw.split()]
               for rw in ss.splitlines()])

# extract last finte column element:
cl_finite = np.array([cl[np.isfinite(cl)][-1] for cl in AA.T])
print(cl_finite)

Answer 3

This works fine, at least if all columns have a non-nan element.

In [36]: [l[~np.isnan(l)][-1] for l in A.transpose()]
Out[36]: [10.0, 6.0, 81.0, 8.0]

How to output last column element of NumPy 2D array ignoring nan in Python?

3 个答案: