
Question

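I would be very grateful for help with the following. I read data from a csv file into a list of lists and then convert it to a numpy array. I am struggling to change a set of values in the numpy array to floats, because I want to sum a set of numbers in each row and insert the total as a new element of that row.

I am able to convert them and create a copy with the changed data type, but I cannot seem to do it in place (in the original numpy array).

Here is a small example of what the csv data looks like and what I am trying to achieve.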

import numpy as np

list_of_lists = [["Africa", "1990", "0", "", "32.6"], ["Asia", "2006", "32.4", "5.5", "46.6"],
                 ["Europe", "2011", "5.4", "", "55.4"]]

array = np.array(list_of_lists)

array[array == ""] = np.nan

print(array)

# This doesn't change it in place

array[:, 2:].astype(np.float32, copy=False)

# And neither does this

array[:, 2:] = array[:, 2:].astype(np.float32)

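I have read several similar questions, but none of the approaches worked for me. I thought it would be as easy as setting copy=False, but apparently it is not.

If anyone could explain this to me, I would really appreciate it.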

Answer 1

It looks like you need a structured array to handle multiple data types.

import numpy as np

list_of_lists = [["Africa", "1990", "0", "", "32.6"], ["Asia", "2006", "32.4", "5.5", "46.6"],
                 ["Europe", "2011", "5.4", "", "55.4"]]

temp = np.array(list_of_lists)
temp[temp==''] = 0

# np.float was removed in NumPy 1.24; np.float64 is the equivalent dtype
dtypes = np.dtype([('name','S10'),
    ('val1', np.float64),
    ('val2', np.float64),
    ('val3', np.float64),
    ('val4', np.float64)])

# Each row becomes one record (tuple) of the structured array
array = np.array(list(map(tuple, temp)), dtype=dtypes)

# Now you can modify the structured array
array[['val3', 'val4']]=20
array[0]['name'] = 'Australia'

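The catch is that you can pretend these fields are columns, but they are not: the array is just a sequence of records and its shape is (3,). I would recommend switching to a pandas DataFrame.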
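To make the shape point concrete, here is a minimal sketch. The two-field dtype and the names `dt`, `rec` and `total` are made up for this illustration and are not part of the answer's code.

```python
import numpy as np

# Hypothetical two-field dtype, chosen only for this illustration.
dt = np.dtype([("name", "U10"), ("val", np.float64)])
rec = np.array([("Africa", 32.6), ("Asia", 46.6), ("Europe", 55.4)], dtype=dt)

print(rec.shape)          # one axis only: one record per row, no column axis
print(rec["val"].dtype)   # a single field behaves like an ordinary 1-D array

# Arithmetic works within one field.
total = rec["val"].sum()
```

Indexing by field name gives a regular array, which is why per-field operations are fine even though the array as a whole has no second dimension.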

import pandas as pd

array = pd.DataFrame(list_of_lists)
array.replace('', '0', inplace=True)
array[array.columns[2:]] = array[array.columns[2:]].astype(float)

array.dtypes

# 0 object
# 1 object
# 2 float64
# 3 float64
# 4 float64
# dtype: object
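Building on the DataFrame approach, the row totals the question asks for could be computed roughly as follows. This is a sketch, not the answerer's code: it swaps the replace-with-'0' step for `pd.to_numeric(errors="coerce")`, which keeps missing values as NaN, and the column name `total` is an assumption.

```python
import pandas as pd

list_of_lists = [["Africa", "1990", "0", "", "32.6"],
                 ["Asia", "2006", "32.4", "5.5", "46.6"],
                 ["Europe", "2011", "5.4", "", "55.4"]]

df = pd.DataFrame(list_of_lists)
value_cols = df.columns[2:]

# errors="coerce" turns anything unparseable, including "", into NaN.
df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")

# sum skips NaN by default (skipna=True); "total" is a made-up column name.
df["total"] = df[value_cols].sum(axis=1)
print(df)
```

Whether a missing value should count as zero in the total depends on the data; here it is simply skipped.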

Answer 2

You cannot change the dtype in place.

In [59]: arr = np.array(list_of_lists)                                                         
In [60]: arr                                                                                   
Out[60]: 
array([['Africa', '1990', '0', '', '32.6'],
       ['Asia', '2006', '32.4', '5.5', '46.6'],
       ['Europe', '2011', '5.4', '', '55.4']], dtype='<U6')

The common dtype for the inputs is a string.

Replacing "" with nan puts the string representation of nan into the array:

In [62]: arr[arr == ""] = np.nan                                                                                       
In [63]: arr                                                                                   
Out[63]: 
array([['Africa', '1990', '0', 'nan', '32.6'],
       ['Asia', '2006', '32.4', '5.5', '46.6'],
       ['Europe', '2011', '5.4', 'nan', '55.4']], dtype='<U6')
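A smaller, self-contained sketch of the same effect (the two-element array is made up for illustration): the assigned value is cast to the array's string dtype, and only a later conversion turns it back into a float nan.

```python
import numpy as np

arr = np.array(["", "5.5"])   # string dtype, inferred from the inputs
arr[arr == ""] = np.nan       # np.nan is cast to the string 'nan'

first = arr[0]                # a string, not a float
as_float = arr.astype(float)  # a new float array; 'nan' parses to nan
print(repr(first), as_float)
```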

Looking at part of the underlying data buffer:

In [64]: arr.tobytes()                                                                         
Out[64]: b'A\x00\x00\x00f\x00\x00\x00r\x00\x00\x00i\x00\x00\x00c\x00\x00\x00a\x00\x00\x001\x00\x00\x009\x00\x00\x009\x00\x00\....'

You can see the actual characters.

A slice of the array is a view, but the astype conversion is a new array with its own data buffer.

In [65]: arr[:,2:]                                                                             
Out[65]: 
array([['0', 'nan', '32.6'],
       ['32.4', '5.5', '46.6'],
       ['5.4', 'nan', '55.4']], dtype='<U6')
In [66]: arr[:,2:].astype(float)                                                               
Out[66]: 
array([[ 0. ,  nan, 32.6],
       [32.4,  5.5, 46.6],
       [ 5.4,  nan, 55.4]])
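One way to check the view-versus-copy distinction directly is `np.shares_memory`. The small array below is made up for illustration; the point is that `copy=False` cannot avoid a copy when the dtype actually changes.

```python
import numpy as np

arr = np.array([["Africa", "0", "32.6"],
                ["Asia", "32.4", "46.6"]])

view = arr[:, 1:]                            # basic slice: shares arr's buffer
converted = view.astype(float, copy=False)   # dtype differs, so a new buffer is made

shares_view = np.shares_memory(arr, view)
shares_conv = np.shares_memory(arr, converted)
print(shares_view, shares_conv)
print(arr.dtype, converted.dtype)            # arr keeps its string dtype
```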

You cannot write Out[66] back into arr without it being converted back to strings.

You could make an object dtype array:

In [67]: arr = np.array(list_of_lists, dtype=object)
In [68]: arr
Out[68]: 
array([['Africa', '1990', '0', '', '32.6'],
       ['Asia', '2006', '32.4', '5.5', '46.6'],
       ['Europe', '2011', '5.4', '', '55.4']], dtype=object)
In [69]: arr = np.array(list_of_lists, dtype=object)
In [70]: arr[arr == ""] = np.nan
In [71]: arr
Out[71]: 
array([['Africa', '1990', '0', nan, '32.6'],
       ['Asia', '2006', '32.4', '5.5', '46.6'],
       ['Europe', '2011', '5.4', nan, '55.4']], dtype=object)
In [72]: arr[:,2:] = arr[:,2:].astype(float)
In [73]: arr
Out[73]: 
array([['Africa', '1990', 0.0, nan, 32.6],
       ['Asia', '2006', 32.4, 5.5, 46.6],
       ['Europe', '2011', 5.4, nan, 55.4]], dtype=object)

The dtype is still object, but the type of the elements can change. That is because an object dtype array is a glorified (or debased) list. You gain some flexibility, but lose most of the numeric speed of numpy.

A structured array (compound dtype), as shown in the other answer, is another possibility. Such an array is easy to make when loading a csv (with np.genfromtxt). You still cannot change dtypes in place. And you cannot do math across the fields of a structured array.

pandas

In [153]: df = pd.DataFrame(list_of_lists)
In [154]: df
Out[154]: 
        0     1     2    3     4
0  Africa  1990     0        32.6
1    Asia  2006  32.4  5.5  46.6
2  Europe  2011   5.4        55.4
In [156]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
0    3 non-null object
1    3 non-null object
2    3 non-null object
3    3 non-null object
4    3 non-null object
dtypes: object(5)
memory usage: 248.0+ bytes

Convert column dtypes:

In [158]: df[2].astype(float)
In [162]: df[4]=df[4].astype(float)
In [164]: df
Out[164]: 
        0     1     2    3     4
0  Africa  1990   0.0        32.6
1    Asia  2006  32.4  5.5  46.6
2  Europe  2011   5.4        55.4
In [165]: df.dtypes
Out[165]: 
0     object
1     object
2    float64
3     object
4    float64
dtype: object

Column 3 needs a nan conversion before it can be converted.

There are better pandas programmers here; I focus on numpy.
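Staying in numpy, one possible way to get the row totals the question asks for is to keep the numeric columns in their own float array instead of converting in place. This is a sketch under assumptions: the names `labels`, `raw`, `nums`, `totals` and `nums_with_total` are made up, and `np.nansum` treats missing values as zero, which may or may not be what the data calls for.

```python
import numpy as np

list_of_lists = [["Africa", "1990", "0", "", "32.6"],
                 ["Asia", "2006", "32.4", "5.5", "46.6"],
                 ["Europe", "2011", "5.4", "", "55.4"]]

# Keep the text columns and the numeric columns in separate arrays.
labels = np.array([row[:2] for row in list_of_lists])
raw = np.array([row[2:] for row in list_of_lists], dtype=object)

raw[raw == ""] = np.nan           # object dtype stores a real float nan
nums = raw.astype(float)          # new float64 array, shape (3, 3)

totals = np.nansum(nums, axis=1)  # nan entries are treated as zero
nums_with_total = np.column_stack([nums, totals])
print(nums_with_total)
```

The float array keeps numpy's numeric speed, and `labels` can be matched back to it by row index.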

Changing the dtype of a set of values in a numpy array
