Question

I have a CSV file with 3 columns (emotion, pixels, Usage) and 35000 rows, for example 0,70 23 45 178 455,Training.

I read the CSV file with pandas.read_csv as pd.read_csv(filename, dtype={'emotion':np.int32, 'pixels':np.int32, 'Usage':str}).

When I try the above, I get ValueError: invalid literal for long() with base 10: '70 23 45 178 455'. How can I read the pixels column as a numpy array?

Answer 1

Please try the following code:

var sums = _(array).reduce(function(memo, e) {
    if(!memo.by_type[e.type]) {
        memo.by_type[e.type] = { type: e.type, count: 0 };
        memo.values.push(memo.by_type[e.type]);
    }
    memo.by_type[e.type].count += 1;
    return memo;
}, { by_type: { }, values: [ ] });
var what_you_want = sums.values;
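
Since the question concerns pandas, a Python sketch may be more directly applicable. One possible approach, offered as an illustration and not as this answer's method, is to pass a function through the `converters` argument of `pandas.read_csv` so that each pixel string is parsed while the file is read. The sample data, the helper name `parse_pixels`, and the variable names are assumptions made for this example.

```python
import io

import numpy as np
import pandas as pd

# Sample data shaped like the file in the question (two rows for illustration).
csv_text = """emotion,pixels,Usage
0,70 23 45 178 455,Training
1,12 34 56 78 90,Training"""


def parse_pixels(cell):
    """Hypothetical helper: turn '70 23 45' into an int32 array."""
    return np.array([int(value) for value in cell.split()], dtype=np.int32)


# converters applies parse_pixels to every raw string in the 'pixels' column,
# so no dtype is specified for that column.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"emotion": np.int32, "Usage": str},
    converters={"pixels": parse_pixels},
)

# Each cell of df['pixels'] is now a 1-D array; stack them into one 2-D array.
# This assumes every row holds the same number of pixel values.
pixels = np.stack(df["pixels"].tolist())
print(pixels.shape)
```

With a real file, `io.StringIO(csv_text)` would be replaced by the file name.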

Answer 2

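I believe it would be faster to use the vectorised str methods to split the string, create the new pixel columns as needed, and concat the new columns onto the df: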

In [175]:
# load the data
import pandas as pd
import io
t="""emotion,pixels,Usage
0,70 23 45 178 455,Training"""
df = pd.read_csv(io.StringIO(t))
df

Out[175]:
   emotion            pixels     Usage
0        0  70 23 45 178 455  Training

In [177]:
# now split the string and concat column-wise with the orig df
df = pd.concat([df, df['pixels'].str.split(expand=True).astype(int)], axis=1)
df
Out[177]:
   emotion            pixels     Usage   0   1   2    3    4
0        0  70 23 45 178 455  Training  70  23  45  178  455

If you specifically want a plain np array, you can call the .values attribute (the result is 2-D, with one row per CSV row):

In [181]:
df['pixels'].str.split(expand=True).astype(int).values

Out[181]:
array([[ 70,  23,  45, 178, 455]])
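
The same steps can be written as a standalone script. This is a minimal sketch that assumes every row contains the same number of pixel values; the second data row and the variable names are made up for illustration.

```python
import io

import pandas as pd

# Two sample rows in the question's format (the second row is invented).
csv_text = """emotion,pixels,Usage
0,70 23 45 178 455,Training
3,1 2 3 4 5,Training"""

df = pd.read_csv(io.StringIO(csv_text))

# Split on whitespace; expand=True yields one column per pixel value.
pixel_frame = df["pixels"].str.split(expand=True).astype("int32")

# to_numpy() returns a 2-D array: one row per CSV row, one column per pixel.
pixel_array = pixel_frame.to_numpy()
print(pixel_array.shape)

# Indexing a single row gives a 1-D array for that record.
first_row = pixel_array[0]
print(first_row)
```

If rows had differing numbers of values, the split would leave missing entries and the integer cast would fail, so ragged data would need separate handling.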

Answer 3

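I ran into the same problem and came up with a hack. Save your data as a .npy file. When you load it, it comes back as an ndarray. You can then use pandas.DataFrame to convert the ndarray into a DataFrame. I found this solution easier than converting from a string field. Sample code below: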

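import numpy as np
import pandas as pd

# dataframe_to_be_saved and column_names are placeholders for your own data
np.save('file_name.npy', dataframe_to_be_saved)
# the data is saved in 'file_name.npy' in your current working directory

# loading the saved file into an ndarray
# (allow_pickle=True is required when the saved array has object dtype)
arr = np.load('file_name.npy', allow_pickle=True)
# the first column becomes the index, the remaining columns become the data
df = pd.DataFrame(data=arr[:, 1:], index=arr[:, 0], columns=column_names)

# df now stores your data as a DataFrame again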
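
To make the round trip concrete, here is a small self-contained sketch. It assumes the pixel values have already been parsed into numeric columns; the column names `p0` to `p2`, the file name, and the temporary directory are all invented for the example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# A tiny all-numeric frame standing in for already-parsed pixel data.
df = pd.DataFrame(
    {"emotion": [0, 1], "p0": [70, 12], "p1": [23, 34], "p2": [45, 56]}
)

# Write to a temporary directory so the example leaves the working directory alone.
path = os.path.join(tempfile.mkdtemp(), "pixels_demo.npy")

# Save the underlying ndarray, then load it back.
np.save(path, df.to_numpy())
arr = np.load(path)

# Rebuild a DataFrame: first column as the index, the rest as data.
# Column names are not stored in the .npy file, so they are supplied again here.
restored = pd.DataFrame(arr[:, 1:], index=arr[:, 0], columns=df.columns[1:])
print(restored)
```

Because the array here is purely numeric, no pickling is involved; a frame that mixes strings and numbers would be saved as an object array and would need `allow_pickle=True` when loading.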
