Question

I have a data.txt file that I want to load as a data frame:

3100  0.000065  0.002070    0.000683    0.000869    0.001768
3211  0.003847  0.002695    0.025881    0.001689    0.012510
1211  0.006311  0.002108    0.000508    0.000301    0.022534
...

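The first column is the id, and the remaining columns form the attribute vector. How can I quickly read the whole file and store those remaining columns as a single vector per row, so that the resulting data frame looks like this: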

item_id     attributes
 3100        [0.000065, 0.002070, 0.000683, 0.000869, 0.001768]
 3211        [0.003847, 0.002695, 0.025881, 0.001689, 0.012510]
 ...

Do you have any ideas? Thanks!

Edit:

item_id does contain non-numeric characters, so numpy.loadtxt cannot be used directly.

Answer 1

Updated answer

This works for IDs that contain strings:

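import pandas as pd
# sep=r'\s+' splits on runs of whitespace; delim_whitespace=True is deprecated in recent pandas
df = pd.read_csv('data.txt', index_col=0, sep=r'\s+', header=None)
df2 = pd.DataFrame({'attributes': list(df.values)}, index=df.index)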

Now:

>>> df2
                                              attributes
0                                                       
3100a   [6.5e-05, 0.00207, 0.000683, 0.000869, 0.001768]
3211b  [0.003847, 0.002695, 0.025881, 0.001689, 0.01251]
1211c  [0.006311, 0.002108, 0.000508, 0.000301, 0.022...

>>> df2.loc['3100a', 'attributes']
array([  6.50000000e-05,   2.07000000e-03,   6.83000000e-04,
         8.69000000e-04,   1.76800000e-03])
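
For anyone who wants to try this without a file on disk, here is a minimal self-contained sketch of the same approach. The in-memory `sample` text and the `item_id` index name are assumptions made for illustration; they stand in for data.txt and for the column name used in the question. The last line shows one way to get a regular 2-D array back out of the per-row arrays.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample standing in for data.txt (IDs contain letters, as in the edit).
sample = io.StringIO(
    "3100a  0.000065  0.002070  0.000683  0.000869  0.001768\n"
    "3211b  0.003847  0.002695  0.025881  0.001689  0.012510\n"
    "1211c  0.006311  0.002108  0.000508  0.000301  0.022534\n"
)

# Split on any run of whitespace; the first column becomes the index.
df = pd.read_csv(sample, index_col=0, sep=r"\s+", header=None)

# One NumPy array per row, keyed by the original IDs.
df2 = pd.DataFrame({"attributes": list(df.values)}, index=df.index)
df2.index.name = "item_id"  # assumed name, taken from the question's desired output

# Stack the per-row arrays back into a 2-D matrix when one is needed.
matrix = np.vstack(df2["attributes"].tolist())
```

Note that the `attributes` column has object dtype, so vectorised arithmetic on it is slow; stacking back into `matrix` is usually the better choice for numeric work.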

Old answer

If the IDs are purely numeric, you can use NumPy's loadtxt and convert the result into a data frame:

import numpy as np
import pandas as pd
data = np.loadtxt('data.txt')
df = pd.DataFrame({'attributes': list(data[:, 1:])}, index=data[:, 0].astype(int))

Now:

>>> df
                                             attributes
3100   [6.5e-05, 0.00207, 0.000683, 0.000869, 0.001768]
3211  [0.003847, 0.002695, 0.025881, 0.001689, 0.01251]
1211  [0.006311, 0.002108, 0.000508, 0.000301, 0.022...

>>> df.loc[3100, 'attributes']
array([  6.50000000e-05,   2.07000000e-03,   6.83000000e-04,
         8.69000000e-04,   1.76800000e-03])
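
If you would rather stay with loadtxt even when the IDs contain letters, one possible variation is to read every field as text and convert only the attribute columns to float afterwards. This is a sketch under assumptions rather than a tested recipe for your file: the in-memory `sample` text is hypothetical, and the `item_id` index name is borrowed from the question.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample standing in for data.txt (IDs contain letters).
sample = io.StringIO(
    "3100a  0.000065  0.002070  0.000683  0.000869  0.001768\n"
    "3211b  0.003847  0.002695  0.025881  0.001689  0.012510\n"
    "1211c  0.006311  0.002108  0.000508  0.000301  0.022534\n"
)

# Read every field as a string so non-numeric IDs do not raise a ValueError.
raw = np.loadtxt(sample, dtype=str)

ids = raw[:, 0]                    # first column: the IDs, kept as text
values = raw[:, 1:].astype(float)  # remaining columns: converted to floats

# One float array per row, indexed by the string IDs.
df = pd.DataFrame(
    {"attributes": list(values)},
    index=pd.Index(ids, name="item_id"),
)
```

Parsing everything as text first costs an extra conversion pass, so for large files the read_csv approach is likely the faster option.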
