Question

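Scenario

I have read a tab-separated CSV into a DataFrame, and it now needs to be converted to a numpy array for clustering, without changing the types.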

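Problem

So far, following the references I tried (listed below), I have not managed to get the output in the required form. The two columns I am trying to fetch hold int64 / float64 values, as shown below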

         uid   iid       rat
0        196   242  3.000000
1        186   302  3.000000
2         22   377  1.000000

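For the time being I am only interested in iid and rat, which I want to pass to the Kmeans.fit() method, and without the exponent (e+02 style) notation in them. I need the data in the format below

Expected format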

[[242, 3.000000],
[302, 3.000000],
[22, 1.000000]]

Unsuccessful attempts

X = values[:, 1:2]
Y = values[:, 2:3]
someArray = np.array([X,Y])
print(someArray)

and the result on execution is not what I need

[[[  2.42000000e+02]
  [  3.02000000e+02]
  [  3.77000000e+02]
  ..., 
  [  1.35200000e+03]
  [  1.62600000e+03]
  [  1.65900000e+03]]
 [[  3.00000000e+00]
  [  3.00000000e+00]
  [  1.00000000e+00]
  ..., 
  [  1.00000000e+00]
  [  1.00000000e+00]
  [  1.00000000e+00]]]

References that have not worked so far

This one
This two
This three
This four

Edit 1

Tried np_df = np.genfromtxt('AllData.csv', delimiter='\t', unpack=True) and got this

[[             nan   1.96000000e+02   1.86000000e+02 ...,   4.79000000e+02
    4.79000000e+02   4.79000000e+02]
 [             nan   2.42000000e+02   3.02000000e+02 ...,   1.36000000e+03
    1.39400000e+03   1.65200000e+03]
 [             nan   3.00000000e+00   3.00000000e+00 ...,   2.00000000e+00
    1.92803605e+00   1.00000000e+00]]

Answer 1

Use label-based selection and the .values attribute of the resulting pandas object, which will be some sort of numpy array:

>>> df
   uid  iid  rat
0  196  242  3.0
1  186  302  3.0
2   22  377  1.0
>>> df.loc[:,['iid','rat']]
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0
>>> df.loc[:,['iid','rat']].values
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])

Note that your integer column will be promoted to float.
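
A minimal sketch of that promotion. The DataFrame here is built by hand from the three sample rows, which is an assumption for illustration (the question loads it from a CSV). A numpy array has a single dtype, so selecting an int64 column together with a float64 column yields float64, while a single integer column keeps its integer dtype:

```python
import pandas as pd

# Sample rows copied from the question; constructed by hand for illustration.
df = pd.DataFrame({
    "uid": [196, 186, 22],
    "iid": [242, 302, 377],
    "rat": [3.0, 3.0, 1.0],
})

# Inside the DataFrame each column keeps its own dtype.
print(df.dtypes)

# Mixed int64 + float64 selection: one common dtype for the whole array.
both = df.loc[:, ["iid", "rat"]].values
print(both.dtype)

# A single integer column is not promoted.
iid_only = df["iid"].values
print(iid_only.dtype)
```

The integer values themselves are unchanged by the promotion; 242 simply becomes 242.0.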

Also note that this particular selection can be done in different ways:

>>> df.iloc[:, 1:] # integer-position based
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0
>>> df[['iid','rat']] # plain indexing performs column-based selection
   iid  rat
0  242  3.0
1  302  3.0
2  377  1.0

I prefer the label-based approach because it is more explicit.
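
For comparison, here is a sketch of why the attempt in the question produced a nested result. The `values` array below is a hand-built stand-in for the one in the question (an assumption for illustration). Wrapping two column slices in np.array adds a new leading axis, whereas joining them along the column axis gives one row per sample:

```python
import numpy as np

# Stand-in for the question's `values` array (uid, iid, rat); assumed for illustration.
values = np.array([[196, 242, 3.0],
                   [186, 302, 3.0],
                   [22, 377, 1.0]])

X = values[:, 1:2]  # shape (3, 1)
Y = values[:, 2:3]  # shape (3, 1)

# np.array([X, Y]) stacks along a new first axis: shape (2, 3, 1).
stacked = np.array([X, Y])
print(stacked.shape)

# Joining along the column axis gives one row per sample: shape (3, 2).
side_by_side = np.hstack([X, Y])
print(side_by_side.shape)

# Slicing both columns at once gives the same array more directly.
direct = values[:, 1:3]
print(np.array_equal(side_by_side, direct))
```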

Edit

The reason you are not seeing commas is the way numpy arrays are printed:

>>> df[['iid','rat']].values
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])
>>> print(df[['iid','rat']].values)
[[ 242.    3.]
 [ 302.    3.]
 [ 377.    1.]]

Really, this is the difference between the str and repr results of a numpy array:

>>> print(repr(df[['iid','rat']].values))
array([[ 242.,    3.],
       [ 302.,    3.],
       [ 377.,    1.]])
>>> print(str(df[['iid','rat']].values))
[[ 242.    3.]
 [ 302.    3.]
 [ 377.    1.]]
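
If a comma-separated display is wanted, a sketch along these lines may help. It assumes the goal is only to change how the array is shown; the array is hand-built from the sample rows for illustration. Both calls affect the text output only, and the stored float64 values stay the same:

```python
import numpy as np

# Hand-built copy of the selected columns; assumed for illustration.
arr = np.array([[242.0, 3.0],
                [302.0, 3.0],
                [377.0, 1.0]])

# Display-only formatting with an explicit separator.
text = np.array2string(arr, separator=", ", suppress_small=True)
print(text)

# Converting to nested Python lists also prints with commas.
print(arr.tolist())
```

The array holds the same numbers however it is printed, so the formatting step should not be needed before clustering.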

Answer 2

Why not import the 'csv' directly as a numpy array?

import numpy as np

def read_file(fname):
    # "\t" is the tab character; "/t" would be read as a literal two-character delimiter
    return np.genfromtxt(fname, delimiter="\t", comments="%", unpack=True)
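
A hedged usage sketch. It assumes the file has a single header row (which would explain the nan entries in the question's Edit 1) and that only the iid and rat columns are wanted. An in-memory buffer stands in for the real file. Leaving out unpack=True keeps one row per sample, since unpack=True transposes the result:

```python
import io
import numpy as np

# In-memory stand-in for the tab-separated file; rows copied from the question.
sample = io.StringIO(
    "uid\tiid\trat\n"
    "196\t242\t3.0\n"
    "186\t302\t3.0\n"
    "22\t377\t1.0\n"
)

# skip_header=1 drops the header row; usecols=(1, 2) keeps iid and rat.
data = np.genfromtxt(sample, delimiter="\t", skip_header=1, usecols=(1, 2))
print(data.shape)
print(data.dtype)
```

As with the pandas route, the result is a single float64 array, so the integer iid values are stored as floats.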
