Question

I have a pandas DataFrame with two columns: "review" (text) and "sentiment" (1/0).

X_train = df.loc[0:25000, 'review'].values
y_train = df.loc[0:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

But after converting to NumPy arrays with the .values attribute, I get arrays with the following shapes:

print(df.shape)   #(50000, 2)
print(X_train.shape) #(25001,)
print(y_train.shape) #(25001,)
print(X_test.shape) # (25000,)
print(y_test.shape) # (25000,)

So, as you can see, .values seems to add one row. This is really strange, and I can't spot the error.

Answer 1

df.loc is label-based, so a slice includes its upper bound. If you want NumPy-like slicing, where the upper bound is excluded, use iloc:

df.iloc[:25000, 1].values # here 1 is the column of 'review' for example

With iloc, you need to give both the row and the column selection by position, as integers or integer slices.
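Applied to the split in the question, this could look like the sketch below. The 10-row frame and the split point of 5 are made-up stand-ins for the 50,000 rows and 25000, and looking up the column position with df.columns.get_loc is just one way to avoid hard-coding it. Note that with loc, the row labelled 25000 ends up in both the training and the test set, which is why the shapes add up to 50001.

```python
import pandas as pd

# Hypothetical stand-in for the 50,000-row frame in the question (10 rows here).
df = pd.DataFrame({
    "review": [f"text {i}" for i in range(10)],
    "sentiment": [i % 2 for i in range(10)],
})
split = 5  # stands in for 25000

# Label-based: labels 0 through 5 inclusive, so 6 rows.
X_train_loc = df.loc[0:split, "review"].values

# Position-based: positions 0 through 4, so 5 rows; the test set starts at position 5.
review_col = df.columns.get_loc("review")
X_train = df.iloc[:split, review_col].values
X_test = df.iloc[split:, review_col].values

print(X_train_loc.shape)  # (6,)
print(X_train.shape)      # (5,)
print(X_test.shape)       # (5,)
```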

Example

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df
   a  b
0  1  4
1  2  5
2  3  6

This is label-based, i.e. the upper bound is inclusive:

>>> df.loc[:1, 'a']
0    1
1    2
Name: a, dtype: int64

This is like slicing in NumPy, i.e. the upper bound is exclusive:

>>> df.iloc[:2, 0]
0    1
1    2
Name: a, dtype: int64
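With the default integer index, labels and positions happen to coincide, which makes the difference easy to miss. A small sketch with string labels (the index values 'x', 'y', 'z' are invented for illustration) may make it more visible:

```python
import pandas as pd

# Hypothetical frame with string labels, so labels and positions cannot be confused.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

by_label = df.loc["x":"y", "a"]   # labels 'x' through 'y', end label included
by_position = df.iloc[0:1, 0]     # position 0 only, end position excluded

print(list(by_label.index))     # ['x', 'y']
print(list(by_position.index))  # ['x']
```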

Strange error when converting from a pandas DataFrame to a NumPy array