Question

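I have set the outcome variable y to a column from a CSV file. It loads correctly and prints fine, but once I slice it with y = y[x:], I start getting NaN values.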

y = previous_games_stats['Unnamed: 7'] #outcome variable (win/loss)
y = y[9:] #causes NaN for outcome variables

Later in the file I print the outcome column. final_df is a DataFrame that does not have an outcome variable yet, so I set it like this:

final_df['outcome'] = y
print(final_df['outcome'])

But the result is:

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9      L

It looks like only the last value is correct (they should all be 'W' or 'L').

How do I line up my DataFrame correctly so that I do not get NaN?

Full code:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
from array import array

np.random.seed(0)

iris = load_iris()

previous_games_stats = pd.read_csv('stats/2016-2017 CANUCKS STATS.csv', header=1)
numGamesToLookBack = 10
#                        axis=1) #Predictor variables

X = previous_games_stats[['GF', 'GA']]
count = 0
final_df = pd.DataFrame(columns=['GF', 'GA'])

#final_y = pd.DataFrame(columns=['Unnamed: 7'])

y = previous_games_stats['Unnamed: 7'] #outcome variable (win/loss)
y = y[numGamesToLookBack-1:]



for game in range(0, 10):
  X = previous_games_stats[['GF', 'GA']]
  X = X[count:numGamesToLookBack] #num games to look back
  stats_feature_names = list(X.columns.values)

  df = pd.DataFrame(iris.data, columns=iris.feature_names)

  stats_df = pd.DataFrame(X, columns=stats_feature_names).sum().to_frame().T
  final_df = pd.concat([final_df, stats_df], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

  count+=1
  numGamesToLookBack+=1



print("final_df:\n", final_df)



stats_target_names = np.array(['Win', 'Loss']) #don't need?...just a label it looks like

df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

final_df['outcome'] = y
final_df['outcome'].update(y) #ADDED UPDATE TO FIX NaN



df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 #for iris


final_df['is_train'] = np.random.uniform(0, 1, len(final_df)) <= .65


train, test = df[df['is_train']==True], df[df['is_train']==False]
stats_train = final_df[final_df['is_train']==True]
stats_test = final_df[final_df['is_train']==False]


features = df.columns[:4]
stats_features = final_df.columns[:2]


y = pd.factorize(train['species'])[0]
stats_y = pd.factorize(stats_train['outcome'])[0]

clf = RandomForestClassifier(n_jobs=2, random_state=0)
stats_clf = RandomForestClassifier(n_jobs=2, random_state=0)


clf.fit(train[features], y)
stats_clf.fit(stats_train[stats_features], stats_y)

stats_clf.predict_proba(stats_test[stats_features])[0:10]



preds = iris.target_names[clf.predict(test[features])]
stats_preds = stats_target_names[stats_clf.predict(stats_test[stats_features])]




pd.crosstab(stats_test['outcome'], stats_preds, rownames=['Actual Outcome'], colnames=['Predicted Outcome'])
print("~~~confusion matrix~~~\nColumns represent what we predicted for the outcome of the game, and rows represent the actual outcome of the game.\n")
print(pd.crosstab(stats_test['outcome'], stats_preds, rownames=['Actual Outcome'], colnames=['Predicted Outcome']))

Answer 1

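This is expected. After slicing, y keeps its original index labels and has no entries for the first 9 labels, so when it is assigned back, pandas aligns on the index and fills those rows with NaN.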
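
The alignment behaviour can be reproduced with a small sketch. The data here is made up for illustration (the `outcomes` Series stands in for the CSV column, and the `GF`/`GA` values are placeholders); only the index labels matter.

```python
import pandas as pd

# Hypothetical stand-in for the 'Unnamed: 7' column: 19 rows, labels 0..18.
outcomes = pd.Series(["W", "L"] * 9 + ["L"])

# Same rows as y[9:] in the question; the slice keeps labels 9..18.
y = outcomes.iloc[9:]

# Placeholder frame with the default labels 0..9, like final_df.
final_df = pd.DataFrame({"GF": range(10), "GA": range(10)})

# Column assignment aligns on index labels, not on position.
final_df["outcome"] = y

print(list(y.index))                      # labels carried over from the slice
print(final_df["outcome"].isna().sum())   # rows with no matching label
print(final_df.loc[9, "outcome"])         # the only label the two share
```

Only label 9 exists in both `y` and `final_df`, which is why a single row receives a value.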

If the column is new and y has the same length as the DataFrame, assign the underlying NumPy array instead:

final_df['outcome'] = y.values
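
As a sketch of the equal-length case, again with made-up data: converting the Series to a plain array drops its labels, so the values are written row by row. `to_numpy()` is used here and should behave like `.values` for this purpose.

```python
import pandas as pd

# Hypothetical stand-in data: 19 outcomes, labels 0..18.
outcomes = pd.Series(["W", "L"] * 9 + ["L"])
y = outcomes.iloc[9:]                      # 10 values, labels 9..18

final_df = pd.DataFrame({"GF": range(10), "GA": range(10)})

# A plain array has no index, so pandas cannot align and assigns by position.
# The lengths must match exactly, otherwise pandas raises a ValueError.
final_df["outcome"] = y.to_numpy()

print(final_df["outcome"].tolist())
```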

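If the lengths differ, it is a bit more complicated, because the assigned values must have the same length as the DataFrame: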

df = pd.DataFrame({'a':range(10), 'b':range(20,30)}).astype(str).radd('a')
print (df)
    a    b
0  a0  a20
1  a1  a21
2  a2  a22
3  a3  a23
4  a4  a24
5  a5  a25
6  a6  a26
7  a7  a27
8  a8  a28
9  a9  a29

y = df['a']
y = y[4:]
print (y)
4    a4
5    a5
6    a6
7    a7
8    a8
9    a9
Name: a, dtype: object

len(final_df) < len(y):

Truncate y to the length of final_df, then convert it to a NumPy array so that no index alignment takes place:

final_df = pd.DataFrame({'new':range(100, 105)})
final_df['s'] = y.iloc[:len(final_df)].values
print (final_df)
   new   s
0  100  a4
1  101  a5
2  102  a6
3  103  a7
4  104  a8

len(final_df) > len(y):

Build a new Series from the values of y, using the first len(y) index labels of final_df1 as its index:

final_df1 = pd.DataFrame({'new':range(100, 110)})
final_df1['s'] = pd.Series(y.values, index=final_df1.index[:len(y)])
print (final_df1)
   new    s
0  100   a4
1  101   a5
2  102   a6
3  103   a7
4  104   a8
5  105   a9
6  106  NaN
7  107  NaN
8  108  NaN
9  109  NaN
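
Both cases can be folded into one small helper. This is only a sketch: `assign_by_position` is a hypothetical name, not a pandas function, and it assumes the frame's index labels are unique.

```python
import pandas as pd

def assign_by_position(frame, values, column):
    """Return a copy of `frame` with `values` attached row by row.

    The original index labels of `values` are ignored. Extra values are
    dropped, and rows without a value are left as NaN.
    """
    out = frame.copy()
    # Pair the first len(out) values with the first len(values) frame labels;
    # both slices end up with min(len(out), len(values)) elements.
    positional = pd.Series(
        values.to_numpy()[: len(out)],
        index=out.index[: len(values)],
    )
    out[column] = positional
    return out

# Same shape of data as in the examples above.
y = pd.Series(["a4", "a5", "a6", "a7", "a8", "a9"], index=range(4, 10))

shorter = assign_by_position(pd.DataFrame({"new": range(100, 105)}), y, "s")
longer = assign_by_position(pd.DataFrame({"new": range(100, 110)}), y, "s")

print(shorter)
print(longer)
```

Returning a copy keeps the input frame unchanged; assigning the result back to the original name gives the same effect as the in-place versions above.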

Pandas outcome variable is NaN
