我已将结果变量y
设置为csv中的列。它正确加载并在我打印y
时有效,但当我使用y = y[x:]
时,我开始将NaN
作为值。
y = previous_games_stats['Unnamed: 7'] #outcome variable (win/loss)
y = y[9:] #causes NaN for outcome variables
然后在文件中我打印结果列。 final_df
是一个尚未设置结果变量的数据框,因此我将其设置如下:
final_df['outcome'] = y
print(final_df['outcome'])
但结果是:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 L
看起来最后一个值是正确的(它们都应该是'W'或'L')。
如何正确排列我的数据框以便我没有NaN?
整个代码:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
np.random.seed(0)
from array import array
iris=load_iris()
previous_games_stats = pd.read_csv('stats/2016-2017 CANUCKS STATS.csv', header=1)
numGamesToLookBack = 10;
axis=1) #Predictor variables
X = previous_games_stats[['GF', 'GA']]
count = 0
final_df = pd.DataFrame(columns=['GF', 'GA'])
#final_y = pd.DataFrame(columns=['Unnamed: 7'])
y = previous_games_stats['Unnamed: 7'] #outcome variable (win/loss)
y = y[numGamesToLookBack-1:]
for game in range(0, 10):
X = previous_games_stats[['GF', 'GA']]
X = X[count:numGamesToLookBack] #num games to look back
stats_feature_names = list(X.columns.values)
df = pd.DataFrame(iris.data, columns=iris.feature_names)
stats_df = pd.DataFrame(X, columns=stats_feature_names).sum().to_frame().T
final_df = final_df.append(stats_df, ignore_index=True)
count+=1
numGamesToLookBack+=1
print("final_df:\n", final_df)
stats_target_names = np.array(['Win', 'Loss']) #don't need?...just a label it looks like
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
final_df['outcome'] = y
final_df['outcome'].update(y) #ADDED UPDATE TO FIX NaN
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 #for iris
final_df['is_train'] = np.random.uniform(0, 1, len(final_df)) <= .65
train, test = df[df['is_train']==True], df[df['is_train']==False]
stats_train = final_df[final_df['is_train']==True]
stats_test = final_df[final_df['is_train']==False]
features = df.columns[:4]
stats_features = final_df.columns[:2]
y = pd.factorize(train['species'])[0]
stats_y = pd.factorize(stats_train['outcome'])[0]
clf = RandomForestClassifier(n_jobs=2, random_state=0)
stats_clf = RandomForestClassifier(n_jobs=2, random_state=0)
clf.fit(train[features], y)
stats_clf.fit(stats_train[stats_features], stats_y)
stats_clf.predict_proba(stats_test[stats_features])[0:10]
preds = iris.target_names[clf.predict(test[features])]
stats_preds = stats_target_names[stats_clf.predict(stats_test[stats_features])]
pd.crosstab(stats_test['outcome'], stats_preds, rownames=['Actual Outcome'], colnames=['Predicted Outcome'])
print("~~~confusion matrix~~~\nColumns represent what we predicted for the outcome of the game, and rows represent the actual outcome of the game.\n")
print(pd.crosstab(stats_test['outcome'], stats_preds, rownames=['Actual Outcome'], colnames=['Predicted Outcome']))
答案 0 :(得分:1)
这是预期的,因为y
没有第一个9
值的索引(没有数据),所以在分配回来后得到NaN
。
如果列是新的,y
的长度与df
的长度相同,则指定numpy数组:
final_df['outcome'] = y.values
但如果长度不同,则有点复杂,因为需要相同的长度:
df = pd.DataFrame({'a':range(10), 'b':range(20,30)}).astype(str).radd('a')
print (df)
a b
0 a0 a20
1 a1 a21
2 a2 a22
3 a3 a23
4 a4 a24
5 a5 a25
6 a6 a26
7 a7 a27
8 a8 a28
9 a9 a29
y = df['a']
y = y[4:]
print (y)
4 a4
5 a5
6 a6
7 a7
8 a8
9 a9
Name: a, dtype: object
<强> len(final_df) < len(y)
强>:
按y
过滤final_df
,然后转换为numpy数组,以便不对齐索引:
final_df = pd.DataFrame({'new':range(100, 105)})
final_df['s'] = y.iloc[:len(final_df)].values
print (final_df)
new s
0 100 a4
1 101 a5
2 102 a6
3 103 a7
4 104 a8
<强> len(final_df) > len(y)
强>:
通过过滤的Series
值创建新的index
:
final_df1 = pd.DataFrame({'new':range(100, 110)})
final_df1['s'] = pd.Series(y.values, index=final_df1.index[:len(y)])
print (final_df1)
new s
0 100 a4
1 101 a5
2 102 a6
3 103 a7
4 104 a8
5 105 a9
6 106 NaN
7 107 NaN
8 108 NaN
9 109 NaN