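OK, this is interesting.
I ran the same code several times and got a different accuracy_score each time.
I realized that I had not passed any random_state value when doing the train/test split, so I set random_state=0 and got a consistent accuracy_score of 82%. But...
Then I wanted to try a different random_state number, so I set random_state=128, and the accuracy_score became 84%.
Now I need to understand why this happens and how random_state affects the accuracy of the model.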
The output is below:
1> Without random_state:
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
[[90 22]
[21 46]]
0.7597765363128491
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
[[104 16]
[ 14 45]]
0.8324022346368715
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
[[90 18]
[12 59]]
0.8324022346368715
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
[[99 9]
[19 52]]
0.8435754189944135
2> With random_state = 128 (accuracy_score = 84%)
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
[[106 13]
[ 15 45]]
0.8435754189944135
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
[[106 13]
[ 15 45]]
0.8435754189944135
3> With random_state = 0 (accuracy_score = 82%)
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
[[93 17]
[15 54]]
0.8212290502793296
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
[[93 17]
[15 54]]
0.8212290502793296
Answer 0 (score: 1)
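Essentially, random_state makes sure that your code outputs the same results on every run by performing exactly the same data split each time. This is helpful for your initial train/test split and for writing code that others can replicate exactly.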
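The first thing to understand is that if you don't set random_state, the data is split differently on each run, so your training and test sets differ from run to run. This may not make a huge difference, but it does cause slight variations in the model parameters, accuracy, and so on. If you set random_state to the same value every time, for example random_state=0, the data is split the same way every time.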
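A minimal sketch of this first point, assuming scikit-learn and NumPy are installed. The toy arrays and the helper name split_test_rows are made up for illustration and are not the Titanic data from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset (an assumption for illustration).
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2


def split_test_rows(seed):
    """Return the test-set rows produced by one call to train_test_split."""
    _, X_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=seed)
    return X_test


# Two calls with the same seed select exactly the same test rows.
same_a = split_test_rows(0)
same_b = split_test_rows(0)
print(np.array_equal(same_a, same_b))

# With random_state=None the shuffle is reseeded on each call,
# so these two test sets will usually (not always) differ.
unseeded_a = split_test_rows(None)
unseeded_b = split_test_rows(None)
print(np.array_equal(unseeded_a, unseeded_b))
```

The first comparison is always equal; the second depends on the shuffle and is only likely, not guaranteed, to differ.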
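The second thing to understand is that each random_state value produces a different split and therefore different behavior. So if you want to be able to replicate your results, you need to keep random_state at the same value.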
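To see how much the score can move with the split alone, here is a rough sketch that evaluates one fixed model over several seeds. The synthetic data from make_classification and the choice of LogisticRegression are assumptions for illustration; the question's own model and data are not shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption; not the Titanic dataset from the question).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

scores = {}
for seed in (0, 1, 2, 128):
    # Only the split changes between iterations; the model settings do not.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores[seed] = accuracy_score(y_test, model.predict(X_test))

for seed, score in scores.items():
    print(f"random_state={seed}: accuracy={score:.3f}")
```

Any difference between the printed scores comes from which rows landed in the test set, not from a better or worse model.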
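The third thing to understand is that several pieces of your pipeline may contain randomness. For example, train_test_split accepts random_state, but so does RandomForestClassifier. So to get exactly the same results every time, you need to set random_state on every piece that has randomness in it.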
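A sketch of seeding both random components, assuming scikit-learn is installed. The synthetic data, the run_once helper, and n_estimators=50 are illustrative choices, not taken from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=42)


def run_once(seed):
    """Split, fit, and predict with the same seed passed to both random steps."""
    X_train, X_test, y_train, _ = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # The forest draws bootstrap samples and feature subsets at random,
    # so it needs its own random_state as well.
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    return model.predict(X_test)


first = run_once(0)
second = run_once(0)
print(np.array_equal(first, second))
```

Seeding only the split would still leave the forest free to build different trees on each run, so the predictions could change even though the data split did not.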
If you use random_state for your initial train/test split, set it once and keep using that same split from then on, to avoid overfitting to your test set.
Generally speaking, you can use cross-validation to assess your model's accuracy and not worry too much about random_state.
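As a rough illustration of the cross-validation suggestion, the sketch below scores a model on five folds with cross_val_score. The synthetic data and LogisticRegression are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption; substitute your own features and labels).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

model = LogisticRegression(max_iter=1000)

# cv=5 on a classifier uses stratified folds without shuffling,
# so no random_state is involved in choosing the folds.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print(f"fold accuracies: {scores.round(3)}")
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Reporting the mean together with the standard deviation gives a sense of how much a single train/test split could mislead you in either direction.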
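One very important point: you should not use random_state to try to improve your model's accuracy. Picking the seed that scores best on your test set means, by definition, that you are overfitting to that particular split, and the model will not generalize as well to unseen data.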