How random_state contributes to accuracy

Time: 2020-08-25 18:29:47

Tags: python machine-learning scikit-learn data-science

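OK, this is interesting. I ran the same code several times and got a different accuracy_score each time. I realized I had not set any random_state value for train_test_split, so I used random_state=0 and got a consistent accuracy_score of 82%. But then I wanted to try a different random_state number, so I set random_state=128 and the accuracy_score became 84%. Now I need to understand why this happens and how random_state affects the model's accuracy. The outputs are below:

1> Without random_state: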

runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')

: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

[[90 22]
 [21 46]]
0.7597765363128491

runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')

: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

[[104  16]
 [ 14  45]]
0.8324022346368715

runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')

: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

[[90 18]
 [12 59]]
0.8324022346368715

runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')

: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

[[99  9]
 [19 52]]
0.8435754189944135

2> With random_state = 128 (accuracy_score = 84%)

runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')

: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

[[106  13]
 [ 15  45]]
0.8435754189944135

runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')

: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

[[106  13]
 [ 15  45]]
0.8435754189944135

3> With random_state = 0 (accuracy_score = 82%)

runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')

: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

[[93 17]
 [15 54]]
0.8212290502793296

runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')

: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

[[93 17]
 [15 54]]
0.8212290502793296

1 answer:

Answer 0 (score: 1)

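Essentially, random_state makes your code produce the same result on every run by performing exactly the same data split each time. This is helpful for your initial train/test split and for writing code that others can reproduce exactly.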

Splitting the data the same way vs. differently

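The first thing to understand is that if you do not set random_state, the data is split differently on every run, so your training and test sets contain different rows each time. The difference is usually not large, but it does cause slight variations in the fitted model parameters, the accuracy, and so on. If you set random_state to the same value each time, for example random_state=0, the data is split the same way on every run.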
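To make the point above concrete, here is a minimal sketch on a toy array (the data and variable names are illustrative assumptions, not the asker's Titanic data). It compares two seeded calls to train_test_split with two unseeded ones.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy features: 10 rows, 2 columns
y = np.arange(10)                 # toy labels, one per row

# Same seed twice: both calls pick exactly the same test rows.
_, _, _, y_test_a = train_test_split(X, y, test_size=0.3, random_state=0)
_, _, _, y_test_b = train_test_split(X, y, test_size=0.3, random_state=0)
same_split = np.array_equal(y_test_a, y_test_b)
print("seeded splits identical:", same_split)

# No seed: each call draws a fresh shuffle, so the test rows usually differ.
_, _, _, y_test_c = train_test_split(X, y, test_size=0.3)
_, _, _, y_test_d = train_test_split(X, y, test_size=0.3)
print("unseeded test labels:", y_test_c, y_test_d)
```

The unseeded pair can occasionally coincide by chance on such a small array, which is why only the seeded comparison is treated as guaranteed.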

Each random_state results in a different split

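The second thing to understand is that each value of random_state results in a different split and therefore different behavior. So if you want to be able to replicate your results, you need to keep random_state at the same value.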
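A rough way to see this effect is to score one fixed model on several seeded splits. The sketch below uses a synthetic dataset from make_classification and a LogisticRegression as stand-ins. Both are assumptions for illustration, since the asker's data and model are not shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 891 rows, so a 20% test set has 179 rows.
X, y = make_classification(n_samples=891, n_features=8, random_state=42)

scores = {}
for seed in (0, 1, 128):
    # Only the split seed changes between iterations.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores[seed] = accuracy_score(y_test, model.predict(X_test))

print(scores)  # one accuracy per seed; the values typically differ by a few points
```

The model is the same in every iteration, so any spread in the scores comes from which rows happened to land in the test set, not from one seed being "better".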

Your model can have multiple random_state pieces

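The third thing to understand is that several parts of your model may have randomness in them. For example, train_test_split accepts random_state, but so does RandomForestClassifier. So to get exactly the same results every time, you need to set random_state for every piece of your model that has randomness in it.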
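As a sketch of what seeding every random piece looks like, the example below fixes both the split and the forest. The helper run_once and the synthetic data are hypothetical and exist only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), not the asker's dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

def run_once(split_seed, model_seed):
    """Hypothetical helper: split, fit a forest, and return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=split_seed
    )
    clf = RandomForestClassifier(n_estimators=50, random_state=model_seed)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

# Both sources of randomness are seeded, so repeated runs agree exactly.
first = run_once(split_seed=0, model_seed=0)
second = run_once(split_seed=0, model_seed=0)
print(first, second, first == second)
```

Passing model_seed=None while keeping split_seed fixed would leave the forest unseeded, and the accuracy could then change between runs even though the split does not.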

Conclusion

If you use random_state for your initial train/test split, set it once and keep using that same split, so that you avoid overfitting to your test set.

Generally speaking, you can use cross-validation to assess your model's accuracy and not worry too much about random_state.
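A minimal cross-validation sketch, assuming synthetic data and a RandomForestClassifier as a placeholder model, could look like this. Averaging over several folds makes the estimate depend much less on any single split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption), not the asker's dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Five stratified folds; the seeds here only make the run repeatable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

cv_scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print("per-fold accuracy:", cv_scores)
print("mean and std:", cv_scores.mean(), cv_scores.std())
```

The standard deviation across folds gives a rough sense of how much a single train/test split can move the score.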

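One very important point: you should NOT use random_state to try to improve your model's accuracy. Picking the seed that happens to score best would, by definition, overfit your model to your data, and it would not generalize as well to unseen data.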