TensorFlow Haskell linear regression diverges

Date: 2019-06-12 10:17:57

Tags: tensorflow haskell linear-regression gradient-descent

I have been playing with the [tensorflow haskell bindings]. However, I am having trouble getting the basic linear regression example from the readme to work. The task seems simple enough: learn y = 2x + 3, i.e. simple linear regression, using gradient descent. I have created a github repo containing a runnable example (using stack + nix), but here is the gist:

-- | compute simple linear regression, using gradient descent on tensorflow
simpleLinearRegression' :: Float -> [Float] -> [Float] -> IO (Float, Float)
simpleLinearRegression' learningRate x y =
    TFL.withEventWriter "test.log" $ \eventWriter -> TF.runSession $ do
        let x' = TF.vector x
            y' = TF.vector y
        b0 <- TF.initializedVariable 0
        b1 <- TF.initializedVariable 0

        let yHat = (x' * TF.readValue b1) + TF.readValue b0
            loss = TFC.square $ yHat - y'

        TFL.histogramSummary "losses" loss
        TFL.scalarSummary "error" $ TF.reduceSum loss
        TFL.scalarSummary "intercept" $ TF.readValue b0
        TFL.scalarSummary "weight" $ TF.readValue b1

        trainStep <- TF.minimizeWith (TF.gradientDescent learningRate)
                                     loss
                                     [b0, b1]
        summaryT <- TFL.mergeAllSummaries
        forM_ ([1 .. iterations] :: [Int64]) $ \step -> do
            if step `mod` logEveryNth == 0
                then do
                   -- TF.run_ trainStep
                    ((), summaryBytes) <- TF.run (trainStep, summaryT)
                    (TF.Scalar beta0, TF.Scalar beta1) <- TF.run
                        (TF.readValue b0, TF.readValue b1)
                    -- liftIO $ putStrLn $ "Y  = " ++ show beta1 ++ "X + " ++ show beta0
                    let summary = decodeMessageOrDie (TF.unScalar summaryBytes)
                    TFL.logSummary eventWriter step summary
                else TF.run_ trainStep

        (TF.Scalar b0', TF.Scalar b1') <- TF.run (TF.readValue b0, TF.readValue b1)
        return (b0', b1')

This is essentially the code from the readme, except that I turned learningRate into a parameter and added some logging for TensorBoard (which did not help me understand the problem).
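To sanity-check the behavior outside of TensorFlow, the same iteration can be written in plain Haskell. This is my own reference sketch, not part of the bindings: as far as I can tell, because loss above is a vector, the graph effectively minimizes the *summed* squared error, so that is what this sketch implements.

```haskell
-- | Plain-Haskell sketch of the same iteration (my own reference
-- implementation, not from the bindings): gradient descent on the
-- summed squared error  sum_i ((b1*x_i + b0) - y_i)^2.
slrPure :: Float -> Int -> [Float] -> [Float] -> (Float, Float)
slrPure rate steps xs ys = go steps 0 0
  where
    go :: Int -> Float -> Float -> (Float, Float)
    go 0 b0 b1 = (b0, b1)
    go k b0 b1 =
        let residuals = zipWith (\x y -> (b1 * x + b0) - y) xs ys
            gradB0    = 2 * sum residuals                   -- d loss / d b0
            gradB1    = 2 * sum (zipWith (*) residuals xs)  -- d loss / d b1
        in  go (k - 1) (b0 - rate * gradB0) (b1 - rate * gradB1)
```

With rate = 0.005 and enough steps on x = [1..6] this settles near (3, 2); with rate = 0.01 on seven equidistant points over the same range it blows up to NaN, mirroring the PASS/FAIL pattern of the test suite below.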

There is a small test suite exercising the different cases:


linearRegressionSpec :: Spec
linearRegressionSpec = do
    -- n = 6 vs n = 7 on same x range: PASS vs FAIL (beta0, beta1: NaN)
    linearRegressionTest     0.01 3 2 $ equidist 6 1 6
    linearRegressionTest     0.01 3 2 $ equidist 7 1 6

    -- n = 6, larger x range: PASS vs FAIL
    linearRegressionTest     0.01 3 2 $ equidist 6 1 6
    linearRegressionTest     0.01 3 2 $ equidist 6 1 7

    -- n = 12 vs n = 13: PASS vs FAIL (beta0, beta1: NaN) (reduced learning rate)
    linearRegressionTest     0.005 3 2 $ equidist 12 1 6
    linearRegressionTest     0.005 3 2 $ equidist 13 1 6

    -- another one, different learning rate, but diverges with growing sample size.
    -- this is the learning rate used in the Readme.
    linearRegressionTest     0.001 3 2 $ equidist 26 1 10
    linearRegressionTest     0.001 3 2 $ equidist 27 1 10

    -- n = 99 vs n = 100, ranging from -1 to 1: PASS vs FAIL (beta1 estimate = 0)
    -- this one is different: the failing case does not diverge.
    linearRegressionTest     0.01 3 2 $ equidist 99  (-1) 1
    linearRegressionTest     0.01 3 2 $ equidist 100 (-1) 1
    linearRegressionTest     0.001 3 2 $ equidist 100 (-1) 1

    -- initial goal: fit linear regression on advertising data from ISLR, Chapter 3.1
    islrOLSSpec

-- | produce a list of n values equally distributed over the range (minX, maxX)
equidist :: Int -> Float -> Float -> [Float]
equidist n minX maxX =
    let n'  = fromIntegral $ n - 1
        f k = ((n' - k) * minX + k*maxX) / n'
    in f <$> [0 .. n']

roughlyEqual :: (Num a, Ord a, Fractional a) => a -> a -> Bool
roughlyEqual expected actual = 0.01 > abs (expected - actual)

-- switching between different implementations
-- fitFunction = Readme.fit
fitFunction = simpleLinearRegression'
-- fitFunction = simpleLinearRegressionMMH

linearRegressionTest :: Float -> Float -> Float -> [Float] -> Spec
linearRegressionTest learnRate beta0 beta1 xs = do
    let ys = (\x -> beta1*x + beta0) <$> xs
    it ("linear regression on one variable, n = "  ++
        show (length xs) ++ ", range (" ++ show (head xs) ++ ", " ++ show (last xs) ++ ")") $ do
            (beta0Hat, beta1Hat) <- fitFunction learnRate (fromList xs) (fromList ys)
            beta0Hat `shouldSatisfy` roughlyEqual beta0
            beta1Hat `shouldSatisfy` roughlyEqual beta1

What I take away from these tests:

  • Reducing the learning rate improves convergence
  • Increasing the sample size worsens convergence
  • Increasing the dispersion (range) of the input variable worsens convergence
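The last two bullets can be made concrete with a back-of-envelope calculation (my own analysis, standard gradient-descent theory rather than anything from the bindings): for the summed squared error the Hessian with respect to (b0, b1) is the constant matrix H = 2·[[n, Σx], [Σx, Σx²]], and plain gradient descent only contracts when learningRate < 2 / λmax(H). Both n and Σx² enter λmax directly, which would explain the sample-size and dispersion effects:

```haskell
-- | Back-of-envelope stability bound (my own analysis, not from the
-- bindings): largest eigenvalue of the constant 2x2 Hessian of the
-- summed squared error, and the largest step size for which plain
-- gradient descent still contracts.
stableRate :: [Float] -> Float
stableRate xs =
    let n      = fromIntegral (length xs)
        sx     = sum xs
        sx2    = sum (map (^ 2) xs)
        tr     = 2 * (n + sx2)            -- trace of H
        det    = 4 * (n * sx2 - sx * sx)  -- determinant of H
        lamMax = (tr + sqrt (tr * tr - 4 * det)) / 2
    in  2 / lamMax
```

For x = [1..6] this gives about 0.0104, so 0.01 still converges; for seven equidistant points over (1, 6) it gives about 0.0090, so 0.01 diverges. That matches the PASS/FAIL boundary of the first pair of test cases.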

However, this behavior puzzles me. I would not expect divergence to be such a big problem on data sets that, to me, look very small.

Questions:

  1. Is there anything wrong with the code?
  2. If not, how do you determine whether and when gradient descent is applicable?
  3. Are there mitigation strategies (e.g., standardizing the data)?
  4. While I understand the relationship between learning rate and convergence, I am surprised by the effect of sample size and input-variable range. Is there a formula to estimate a good learning rate from the input data?
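To clarify what I mean by standardizing in question 3 (a sketch only, not wired into the TensorFlow session above): shift and scale the inputs to zero mean and unit standard deviation before fitting, then map the estimated coefficients back afterwards.

```haskell
-- | What I mean by standardizing (sketch only, not wired into the
-- TensorFlow code above): rescale inputs to zero mean and unit
-- standard deviation. Assumes a non-constant input (sd /= 0).
standardize :: [Float] -> [Float]
standardize xs =
    let n  = fromIntegral (length xs)
        mu = sum xs / n
        sd = sqrt (sum (map (\x -> (x - mu) ^ 2) xs) / n)
    in  map (\x -> (x - mu) / sd) xs
```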

I started this investigation after trying to fit a simple linear regression on the advertising data from Introduction to Statistical Learning, Chapter 3.1. I can get one example (regressing sales on TV) to work with a learning rate of 0.0000001, which requires a very large number of steps.

0 answers:

No answers yet.