Spark MLlib predicts a weird number or NaN

Time: 2015-07-23 22:53:26

Tags: python apache-spark pyspark apache-spark-mllib gradient-descent

I am new to Apache Spark and am trying to use the machine learning library to predict some data. My dataset right now has only about 350 points. Here are 7 of those points:

"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289

Here is my code:

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

def parsePoint(line):
    # Convert every CSV field to a float; a list (not a lazy map) is needed for pop()
    split = [sanitize(value) for value in line.split(',')]
    # The second-to-last column is the label; the remaining columns are the features
    rev = split.pop(-2)
    return LabeledPoint(rev, split)

def sanitize(value):
    # Strip the surrounding double quotes before converting
    return float(value.strip('"'))

# textFile is an RDD of CSV lines, loaded elsewhere
parsedData = textFile.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData, iterations=10)

print(model.predict(parsedData.first().features))

The prediction is something totally crazy, like -6.92840330273e+136. If I don't set iterations in train(), I get NaN as the result. What am I doing wrong? Is it my dataset (its size, maybe?) or my configuration?
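To make it easier to see what the parsing step produces, here is a small sketch that applies the same logic to the first sample row without Spark. The helper name `parse_fields` is made up for this illustration; it returns a plain tuple instead of a `LabeledPoint`.

```python
def sanitize(value):
    # Strip the surrounding double quotes and convert to float
    return float(value.strip('"'))


def parse_fields(line):
    # Hypothetical Spark-free stand-in for parsePoint: returns (label, features)
    fields = [sanitize(value) for value in line.split(',')]
    label = fields.pop(-2)  # second-to-last column becomes the label
    return label, fields


label, features = parse_fields('"365","4",41401.387,5330569')
print(label)     # 41401.387
print(features)  # [365.0, 4.0, 5330569.0]
```

So the label is 41401.387 and the model is trained on three features whose magnitudes range from single digits to millions.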

1 Answer:

Answer 0 (score: 7)

The problem is that LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize the weight vector of your linear model. SGD is very sensitive to the provided stepSize, which is used to update the intermediate solution.

What SGD does is calculate the gradient g of the cost function, given a sample of the input points and the current weights w. To update the weights w, you move a certain distance in the opposite direction of g. That distance is your step size s:

w(i+1) = w(i) - s * g

Since you did not provide an explicit step size, MLlib assumes stepSize = 1. This does not seem to work for your use case. I recommend trying different step sizes, usually lower values, to see how LinearRegressionWithSGD behaves:

# In PySpark the keyword arguments are named iterations and step
LinearRegressionWithSGD.train(parsedData, iterations=10, step=0.001)
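As a rough illustration of the update rule above (not MLlib's actual implementation), here is a minimal pure-Python sketch that runs full-batch gradient descent on a one-weight model y = w * x. The toy data, the helper names `gradient` and `descend`, and the step values are all made up for the example. It only shows that the same rule blows up when the step is too large and settles when the step is smaller.

```python
def gradient(w, xs, ys):
    # Gradient of the mean squared error of the model y = w * x with respect to w
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def descend(xs, ys, step, iterations, w=0.0):
    # Repeatedly apply w(i+1) = w(i) - s * g
    for _ in range(iterations):
        w = w - step * gradient(w, xs, ys)
    return w


# Toy data generated by y = 2 * x, so the ideal weight is 2
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_large = descend(xs, ys, step=1.0, iterations=10)     # overshoots further on every update
w_small = descend(xs, ys, step=0.01, iterations=1000)  # approaches 2

print(w_large)
print(w_small)
```

With the large step, each update overshoots the minimum by more than the previous error, so the weight grows in magnitude on every iteration. Given enough iterations this kind of growth can overflow to inf or NaN, which is consistent with the symptoms in the question.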