Why are the data types wrong when the inferSchema option is set to true in Spark Scala?

Time: 2018-09-08 08:48:54

Tags: scala apache-spark

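I am reading the USA_Housing.csv file, whose columns are (Avg Area Income, Avg Area House Age, Avg Area Number of Rooms, Avg Area Number of Bedrooms, Area Population, Price, Address). All columns are numeric except Address. When I read the data this way: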

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    val data = spark.read.option("header","true").option("inferSchema","true").format("csv").load("USA_Housing.csv")
    data.printSchema()

The output of printSchema is:

 |-- Avg Area Income: string (nullable = true)
 |-- Avg Area House Age: string (nullable = true)
 |-- Avg Area Number of Rooms: double (nullable = true)
 |-- Avg Area Number of Bedrooms: double (nullable = true)
 |-- Area Population: double (nullable = true)
 |-- Price: double (nullable = true)
 |-- Address: string (nullable = true)

Avg Area Income and Avg Area House Age are both inferred as string, but in the csv file they are actually doubles.

When I open the data in Atom, it looks like this:

Avg Area Income,Avg Area House Age,Avg Area Number of Rooms,Avg Area Number of Bedrooms,Area Population,Price,Address
79545.45857431678,5.682861321615587,7.009188142792237,4.09,23086.800502686456,1059033.5578701235,"208 Michael Ferry Apt. 674
Laurabury, NE 37010-5101"
79248.64245482568,6.0028998082752425,6.730821019094919,3.09,40173.07217364482,1505890.91484695,"188 Johnson Views Suite 079
Lake Kathleen, CA 48958"

2 Answers:

Answer 0: (score: 2)

Setting the multiLine option to true should work.
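A minimal sketch of what that could look like, not the answerer's original code. To stay self-contained it writes a two-column sample (an assumption; the real file has seven columns) to a temporary file instead of reading USA_Housing.csv, and it sets a local master, which is also an assumption. The `multiLine` option of the CSV reader is available from Spark 2.2 onward. In spark-shell, paste it with `:paste` so the chained calls are read as one statement.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell the existing session is reused.
val spark = SparkSession.builder().master("local[*]").appName("multiline-csv").getOrCreate()

// Hypothetical two-column sample: a quoted Address field that spans two lines,
// mirroring the structure of the rows shown in the question.
val sample =
  "Avg Area Income,Address\n" +
  "79545.45857431678,\"208 Michael Ferry Apt. 674\nLaurabury, NE 37010-5101\"\n" +
  "79248.64245482568,\"188 Johnson Views Suite 079\nLake Kathleen, CA 48958\"\n"
val samplePath = Files.createTempFile("housing_sample", ".csv")
Files.write(samplePath, sample.getBytes(StandardCharsets.UTF_8))

val data = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true") // let a quoted field continue across line breaks
  .csv(samplePath.toUri.toString)

data.printSchema()
```

For the real file, the same three options would be applied to `load("USA_Housing.csv")`.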

Answer 1: (score: 0)

The csv (from Kaggle) is badly formatted: the Address column contains line breaks. As a result, the first column is actually parsed as:

+------------------+
|               _c0|
+------------------+
| 79545.45857431678|
|         Laurabury|
| 79248.64245482568|
|     Lake Kathleen|
|61287.067178656784|
|        Danieltown|
| 63345.24004622798|
|     FPO AP 44820"|
|59982.197225708034|
|     FPO AE 09386"|

The first column therefore mixes numbers with text fragments of the addresses, so Spark infers it as string.
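The effect can be reproduced on a small scale. The sketch below is an illustration under assumptions, not taken from the answer: it uses a hypothetical two-column sample written to a temporary file and a local master, and it reads the file with the default line-by-line parsing, so the second half of each address starts a new record.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StringType

// Assumption: a local session; in spark-shell the existing session is reused.
val spark = SparkSession.builder().master("local[*]").appName("broken-csv").getOrCreate()

// Hypothetical sample with a line break inside the quoted Address field.
val brokenSample =
  "Avg Area Income,Address\n" +
  "79545.45857431678,\"208 Michael Ferry Apt. 674\nLaurabury, NE 37010-5101\"\n" +
  "79248.64245482568,\"188 Johnson Views Suite 079\nLake Kathleen, CA 48958\"\n"
val brokenPath = Files.createTempFile("housing_broken", ".csv")
Files.write(brokenPath, brokenSample.getBytes(StandardCharsets.UTF_8))

// Default reader settings: every physical line is treated as its own record.
val lineByLine = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(brokenPath.toUri.toString)

// The first column now holds both numbers and city names.
lineByLine.select("Avg Area Income").show(false)

// With mixed content, schema inference falls back to string for that column.
val inferredType = lineByLine.schema("Avg Area Income").dataType
println(inferredType == StringType)
```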