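Loading a data file delimited by 3 spaces with Spark's CSV reader in Java

Time: 2017-04-28 11:13:16

Tags: java csv apache-spark spark-dataframe

I have a data file of numeric values that I am trying to read in. The data looks like this: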

   1   6   4  12   5   5   3   4   1  67   3   2   1   2   1   0   0   1   0   0   1   0   0   1   1 
   2  48   2  60   1   3   2   2   1  22   3   1   1   1   1   0   0   1   0   0   1   0   0   1   2 

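It is separated by 3 spaces. I would like to use it in a Spark DataFrame.

I am struggling to parse it; Spark seems to read each line as one big string.

I have tried the following: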

Dataset<Row> df = spark.read().format("com.databricks.spark.csv")
            .option("header", "false")
            .option("delimter", "\t")
            .load(csvFile);
    df.show(5);

Also:

.option("delimter", "   ") // leads to java error that Delimter cant take more than one character

I also tried .option("sep", "\t") instead of "delimter".

Here is my full code:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CreditRiskML {

static SparkSession spark = SparkSession.builder()
        .appName("Credit Risk ML")
        .master("local[*]")
        .config("spark.sql.warehouse.dir", "E:/Exp/")
        .getOrCreate();

public static double parseDouble(String str){
    return Double.parseDouble(str);
}



public static void main(String[] args){

    String csvFile = "input\\credit.data";
    Dataset<Row> df = spark.read().format("com.databricks.spark.csv")
            .option("header", "false")
            .option("delimter", "\t")
            .option("sep", "\t")
            .load(csvFile);
    df.show(5);

    //create RDD of type Credit

    JavaRDD<Credit> creditRdd = df.toJavaRDD().map(new Function<Row, Credit>() {
        @Override
        public Credit call(Row r) throws Exception {
            return new Credit(parseDouble(r.getString(0)), parseDouble(r.getString(1)) - 1,
                    parseDouble(r.getString(2)), parseDouble(r.getString(3)), parseDouble(r.getString(4)),
                    parseDouble(r.getString(5)), parseDouble(r.getString(6)) - 1, parseDouble(r.getString(7)) - 1,
                    parseDouble(r.getString(8)), parseDouble(r.getString(9)) - 1, parseDouble(r.getString(10)) - 1,
                    parseDouble(r.getString(11)) - 1, parseDouble(r.getString(12)) - 1,
                    parseDouble(r.getString(13)), parseDouble(r.getString(14)) - 1,
                    parseDouble(r.getString(15)) - 1, parseDouble(r.getString(16)) - 1,
                    parseDouble(r.getString(17)) - 1, parseDouble(r.getString(18)) - 1,
                    parseDouble(r.getString(19)) - 1, parseDouble(r.getString(20)) - 1);
        }
    });

    //Create a dataset of type Row from the RDD of type Credit
    Dataset<Row> creditData = spark.sqlContext().createDataFrame(creditRdd, Credit.class);

    creditData.show(5);

 }
}

Error message:

java.lang.NumberFormatException: For input string: "1   6   4  12   5   5   3   4   1  67   3   2   1   2   1   0   0   1   0   0   1   0   0   1   1"

What is the best way to solve this? Any help is greatly appreciated.

P.S. Here is the Credit class:

public class Credit {
    private double creditability;
    private double balance;
    private double duration;
    private double history;
    private double purpose;
    private double amount;
    private double savings;
    private double employment;
    private double instPercent;
    private double sexMarried;
    private double guarantors;
    private double residenceDuration;
    private double assets;
    private double age;
    private double concCredit;
    private double apartment;
    private double credits;
    private double occupation;
    private double dependents;
    private double hasPhone;
    private double foreign;

    public Credit(double creditability, double balance, double duration, double history, double purpose, double amount,
                  double savings, double employment, double instPercent, double sexMarried, double guarantors,
                  double residenceDuration, double assets, double age, double concCredit, double apartment, double credits,
                  double occupation, double dependents, double hasPhone, double foreign) {
        super();
        this.creditability = creditability;
        this.balance = balance;
        this.duration = duration;
        this.history = history;
        this.purpose = purpose;
        this.amount = amount;
        this.savings = savings;
        this.employment = employment;
        this.instPercent = instPercent;
        this.sexMarried = sexMarried;
        this.guarantors = guarantors;
        this.residenceDuration = residenceDuration;
        this.assets = assets;
        this.age = age;
        this.concCredit = concCredit;
        this.apartment = apartment;
        this.credits = credits;
        this.occupation = occupation;
        this.dependents = dependents;
        this.hasPhone = hasPhone;
        this.foreign = foreign;
    }

    public double getCreditability() {
        return creditability;
    }

    public void setCreditability(double creditability) {
        this.creditability = creditability;
    }

    public double getBalance() {
        return balance;
    }

    public void setBalance(double balance) {
        this.balance = balance;
    }

    public double getDuration() {
        return duration;
    }

    public void setDuration(double duration) {
        this.duration = duration;
    }

    public double getHistory() {
        return history;
    }

    public void setHistory(double history) {
        this.history = history;
    }

    public double getPurpose() {
        return purpose;
    }

    public void setPurpose(double purpose) {
        this.purpose = purpose;
    }

    public double getAmount() {
        return amount;
    }

    public void setAmount(double amount) {
        this.amount = amount;
    }

    public double getSavings() {
        return savings;
    }

    public void setSavings(double savings) {
        this.savings = savings;
    }

    public double getEmployment() {
        return employment;
    }

    public void setEmployment(double employment) {
        this.employment = employment;
    }

    public double getInstPercent() {
        return instPercent;
    }

    public void setInstPercent(double instPercent) {
        this.instPercent = instPercent;
    }

    public double getSexMarried() {
        return sexMarried;
    }

    public void setSexMarried(double sexMarried) {
        this.sexMarried = sexMarried;
    }

    public double getGuarantors() {
        return guarantors;
    }

    public void setGuarantors(double guarantors) {
        this.guarantors = guarantors;
    }

    public double getResidenceDuration() {
        return residenceDuration;
    }

    public void setResidenceDuration(double residenceDuration) {
        this.residenceDuration = residenceDuration;
    }

    public double getAssets() {
        return assets;
    }

    public void setAssets(double assets) {
        this.assets = assets;
    }

    public double getAge() {
        return age;
    }

    public void setAge(double age) {
        this.age = age;
    }

    public double getConcCredit() {
        return concCredit;
    }

    public void setConcCredit(double concCredit) {
        this.concCredit = concCredit;
    }

    public double getApartment() {
        return apartment;
    }

    public void setApartment(double apartment) {
        this.apartment = apartment;
    }

    public double getCredits() {
        return credits;
    }

    public void setCredits(double credits) {
        this.credits = credits;
    }

    public double getOccupation() {
        return occupation;
    }

    public void setOccupation(double occupation) {
        this.occupation = occupation;
    }

    public double getDependents() {
        return dependents;
    }

    public void setDependents(double dependents) {
        this.dependents = dependents;
    }

    public double getHasPhone() {
        return hasPhone;
    }

    public void setHasPhone(double hasPhone) {
        this.hasPhone = hasPhone;
    }

    public double getForeign() {
        return foreign;
    }

    public void setForeign(double foreign) {
        this.foreign = foreign;
    }
}

3 Answers:

Answer 0: (score: 1)

One way to solve this is to use java.util.Scanner. Its default delimiter is whitespace, so you do not need to specify a delimiter.

import java.util.Scanner;

String s = "1   0   2   0";
// The default delimiter matches any run of whitespace
Scanner scanner = new Scanner(s);

while (scanner.hasNext()) {
  System.out.println(scanner.next());
}

The output will be:

1
0
2
0

This works regardless of the number of spaces in the given String.
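
Applied to one row of the data file in the question, the same idea can be sketched as a small helper that collects the tokens into a double[]. The class name ScannerRowParser and the method parseRow are hypothetical names chosen for illustration. Tokens are converted with Double.parseDouble rather than Scanner.nextDouble so that parsing does not depend on the default locale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper (not part of the original answer): turns one
// whitespace-separated line into numeric values using java.util.Scanner.
final class ScannerRowParser {

    static double[] parseRow(String line) {
        List<Double> values = new ArrayList<>();
        // Scanner's default delimiter is any run of whitespace, so leading,
        // trailing and repeated spaces are all skipped.
        try (Scanner scanner = new Scanner(line)) {
            while (scanner.hasNext()) {
                values.add(Double.parseDouble(scanner.next()));
            }
        }
        // Unbox the collected values into a primitive array.
        double[] row = new double[values.size()];
        for (int i = 0; i < row.length; i++) {
            row[i] = values.get(i);
        }
        return row;
    }

    public static void main(String[] args) {
        double[] row = parseRow("   1   6   4  12   5 ");
        System.out.println(Arrays.toString(row)); // prints [1.0, 6.0, 4.0, 12.0, 5.0]
    }
}
```

A blank line yields an empty array rather than an exception, which may or may not be the behaviour you want for malformed input.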

Answer 1: (score: 0)

You can use an RDD:

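import org.apache.spark.{SparkConf, SparkContext}

val conf: SparkConf = new SparkConf().setMaster("local[2]").setAppName("test")
val sc: SparkContext = new SparkContext(conf)

// Trim each line, then split on any run of whitespace, since the columns are padded with a varying number of spaces
val rdd = sc.textFile(csvFile).map(_.trim.split("\\s+").toList)

rdd.foreach(println)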
  

Output:

List(1, 6, 4, 12, 5, 5, 3, 4, 1, 67, 3, 2, 1, 2, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1)
List(2, 48, 2, 60, 1, 3, 2, 2, 1, 22, 3, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 2)
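
Since the question is about Java, the per-line splitting step can also be sketched in plain Java. This is only an illustration with no Spark dependency: RowSplitter and toDoubles are hypothetical names, and inside a Spark job a method like this could presumably be called from the map function over the lines of the text file. The sketch also shows why a fixed three-space split is fragile for this file: the columns appear to be right-aligned, so two-digit values such as 12 are preceded by only two spaces.

```java
import java.util.Arrays;

// Hypothetical helper (plain Java, no Spark): splits one padded line into doubles.
final class RowSplitter {

    // Trims the ends, then splits on any run of whitespace.
    static double[] toDoubles(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new double[0];
        }
        return Arrays.stream(trimmed.split("\\s+"))
                .mapToDouble(Double::parseDouble)
                .toArray();
    }

    public static void main(String[] args) {
        String line = "   1   6   4  12 ";
        // A fixed three-space split keeps an empty leading token and leaves
        // "4" and "12" in the same token, which Double.parseDouble would reject.
        System.out.println(Arrays.toString(line.split("   ")));
        // Splitting on runs of whitespace yields one token per value.
        System.out.println(Arrays.toString(toDoubles(line)));
    }
}
```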

Answer 2: (score: 0)

You have a typo. It is delimiter, not delimter.