Converting .txt Spark output to .csv

Time: 2018-10-25 06:43:29

Tags: java apache-spark bigdata rdd apache-spark-dataset

Currently, I am getting the output of my Spark job in a .txt file. I am trying to convert it to .csv.

.txt output (Dataset<String>)

John MIT Bachelor ComputerScience Mike UB Master ComputerScience

.csv output

NAME,UNIV,DEGREE,COURSE
John,MIT,Bachelor,ComputerScience
Mike,UB,Master,ComputerScience

I tried collecting it into a list, but I am not sure how to convert it to .csv and add the header.

1 answer:

Answer 0 (score: 0)

Here is a simple way to convert your txt output data into a data structure that can easily be written to a csv file.

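The basic idea is to use a data structure together with the number of headers/columns in order to parse the sets of entries out of the single linear txt output.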

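Have a look at the code comments. Every "TODO 4 U" marks work that is left for you, mainly because I cannot really guess what you need to do at those positions in the code (for example, how to get the headers).

This is just a main method that works as it is. You may want to understand what it does and change it so that the code fits your requirements. Input and output are just Strings that you have to create, receive or process yourself.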

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// the class name is arbitrary; it only wraps the main method so that the code compiles
public class TxtToCsv {

    public static void main(String[] args) {

        // TODO 4 U: get the values for the header somehow
        String headerLine = "NAME, UNIV, DEGREE, COURSE";

        // TODO 4 U: read the txt output
        String txtOutput = "John MIT Bachelor ComputerScience Mike UB Master ComputerScience";

        /*
         * then split the header line
         * (or do anything similar, I don't know where your header comes from)
         */
        String[] headers = headerLine.split(", ");

        // store the amount of headers, which is the amount of columns
        int amountOfColumns = headers.length;

        // split txt output data by space
        String[] data = txtOutput.split(" ");

        /*
         * declare a data structure that stores lists of Strings,
         * each one is representing a line of the csv file
         */
        Map<Integer, List<String>> linesForCsv = new TreeMap<>();

        // create a list of Strings containing the headers and put it into the data structure
        linesForCsv.put(0, Arrays.asList(headers));

        // declare a line counter for the csv file (line 0 is the header row)
        int currentLine = 0;
        // go through the txt output data in order to get the lines for the csv file
        for (int i = 0; i < data.length; i++) {
            // every time the amount of columns is reached, start a new line-list
            if (i % amountOfColumns == 0) {
                currentLine++;
                linesForCsv.put(currentLine, new ArrayList<>());
            }
            // store the value in the current line-list
            linesForCsv.get(currentLine).add(data[i]);
        }

        // print the lines stored in the map
        // TODO 4 U: write this to a csv file instead of just printing it to the console
        linesForCsv.forEach((lineNumber, line) -> {
            System.out.println("Line " + lineNumber + ": " + String.join(",", line));
        });
    }
}
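
The last TODO in the code above leaves the file writing open. A minimal sketch of that step could use Files.write from java.nio.file, assuming that no value contains a comma, a quote or a line break (so no CSV escaping is needed). The class name CsvFileWriter, the helper names toCsvLines and writeCsv, and the path output.csv are placeholders that do not come from the answer itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// hypothetical helper class, not part of the original answer
public class CsvFileWriter {

    // joins every row with commas; assumes values need no quoting or escaping
    static List<String> toCsvLines(List<List<String>> rows) {
        List<String> csvLines = new ArrayList<>();
        for (List<String> row : rows) {
            csvLines.add(String.join(",", row));
        }
        return csvLines;
    }

    // writes one csv line per row, creating the file or overwriting an existing one
    static void writeCsv(Path target, List<List<String>> rows) throws IOException {
        Files.write(target, toCsvLines(rows), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // the header row comes first, followed by the data rows
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("NAME", "UNIV", "DEGREE", "COURSE"),
                Arrays.asList("John", "MIT", "Bachelor", "ComputerScience"),
                Arrays.asList("Mike", "UB", "Master", "ComputerScience"));

        // "output.csv" is a placeholder path, change it as needed
        writeCsv(Paths.get("output.csv"), rows);
    }
}
```

With the map from the main method above, the rows could be passed as new ArrayList<>(linesForCsv.values()), since a TreeMap returns its values in key order and the header row is stored under key 0.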