Sparklyr ignores line separators

Asked: 2017-10-13 19:01:14

Tags: r csv sparklyr

I am trying to read a ~2 GB .csv file (about 5 million lines) with sparklyr:

bigcsvspark <- spark_read_csv(sc, "bigtxt", "path", 
                              delimiter = "!",
                              infer_schema = FALSE,
                              memory = TRUE,
                              overwrite = TRUE,
                              columns = list(
                                  # columns suppressed; all declared as 'character'
                              ))

I received the following error:

Job aborted due to stage failure: Task 9 in stage 15.0 failed 4 times, most recent failure: Lost task 9.3 in stage 15.0 (TID 3963,
10.1.4.16):  com.univocity.parsers.common.TextParsingException: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000). Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'. Parsed content: ---lines of my csv---[\n]
---begin of a splited line --- Parser Configuration: CsvParserSettings:     ... default settings ...

CsvFormat:
    Comment character=\0
    Field delimiter=!
    Line separator (normalized)=\n
    Line separator sequence=\n
    Quote character="
    Quote escape character=\
    Quote escape escape character=null Internal state when error was thrown:
        line=10599, 
        column=6, 
        record=8221, 
        charIndex=4430464, 
        headers=[---SUPRESSED HEADER---], 
        content parsed=---more lines without the delimiter.---

As shown above, at some point the parser starts ignoring the line separator. The same file can be read in plain R without any problem, using read.csv with just the path and the delimiter.
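For reference, a minimal sketch of the plain-R read that works, assuming only the "path" placeholder and the "!" delimiter from the call above:

# plain-R baseline: parses the same file given just the path and the delimiter
bigcsv <- read.csv("path", sep = "!", stringsAsFactors = FALSE)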

1 Answer:

Answer 0 (score: 1)

It looks like the file is not actually valid CSV, and I wonder whether spark_read_text() would work better in this case. You should be able to bring all of the lines into Spark and then split them into fields in memory; that last part will be the trickiest.
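A minimal sketch of that approach, assuming a local connection, that spark_read_text() exposes each raw line in a column named "line", and that the "!" delimiter from the question applies; the output column names are placeholders:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Read every raw line into a single string column (assumed to be named "line");
# no CSV parsing happens at this stage, so quoting or field-length problems
# cannot derail the reader.
raw_lines <- spark_read_text(sc, "bigtxt_lines", "path")

# Split each line on "!" into an array column, then fan the array out into
# separate columns; col_a/col_b/col_c stand in for the real (suppressed) schema.
parsed <- raw_lines %>%
  ft_regex_tokenizer(input_col = "line", output_col = "fields",
                     pattern = "!", min_token_length = 0,  # keep empty fields
                     to_lower_case = FALSE) %>%
  sdf_separate_column("fields", into = c("col_a", "col_b", "col_c"))

The last step, turning an array of strings into properly typed columns, is the tricky part mentioned above; any casts from character to numeric or date types would still have to be done explicitly afterwards.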