Exception when reading a CSV in Spark

Date: 2016-08-04 10:20:46

Tags: python apache-spark pyspark spark-csv

I'm new to Spark. I have a CSV file with only 2 columns. The file is large (30 million rows). I'm trying to load it into a DataFrame using spark-csv_2.10:1.2.0.

I'm using the following code:

df = sqlContext.read.load('file:///path/file_third.csv', 
                           format='com.databricks.spark.csv', 
                           header='true', 
                           inferSchema='true')

I get the following error:

[Stage 2:>                                                         (0 + 8) / 10]16/08/04 15:32:57 ERROR Executor: Exception in task 1.0 in stage 2.0 (TID 3)
com.univocity.parsers.common.TextParsingException: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000). 
Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'. Parsed content:

The error is followed by the hint below. My guess is that the parser is not recognizing the line breaks.

Hint: Number of characters processed may have exceeded limit of 1000000 characters per column. Use settings.setMaxCharsPerColumn(int) to define the maximum number of characters a column can have
Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse
Parser Configuration: CsvParserSettings:
    Auto configuration enabled=true
    Autodetect column delimiter=false
    Autodetect quotes=false
    Column reordering enabled=true
    Empty value=null
    Escape unquoted values=false
    Header extraction enabled=null
    Headers=[1235187239212711042, 0006]
    Ignore leading whitespaces=false
    Ignore trailing whitespaces=false
    Input buffer size=128
    Input reading on separate thread=false
    Keep escape sequences=false
    Line separator detection enabled=false
    Maximum number of characters per column=1000000
    Maximum number of columns=20480
    Normalize escaped line separators=true
    Null value=
    Number of records to read=all
    Row processor=none
    RowProcessor error handler=null
    Selected fields=none
    Skip empty lines=true
    Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
    CsvFormat:
        Comment character=\0
        Field delimiter=,
        Line separator (normalized)=\n
        Line separator sequence=\n
        Quote character="
        Quote escape character=\
        Quote escape escape character=null

How can I fix this error?

1 Answer:

Answer 0 (score: 0)

Try setting the maxCharsPerColumn parameter to a higher value than the current limit of 1000000.
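As a rough sketch of what that could look like, the helper below repeats the call from the question and passes one extra reader option. It assumes the option is spelled `maxCharsPerColumn`, which is the name used by the built-in CSV source in Spark 2.x; the standalone spark-csv package may spell it differently, so check the documentation for your version. The helper name `load_csv` and the default of 10000000 are made up for illustration.

```python
def load_csv(sql_context, path, max_chars_per_column=10000000):
    """Load a CSV into a DataFrame with a raised per-column character limit.

    `sql_context` is expected to be an existing SQLContext (or SparkSession),
    such as the `sqlContext` object provided by the pyspark shell.
    """
    return sql_context.read.load(
        path,
        format='com.databricks.spark.csv',
        header='true',
        inferSchema='true',
        # Assumed option name; the limit reported in the error is 1000000.
        maxCharsPerColumn=str(max_chars_per_column),
    )
```

From the pyspark shell this would be called as `df = load_csv(sqlContext, 'file:///path/file_third.csv')`. Note that the error message also says line separators were found inside the parsed content. If a stray quote character is causing many rows to be read as a single field, raising the limit will probably only postpone the failure, and the quoting in the file would need to be checked as well.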