In Flink

Date: 2016-07-11 07:01:04

Tags: csv apache-flink

In Flink, parsing a CSV file with readCsvFile throws an exception when it reaches a field that contains embedded quotes, such as "Fazenda São José ""OB"" Airport":

org.apache.flink.api.common.io.ParseException: Line could not be parsed: '191,"SDOB","small_airport","Fazenda São José ""OB"" Airport",-21.425199508666992,-46.75429916381836,2585,"SA","BR","BR-SP","Tapiratiba","no","SDOB",,"SDOB",,,'

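I found in this mailing list thread and this JIRA issue that quotes inside a field are supposed to be escaped with the \ character, but I have no control over the data and cannot modify it. Is there a way to work around this?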

I also tried ignoreInvalidLines() (a less desirable solution), but it gave me the following error:

08:49:05,737 INFO  org.apache.flink.api.common.io.LocatableInputSplitAssigner    - Assigning remote split to host localhost
08:49:05,765 ERROR org.apache.flink.runtime.operators.BatchTask                  - Error in task code:  CHAIN DataSource (at main(Job.java:53) (org.apache.flink.api.java.io.TupleCsvInputFormat)) -> Map (Map at main(Job.java:54)) -> Combine(SUM(1), at main(Job.java:56) (2/8)
java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.flink.api.common.io.GenericCsvInputFormat.skipFields(GenericCsvInputFormat.java:443)
    at org.apache.flink.api.common.io.GenericCsvInputFormat.parseRecord(GenericCsvInputFormat.java:412)
    at org.apache.flink.api.java.io.CsvInputFormat.readRecord(CsvInputFormat.java:111)
    at org.apache.flink.api.common.io.DelimitedInputFormat.nextRecord(DelimitedInputFormat.java:454)
    at org.apache.flink.api.java.io.CsvInputFormat.nextRecord(CsvInputFormat.java:79)
    at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:176)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
    at java.lang.Thread.run(Thread.java:745)

This is my code:

DataSet<Tuple2<String, Integer>> csvInput = env.readCsvFile("resources/airports.csv")
            .ignoreFirstLine()
            .ignoreInvalidLines()
            .parseQuotedStrings('"')
            .includeFields("100000001")
            .types(String.class, String.class)
            .map((Tuple2<String, String> value) -> new Tuple2<>(value.f1, 1))
            .groupBy(0)
            .sum(1);

1 answer:

Answer 0 (score: 0)

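If you cannot change the input data, you should turn off quoted-string parsing, that is, remove the parseQuotedStrings() call. The parser will then simply look for the next field delimiter and return everything in between as a string, quotes included. You can then strip the leading and trailing quotes in a subsequent map operation.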
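
The quote-stripping step could look roughly like the sketch below. It is plain Java with no Flink dependency; the class name QuoteStripper and the method unquote are hypothetical names chosen for illustration, not part of the Flink API. It assumes the data doubles embedded quotes ("") as in the sample line from the question, and collapses them back to a single quote.

```java
public class QuoteStripper {

    // Hypothetical helper: removes one pair of enclosing double quotes, if present,
    // and turns doubled quotes ("") inside the field back into a single quote (").
    public static String unquote(String raw) {
        if (raw == null) {
            return null;
        }
        int len = raw.length();
        boolean enclosed = len >= 2 && raw.charAt(0) == '"' && raw.charAt(len - 1) == '"';
        if (!enclosed) {
            // Unquoted fields (e.g. numbers or empty fields) are returned unchanged.
            return raw;
        }
        return raw.substring(1, len - 1).replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        // Field value as it would arrive with quote parsing turned off.
        System.out.println(unquote("\"Fazenda São José \"\"OB\"\" Airport\""));
        // prints: Fazenda São José "OB" Airport
        System.out.println(unquote("\"BR\""));
        // prints: BR
    }
}
```

In the job from the question, this would presumably be called inside the existing map function, for example new Tuple2<>(QuoteStripper.unquote(value.f1), 1). One caveat: with quote parsing off, a quoted field that itself contains the field delimiter (a comma) would be split in the wrong place, so this approach only fits data where that does not occur.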