TSV parser in Java

Date: 2018-04-18 07:28:18

Tags: java regex csv

I am writing some Java code to read and parse a TSV file. I am looking for a regular expression that can split each line of the file, given that:

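  • Items are separated by tabs
  • Strings are enclosed in quotes
  • Numbers are not enclosed in quotes
  • Quoted strings can contain quotes, which are escaped by doubling them (i.e. "")
  • Strings can contain tabs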

Sample input lines:

"aaa"    123    "bbb"    "cc"    "ddd"
"aaa"    123    "bbb"    "cc"    "    6"
"ddd"    456    "eee"    "ff"    "       ""     "
"ddd"    456    "eee"    "ff"    "    "" aaa ""   "

* (Note the tabs inside the last three strings.)

My current regex is ("[^"]*"*|[^\t]+)+, but it fails on the last example (it breaks the line into smaller substrings).

1 answer:

Answer 0: (score: 0)

Let's solve this by splitting on the following pattern:

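\t(?=(?:[^"]*"[^"]*")*[^"]*$)

The lookahead lets a tab match only when an even number of quotes follows it up to the end of the line, so a tab inside a quoted string is not treated as a separator.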
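The condition tested by the lookahead can also be written as a plain loop, which may be easier to follow: count the quotes after a tab and accept the tab as a separator when the count is even. The sketch below is only an illustration; the class name QuoteParity and the method isSeparator are made up here and are not part of the original answer.

```java
public class QuoteParity {
    // Returns true when the character at pos is a tab followed by an even
    // number of double quotes, which is what the lookahead asserts.
    public static boolean isSeparator(String line, int pos) {
        int quotes = 0;
        for (int i = pos + 1; i < line.length(); i++) {
            if (line.charAt(i) == '"') {
                quotes++;
            }
        }
        return line.charAt(pos) == '\t' && quotes % 2 == 0;
    }

    public static void main(String[] args) {
        // Two fields: "ff" and a quoted string that itself contains a tab.
        String line = "\"ff\"\t\"\t6\"";
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == '\t') {
                System.out.println("tab at " + i + " is separator: " + isSeparator(line, i));
            }
        }
    }
}
```

Both the loop and the lookahead rescan the rest of the line for every tab, which is fine for short lines but grows quadratically with line length.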

Sample code (ideone demo):

import java.util.regex.Pattern;

public class Example {
  public static void main(String[] args) {
    // Fields are separated by \t. The last field of rows 2-4 also has tabs
    // inside the quotes (written here as single \t escapes, an assumption).
    String sourcestring = "\"aaa\"\t123\t\"bbb\"\t\"cc\"\t\"ddd\"\n"
               + "\"aaa\"\t123\t\"bbb\"\t\"cc\"\t\"\t6\"\n"
               + "\"ddd\"\t456\t\"eee\"\t\"ff\"\t\"\t\"\"\t\"\n"
               + "\"ddd\"\t456\t\"eee\"\t\"ff\"\t\"\t\"\" aaa \"\"\t\"";
    Pattern reLines = Pattern.compile("\\n");
    // Split on a tab only when an even number of quotes follows it,
    // i.e. when the tab is outside a quoted string.
    Pattern reTsv = Pattern.compile("\\t(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
    String[] lines = reLines.split(sourcestring);
    for (int linesIdx = 0; linesIdx < lines.length; linesIdx++) {
      String[] parts = reTsv.split(lines[linesIdx]);
      for (int partsIdx = 0; partsIdx < parts.length; partsIdx++) {
        // Tabs embedded in a quoted field are printed as whitespace.
        System.out.println("[" + partsIdx + "] = " + parts[partsIdx]);
      }
    }
  }
}

Output:

[0] = "aaa"
[1] = 123
[2] = "bbb"
[3] = "cc"
[4] = "ddd"
[0] = "aaa"
[1] = 123
[2] = "bbb"
[3] = "cc"
[4] = "  6"
[0] = "ddd"
[1] = 456
[2] = "eee"
[3] = "ff"
[4] = "         ""     "
[0] = "ddd"
[1] = 456
[2] = "eee"
[3] = "ff"
[4] = " "" aaa ""   "
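
As the output shows, the split leaves quoted fields wrapped in their quotes, with embedded quotes still doubled. If the raw values are needed, a post-processing step along these lines could strip the outer quotes and collapse "" back to ". This is a sketch based only on the format rules from the question; FieldDecoder and decode are hypothetical names, not something from the original answer.

```java
public class FieldDecoder {
    // Strips the surrounding quotes from a quoted field and turns each
    // doubled quote ("") back into a single quote. Unquoted fields
    // (the numbers in the question) are returned unchanged.
    public static String decode(String field) {
        boolean quoted = field.length() >= 2
                && field.startsWith("\"")
                && field.endsWith("\"");
        if (!quoted) {
            return field;
        }
        String inner = field.substring(1, field.length() - 1);
        return inner.replace("\"\"", "\"");
    }

    public static void main(String[] args) {
        // Example fields as they come out of the split step.
        String[] parts = { "\"ddd\"", "456", "\"\t\"\" aaa \"\"\t\"" };
        for (String part : parts) {
            System.out.println("[" + decode(part) + "]");
        }
    }
}
```

Numbers come back as strings here; converting them with Integer.parseInt or similar is left to the caller, since the question does not say which numeric types the file contains.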