Parsing a tab-delimited file

Time: 2016-07-20 01:21:39

Tags: java regex

I am trying to parse a TSV file from IMDB:

$hutter             Battle of the Sexes (2017)  (as $hutter Boy)  [Bobby Riggs Fan]  <10>
                    NVTION: The Star Nation Rapumentary (2016)  (as $hutter Boy)  [Himself]  <1>
                    Secret in Their Eyes (2015)  (uncredited)  [2002 Dodger Fan]
                    Steve Jobs (2015)  (uncredited)  [1988 Opera House Patron]
                    Straight Outta Compton (2015)  (uncredited)  [Club Patron/Dopeman]



$lim, Bee Moe       Fatherhood 101 (2013)  (as Brandon Moore)  [Himself - President, Passages]
                    For Thy Love 2 (2009)  [Thug 1]
                    Night of the Jackals (2009) (V)  [Trooth]
                    "Idle Talk" (2013)  (as Brandon Moore)  [Himself]
                    "Idle Times" (2012) {(#1.1)}  (as Brandon Moore)  [Detective Ryan Turner]


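As you can see, some lines start with a tab and some do not. I want a map with the actor's name as the key and a list of movies as the value. The actor's name is separated from the first movie by one or more tabs.

My code: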

        while ((line = reader.readLine()) != null) {

            Matcher matcher = headerPattern.matcher(line);
            boolean headerMatchFound = matcher.matches();

            if (headerMatchFound) {
                Logger.getLogger(ActorListParser.class.getName()).log(Level.INFO, "Header for actor list found");

                String newline;

                reader.readLine();

                while ((newline = reader.readLine()) != null) {
                    String[] fullLine = null;

                    String actor;
                    String title;

                    Pattern startsWithTab = Pattern.compile("^\t.*");
                    Matcher tab = startsWithTab.matcher(newline);
                    boolean tabStartMatcher = tab.matches();

                    if (!tabStartMatcher) {

                        fullLine = newline.split("\t.*");

                        System.out.println("Actor: " + fullLine[0] +
                                "Movie: " + fullLine[1]);

                    } // code to handle lines that start with a tab will go here
                }
            }

        }

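The way I am doing this only works for a few lines before I get an ArrayIndexOutOfBoundsException. How can I parse these lines and split each one into at most 2 strings when the fields are separated by one or more tabs?

1 Answer:

Answer 0: (score: 1)

Parsing tab- or comma-delimited data files involves some subtleties around quoting and escaping.

To save yourself a lot of work, frustration, and headaches, you should consider using one of the existing CSV parsing libraries, such as OpenCSV or Apache Commons CSV.

Posted as an answer rather than a comment because the OP did not state a reason for reinventing the wheel, and some tasks really have been solved "once and for all".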
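
If a library is not an option, the split itself can be done with plain Java. The regex `"\t.*"` in the question matches everything from the first tab to the end of the line, so `split` returns only the name and `fullLine[1]` is out of bounds. Below is a minimal sketch, not a definitive implementation. It assumes that fields are separated by runs of real tab characters, that continuation lines start with a tab, and that the header has already been skipped. The class name `ActorListSketch` and the method `parse` are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original question's code.
class ActorListSketch {

    /** Groups each movie entry under the most recently seen actor name. */
    static Map<String, List<String>> parse(BufferedReader reader) {
        Map<String, List<String>> moviesByActor = new LinkedHashMap<>();
        String currentActor = null;
        String line;
        try {
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // blank lines separate actor blocks
                }
                // Split on the first run of tabs only; the limit of 2
                // guarantees at most two strings per line.
                String[] parts = line.split("\t+", 2);
                if (parts.length < 2) {
                    continue; // no tab at all: not an actor/movie line
                }
                if (!parts[0].isEmpty()) {
                    currentActor = parts[0]; // line begins with a name
                }
                String movie = parts[1].trim();
                if (currentActor == null || movie.isEmpty()) {
                    continue; // nothing to attach the entry to
                }
                moviesByActor
                        .computeIfAbsent(currentActor, k -> new ArrayList<>())
                        .add(movie);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return moviesByActor;
    }
}
```

A line that starts with a tab produces an empty first element, which is how the sketch tells continuation lines apart from lines that introduce a new actor. Calling `ActorListSketch.parse(reader)` after the header has been consumed would return a map from each name to its movie entries, in file order.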