Read multiple lines from a file and combine them into one line based on start and end patterns?

Time: 2018-03-07 18:38:08

Tags: java regex algorithm

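I'm writing a program to clean up the data in a text file I have. The file contains text messages between me and my friends, so its format looks like this: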

06/07/2016, 21:44 - Friend 1: Sure. 

So there's usually a date set by the Commissioners which serves as a deadline. If you haven't applied for tax back before that date, you won't be eligible for a refund.
06/07/2016, 21:44 - Friend 1: Any further questions?
06/07/2016, 21:45 - Friend 1: Just to clarify, one must apply before, not after, said date.
06/07/2016, 21:42 - Friend 2: Still getting my head around this. Could you explain the deadline thing once more
06/07/2016, 21:46 - Friend 3: All I can say is that I've some fantastic friends that will always endeavour me!
06/07/2016, 21:47 - Friend 3: I truly appreciate this
28/12/2016, 19:04 - Friend 4: Woo party not in mine and eds 
28/12/2016, 19:14 - Friend 1: You going?
Steve?
28/12/2016, 19:15 - Friend 5: got ppl renting in house til end of January

All of this is stored in a .txt file. I want to clean up the data and convert it into a .csv file with the columns Date, Time, Name, Text.

I'm trying to iterate over the file, split each line, and write it to a new CSV file. For example, these lines from the file:

06/07/2016, 21:44 - Friend 1: Sure. 

So there's usually a date set by the Commissioners which serves as a deadline. If you haven't applied for tax back before that date, you won't be eligible for a refund.

would be combined into a single line, like this:

06/07/2016, 21:44 - Friend 1: Sure. So there's usually a date set by the Commissioners which serves as a deadline. If you haven't applied for tax back before that date, you won't be eligible for a refund.

I know that every new message starts with the same date pattern, in dd/mm/yyyy format, so I use that to detect when a new message begins.

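For now I'm not writing to the CSV file; I only want to reformat the text into the correct shape before processing it further. But for the sample input given above, it outputs: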

06/07/2016, 21:44 - Friend 1: Sure.   So there's usually a date set by the Commissioners which serves as a deadline. If you haven't applied for tax back before that date, you won't be eligible for a refund.
06/07/2016, 21:44 - Friend 1: Any further questions?
06/07/2016, 21:45 - Friend 1: Just to clarify, one must apply before, not after, said date.
06/07/2016, 21:42 - Friend 2: Still getting my head around this. Could you explain the deadline thing once more
06/07/2016, 21:46 - Friend 3: All I can say is that I've some fantastic friends that will always endeavour me!
06/07/2016, 21:47 - Friend 3: I truly appreciate this
28/12/2016, 19:04 - Friend 4: Woo party not in mine and eds 🥂🎉🎉
28/12/2016, 19:14 - Friend 1: You going?

Steve?
28/12/2016, 19:15 - Friend 5: got ppl renting in house til end of January

As you can see, it works for the first case but not for the second, and I can't work out how to fix it. My code is below. Can anyone give me some advice on how to solve this?

Code

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class App {

    private static String line;
    private static final String regex = "^\\d{2}\\/\\d{2}\\/\\d{4}";
    private static Pattern pattern;

    public static void main(String[] args) {

        pattern = Pattern.compile(regex);

        try {
            BufferedReader reader = new BufferedReader(new FileReader("src/main/resources/WhatsAppChat2.txt"));
            while ((line = reader.readLine()) != null) {
                StringBuilder sb = new StringBuilder();
                boolean isNewMessage = identifyNewMessage();

                //If message is split over multiple lines, it is combined into one line
                if(isNewMessage) {
                    sb.append(line);    
                    while ((line = reader.readLine()) != null) {
                        String text = line;
                        isNewMessage = identifyNewMessage();
                        if(!isNewMessage) {
                            sb.append(" " + line);
                        }
                        else {
                            break;
                        }
                    }
                }

                System.out.println(sb.toString());
                System.out.println(line);
                //formatText(sb.toString());
                //formatText(line);
            }
            reader.close();
        } 
        catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Checks if file line is a new message or not
     * @return      - True if it is a message message, False if not
     */
    private static boolean identifyNewMessage() {

        Matcher m = pattern.matcher(line);
        if(m.find()) {
            return true;
        }
        else {
            return false;
        }
    }
}

3 Answers:

Answer 0 (score: 1)

Use this pattern:

^(\d{2}\/\d{2}\/\d{4}), (\d{2}:\d{2}) - (.*):(.*)$

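You should be able to pick out 4 capture groups:

1 - the date, as 99/99/9999
2 - the time, as 99:99
3 - the friend's name (everything after the " - " separator, up to the ':' character)
4 - the message text, from after the ':' to the end of the line

By reading each capture group, you can format the output for the csv file.

Keep in mind that the pattern assumes the whitespace exactly as written in the example.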
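
As a rough illustration of reading the four groups, here is a minimal sketch. The class name `MessageLineParser` and the helpers `toCsvRow` and `quote` are made up for this example. It also swaps the name group to `[^:]+` (my assumption, not part of the pattern above), because a greedy `(.*)` before the colon would run up to the last ':' in the line and pull part of the message into the name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MessageLineParser {

    // Same shape as the pattern above. The name group is [^:]+ (an assumption of
    // this sketch) so that a colon inside the message text stays in group 4.
    private static final Pattern MESSAGE = Pattern.compile(
            "^(\\d{2}/\\d{2}/\\d{4}), (\\d{2}:\\d{2}) - ([^:]+):(.*)$");

    /** Wraps a CSV field in double quotes, doubling any embedded quotes. */
    public static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Returns "date,time,name,text" for one merged line, or null if it does not match. */
    public static String toCsvRow(String line) {
        Matcher m = MESSAGE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return String.join(",",
                m.group(1),                 // date
                m.group(2),                 // time
                quote(m.group(3).trim()),   // name
                quote(m.group(4).trim()));  // text
    }

    public static void main(String[] args) {
        // prints: 06/07/2016,21:44,"Friend 1","Any further questions?"
        System.out.println(toCsvRow("06/07/2016, 21:44 - Friend 1: Any further questions?"));
    }
}
```

This expects each message to already be on a single line, so continuation lines such as "Steve?" have to be merged first; on their own they return null.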

Answer 1 (score: 1)

If memory and speed aren't a concern (and I doubt they are for a chat log), I would do it this way:

Deque<String> mergedLines = new LinkedList<> ();

while ((line = reader.readLine()) != null) {
  if (!identifyNewMessage()) {
    String currentLine = mergedLines.removeLast();
    line = currentLine + " " + line;
  }
  mergedLines.add(line);
}

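Now you can iterate over the list and do whatever you need to do.

Note that the code will throw an exception (a NoSuchElementException from removeLast() on the empty deque) if the first line is not a new message.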
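
For reference, a self-contained sketch of the same idea might look like the following. The class name `LineMerger` and the method `merge` are invented for the example. Skipping blank lines and keeping a stray first line rather than throwing are assumptions on my part, chosen so that the result matches the single-space output the question asks for. Because every line is handled in one loop, it also avoids the look-ahead problem in the question's nested loops, where the line that ends the inner loop is printed without ever collecting its own continuation lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Pattern;

public class LineMerger {

    // Date prefix from the question: dd/mm/yyyy at the start of a line.
    private static final Pattern NEW_MESSAGE = Pattern.compile("^\\d{2}/\\d{2}/\\d{4}");

    /** Appends continuation lines to the message that precedes them. */
    public static List<String> merge(BufferedReader reader) throws IOException {
        Deque<String> merged = new ArrayDeque<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (NEW_MESSAGE.matcher(line).find()) {
                merged.addLast(line);
            } else if (line.trim().isEmpty()) {
                // Blank line inside a message: skipped (assumption of this sketch).
                continue;
            } else if (merged.isEmpty()) {
                // Nothing to attach to yet: keep the stray line instead of throwing.
                merged.addLast(line);
            } else {
                merged.addLast(merged.removeLast().trim() + " " + line.trim());
            }
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) throws IOException {
        String sample = "28/12/2016, 19:14 - Friend 1: You going?\n"
                + "\n"
                + "Steve?\n"
                + "28/12/2016, 19:15 - Friend 5: got ppl renting in house til end of January\n";
        // Prints two lines; the first ends with "You going? Steve?".
        for (String message : merge(new BufferedReader(new StringReader(sample)))) {
            System.out.println(message);
        }
    }
}
```

To run it against the real file, pass `new BufferedReader(new FileReader(...))` as in the question instead of the `StringReader`.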

Answer 2 (score: 1)

You can use

^
(?P<date>\d{2}[^-]+)\s+-\s+
(?P<friend>[^:]+):
(?P<msg>[\s\S]+?(?=^\d{2}|\Z))

Breakdown:

^                              # start of the line
(?P<date>\d{2}[^-]+)\s+-\s+    # two digits, followed by anything not a -
(?P<friend>[^:]+):             # the friendly neighborhood group
(?P<msg>[\s\S]+?(?=^\d{2}|\Z)) # match anything up to either 
                               # a new date or the very end of the string

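See a demo on regex101.com and mind the modifiers. Also, backslashes need to be escaped in Java, and Java writes named groups as (?<name>...) rather than (?P<name>...).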

As @assylias pointed out, one needs to read the whole file into a string beforehand.
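
A possible Java rendering of this approach is sketched below. The class name `WholeFileMatcher`, the method `extract`, and the " | " separator in the output are only illustrative. It assumes the `MULTILINE` flag is the modifier the demo relies on, so that `^` matches at the start of every line. One caveat that follows from the lookahead: a continuation line that happens to start with two digits would end the message early.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeFileMatcher {

    // Java spelling of the pattern above: (?<name>...) groups, doubled backslashes,
    // and MULTILINE so that ^ matches at the start of every line.
    private static final Pattern MESSAGE = Pattern.compile(
            "^(?<date>\\d{2}[^-]+)\\s+-\\s+"
            + "(?<friend>[^:]+):"
            + "(?<msg>[\\s\\S]+?(?=^\\d{2}|\\Z))",
            Pattern.MULTILINE);

    /** Returns one "date | friend | message" string per message found in the chat text. */
    public static List<String> extract(String chat) {
        List<String> out = new ArrayList<>();
        Matcher m = MESSAGE.matcher(chat);
        while (m.find()) {
            // Collapse the line breaks inside a multi-line message into single spaces.
            String msg = m.group("msg").trim().replaceAll("\\s+", " ");
            out.add(m.group("date").trim() + " | " + m.group("friend").trim() + " | " + msg);
        }
        return out;
    }

    public static void main(String[] args) {
        String chat = "28/12/2016, 19:14 - Friend 1: You going?\n"
                + "\n"
                + "Steve?\n"
                + "28/12/2016, 19:15 - Friend 5: got ppl renting in house til end of January\n";
        for (String row : extract(chat)) {
            System.out.println(row);
        }
    }
}
```

For the real file, the whole content can be loaded first with something like `new String(Files.readAllBytes(Paths.get(...)), StandardCharsets.UTF_8)` and passed to `extract`.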