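I'm writing a program to clean up the data in a text file I have. The file contains text messages between my friends and me, so its format looks like this: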
06/07/2016, 21:44 - Friend 1: Sure.
So there's usually a date set by the Commissioners which serves as a deadline. If you haven't applied for tax back before that date, you won't be eligible for a refund.
06/07/2016, 21:44 - Friend 1: Any further questions?
06/07/2016, 21:45 - Friend 1: Just to clarify, one must apply before, not after, said date.
06/07/2016, 21:42 - Friend 2: Still getting my head around this. Could you explain the deadline thing once more
06/07/2016, 21:46 - Friend 3: All I can say is that I've some fantastic friends that will always endeavour me!
06/07/2016, 21:47 - Friend 3: I truly appreciate this
28/12/2016, 19:04 - Friend 4: Woo party not in mine and eds
28/12/2016, 19:14 - Friend 1: You going?
Steve?
28/12/2016, 19:15 - Friend 5: got ppl renting in house til end of January
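All of this is stored in a .txt file. I want to clean the data and convert it to a .csv file with the columns Date, Time, Name, Text.
I'm trying to iterate over the file, split each line, and write it to a new CSV file. For example, these lines in the file: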
06/07/2016, 21:44 - Friend 1: Sure.
So there's usually a date set by the Commissioners which serves as a deadline. If you haven't applied for tax back before that date, you won't be eligible for a refund.
would be combined into a single line, like this:
06/07/2016, 21:44 - Friend 1: Sure. So there's usually a date set by the Commissioners which serves as a deadline. If you haven't applied for tax back before that date, you won't be eligible for a refund.
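I know that every new message starts with the same date pattern, in dd/mm/yyyy format, so I use that to detect when a new message begins.
For now I'm not writing to the CSV file; I just want to get the text into the correct shape before processing it further. But for the sample input given above, the program outputs: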
06/07/2016, 21:44 - Friend 1: Sure. So there's usually a date set by the Commissioners which serves as a deadline. If you haven't applied for tax back before that date, you won't be eligible for a refund.
06/07/2016, 21:44 - Friend 1: Any further questions?
06/07/2016, 21:45 - Friend 1: Just to clarify, one must apply before, not after, said date.
06/07/2016, 21:42 - Friend 2: Still getting my head around this. Could you explain the deadline thing once more
06/07/2016, 21:46 - Friend 3: All I can say is that I've some fantastic friends that will always endeavour me!
06/07/2016, 21:47 - Friend 3: I truly appreciate this
28/12/2016, 19:04 - Friend 4: Woo party not in mine and eds 🥂🎉🎉
28/12/2016, 19:14 - Friend 1: You going?
Steve?
28/12/2016, 19:15 - Friend 5: got ppl renting in house til end of January
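As you can see, it works for the first case but not for the second ("You going?" and "Steve?" are not combined), and I can't find a way to fix it. My code is below. Can anyone suggest how to solve this?
Code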
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class App {
    private static String line;
    private static final String regex = "^\\d{2}\\/\\d{2}\\/\\d{4}";
    private static Pattern pattern;

    public static void main(String[] args) {
        pattern = Pattern.compile(regex);
        try {
            BufferedReader reader = new BufferedReader(new FileReader("src/main/resources/WhatsAppChat2.txt"));
            while ((line = reader.readLine()) != null) {
                StringBuilder sb = new StringBuilder();
                boolean isNewMessage = identifyNewMessage();
                //If message is split over multiple lines, it is combined into one line
                if(isNewMessage) {
                    sb.append(line);
                    while ((line = reader.readLine()) != null) {
                        String text = line;
                        isNewMessage = identifyNewMessage();
                        if(!isNewMessage) {
                            sb.append(" " + line);
                        }
                        else {
                            break;
                        }
                    }
                }
                System.out.println(sb.toString());
                System.out.println(line);
                //formatText(sb.toString());
                //formatText(line);
            }
            reader.close();
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Checks if file line is a new message or not
     * @return - True if it is a new message, False if not
     */
    private static boolean identifyNewMessage() {
        Matcher m = pattern.matcher(line);
        if(m.find()) {
            return true;
        }
        else {
            return false;
        }
    }
}
Answer 0 (score: 1)
Use this pattern:
^(\d{2}\/\d{2}\/\d{4}), (\d{2}:\d{2}) - ([^:]+): (.*)$
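This gives you 4 capture groups:
1 - the date, as 99/99/9999
2 - the time, as 99:99
3 - the friend's name (everything after the space, up to the ':' character)
4 - the message text after the ':' up to the end of the line
By reading each capture group, you can format the output for the csv file.
Keep in mind that the pattern assumes the whitespace is exactly as written in the example.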
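As a rough illustration of reading the four groups, here is a minimal sketch that turns one already-merged message line into a CSV row. The class name ChatLineParser and the helpers toCsvRow and quote are made up for this example, and the quoting rule (wrap each field in double quotes, double any embedded quotes) is an assumption about the CSV dialect you want. The name group is written as [^:]+ so that a colon inside the message text stays in the message.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChatLineParser {
    // Groups: 1 = date, 2 = time, 3 = sender name (no colon), 4 = message text.
    private static final Pattern MESSAGE = Pattern.compile(
            "^(\\d{2}/\\d{2}/\\d{4}), (\\d{2}:\\d{2}) - ([^:]+): (.*)$");

    /** Returns a CSV row for one merged message line, or null if the line does not match. */
    static String toCsvRow(String line) {
        Matcher m = MESSAGE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return String.join(",",
                quote(m.group(1)), quote(m.group(2)), quote(m.group(3)), quote(m.group(4)));
    }

    /** Wraps a field in double quotes and doubles embedded quotes (assumed CSV convention). */
    private static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(toCsvRow("06/07/2016, 21:44 - Friend 1: Any further questions?"));
    }
}
```

Note that this only works on lines that have already been merged, since '.' does not match line breaks by default; a continuation line such as "Steve?" on its own simply returns null.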
Answer 1 (score: 1)
If memory and speed aren't a concern (and I doubt they are for a chat log), I would do this:
Deque<String> mergedLines = new LinkedList<>();
while ((line = reader.readLine()) != null) {
    if (!identifyNewMessage()) {
        String currentLine = mergedLines.removeLast();
        line = currentLine + " " + line;
    }
    mergedLines.add(line);
}
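Now you can iterate over the list and do whatever you need with each merged message.
Note that the code will throw an exception (a NoSuchElementException from removeLast) if the first line is not a new message.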
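For reference, the same idea as a self-contained sketch. It assumes the lines have already been read into a List instead of coming from the BufferedReader loop, and the names MessageMerger and merge are invented for the example. Unlike the snippet above, it keeps a stray first line as its own entry rather than throwing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class MessageMerger {
    // A new message starts with a dd/mm/yyyy date, as described in the question.
    private static final Pattern NEW_MESSAGE = Pattern.compile("^\\d{2}/\\d{2}/\\d{4}");

    /** Joins each continuation line onto the message it belongs to. */
    static List<String> merge(List<String> lines) {
        List<String> merged = new ArrayList<>();
        for (String line : lines) {
            boolean startsMessage = NEW_MESSAGE.matcher(line).find();
            if (startsMessage || merged.isEmpty()) {
                // New message, or a stray first line with nothing to attach to.
                merged.add(line);
            } else {
                // Continuation: append to the most recent message.
                int last = merged.size() - 1;
                merged.set(last, merged.get(last) + " " + line);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "28/12/2016, 19:14 - Friend 1: You going?",
                "Steve?",
                "28/12/2016, 19:15 - Friend 5: got ppl renting in house til end of January");
        for (String message : merge(input)) {
            System.out.println(message);
        }
    }
}
```

Because every line is classified exactly once and only ever attached to the previous message, the "You going?" / "Steve?" case from the question is merged the same way as the first one.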
Answer 2 (score: 1)
You can use
^
(?P<date>\d{2}[^-]+)\s+-\s+
(?P<friend>[^:]+):
(?P<msg>[\s\S]+?(?=^\d{2}|\Z))
Breakdown:
^ # start of the line
(?P<date>\d{2}[^-]+)\s+-\s+ # two digits, followed by anything not a -
(?P<friend>[^:]+): # the friendly neighborhood group
(?P<msg>[\s\S]+?(?=^\d{2}|\Z)) # match anything up to either
# a new date or the very end of the string
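See a demo on regex101.com (and note the modifiers). In Java, the backslashes need to be escaped, and named groups are written (?<name>...) rather than (?P<name>...).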
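A possible Java rendering of that expression, as a sketch only: it assumes the whole file has been read into a single String, uses Pattern.MULTILINE so that ^ matches at every line start, and writes the pattern on one logical line instead of using verbose mode. WholeFileMatcher and parse are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeFileMatcher {
    // Java named groups use (?<name>...); backslashes are doubled inside the string literal.
    private static final Pattern MESSAGE = Pattern.compile(
            "^(?<date>\\d{2}[^-]+)\\s+-\\s+"
          + "(?<friend>[^:]+):"
          + "(?<msg>[\\s\\S]+?(?=^\\d{2}|\\Z))",
            Pattern.MULTILINE);

    /** Returns one {date and time, friend, message} array per message found in the content. */
    static List<String[]> parse(String content) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = MESSAGE.matcher(content);
        while (m.find()) {
            // Collapse the line breaks inside a multi-line message into single spaces.
            String msg = m.group("msg").trim().replaceAll("\\s*\\R\\s*", " ");
            rows.add(new String[] { m.group("date").trim(), m.group("friend"), msg });
        }
        return rows;
    }

    public static void main(String[] args) {
        String content = "28/12/2016, 19:14 - Friend 1: You going?\n"
                + "Steve?\n"
                + "28/12/2016, 19:15 - Friend 5: got ppl renting in house til end of January\n";
        for (String[] row : parse(content)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Two things to be aware of: the date group holds the date and the time together, so it still needs splitting on ", " for separate CSV columns; and because the lookahead only checks for two digits at the start of a line, a continuation line that happens to begin with two digits would be treated as the start of a new message.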