I am trying to parse multiple files and split each one into a set of fields stored in a HashMap. Here is a sample file.
COCONUT OIL CONTRACT TO CHANGE - DUTCH TRADERS
ROTTERDAM, March 18 - Contract terms for trade in coconut
oil are to be changed from long tons to tonnes with effect from
the Aug/Sep contract onwards, Dutch vegetable oil traders said.
Operators have already started to take account of the
expected change and reported at least one trade in tonnes for
Aug/Sept shipment yesterday.
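I need the program to parse this document into the fields of a custom Document class, which holds a key, file name, file title, place, date, author, content, and category.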
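For reference, such a class might look roughly like the sketch below. The field names, the constructor, and the `toMap` helper are all assumptions made for illustration; only the class name `Document` and the list of fields come from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical container for the fields listed above; all member names are assumptions.
final class Document {
    final String key;
    final String fileName;
    final String title;
    final String place;
    final String date;
    final String author;
    final String content;
    final String category;

    Document(String key, String fileName, String title, String place,
             String date, String author, String content, String category) {
        this.key = key;
        this.fileName = fileName;
        this.title = title;
        this.place = place;
        this.date = date;
        this.author = author;
        this.content = content;
        this.category = category;
    }

    // Expose the fields as an ordered map, since the goal is a map of fields per file.
    Map<String, String> toMap() {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("key", key);
        fields.put("fileName", fileName);
        fields.put("title", title);
        fields.put("place", place);
        fields.put("date", date);
        fields.put("author", author);
        fields.put("content", content);
        fields.put("category", category);
        return fields;
    }
}
```

The sample file shows no author line, so that field would presumably stay null for files like this one.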
Here is what I have tried.
public static Document parse(String filename) {
    File f = new File(filename);
    if (f.isFile()) {
        String fileId;
        if (filename.indexOf(".") > 0) {
            fileId = filename.substring(0, filename.lastIndexOf("."));
        }
        String category = f.getParent();
        InputStream in = new FileInputStream(f);
        byte buf[] = new byte[1024];
        int len = in.read(buf);
        while (len > 0) {
            ..........
        }
        in.close();
    }
    return null;
}
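The file id and category part of the attempt above could be isolated along the following lines. This is only a sketch: the `FileInfo` class and its method names are hypothetical, and it assumes the category is the name of the file's immediate parent directory, which the code above does not confirm.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; assumes category == name of the immediate parent directory.
final class FileInfo {
    // File name without its directory and without its last extension.
    static String fileId(String filename) {
        String name = Paths.get(filename).getFileName().toString();
        int dot = name.lastIndexOf('.');
        // dot > 0 keeps names such as ".hidden" or "README" unchanged
        return dot > 0 ? name.substring(0, dot) : name;
    }

    // Name of the parent directory, or an empty string when there is none.
    static String category(String filename) {
        Path parent = Paths.get(filename).getParent();
        Path dirName = (parent == null) ? null : parent.getFileName();
        return dirName == null ? "" : dirName.toString();
    }
}
```

Unlike `filename.substring(...)` on the full path, this strips the directory before looking for the extension, so a dot in a directory name does not affect the id.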
Answer 0 (score: 0)
The following code may help you:
// Requires: import java.io.BufferedReader; import java.io.FileReader;
try (BufferedReader br = new BufferedReader(new FileReader("myFile.txt"))) {
    StringBuilder contentBuffer = new StringBuilder();
    String line;
    boolean foundTitle = false;
    boolean foundPlaceAndDate = false;
    String date = "";
    while ((line = br.readLine()) != null) {
        if (line.matches("^[a-zA-Z0-9].*") && !foundTitle) {
            // The first line that starts with a letter or digit is taken as the title
            System.out.println("Title: " + line);
            foundTitle = true;
        } else if (line.matches("^[ \\t].*") && !foundPlaceAndDate) {
            // The first line that starts with a space or tab opens the first paragraph,
            // which is expected to look like "PLACE, DATE - text"
            String trimmed = line.trim();
            System.out.println("Place: " + trimmed.substring(0, trimmed.indexOf(",")));
            date = trimmed.substring(trimmed.indexOf(",") + 1, trimmed.indexOf("-")).trim();
            System.out.println("Date: " + date);
            foundPlaceAndDate = true;
        }
        // Keep a space between lines so words at line ends do not run together
        contentBuffer.append(line).append(' ');
    }
    // The content is everything after the date and the " -" separator that follows it
    String all = contentBuffer.toString();
    String content = all.substring(all.indexOf(date) + date.length() + 2).trim();
    System.out.println("Content: " + content);
} catch (Exception e) {
    System.err.println("Oh no! I got the following error: " + e.getMessage());
}
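The output will be:
Title: COCONUT OIL CONTRACT TO CHANGE - DUTCH TRADERS
Place: ROTTERDAM
Date: March 18
Content: Contract terms for trade in coconut oil are to be changed from long tons to tonnes with effect from the Aug/Sep contract onwards, Dutch vegetable oil traders said. Operators have already started to take account of the expected change and reported at least one trade in tonnes for Aug/Sept shipment yesterday.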
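Since the question asks for the fields in a map rather than printed, the same idea could be packaged as in the sketch below. It is a variation, not the code above: instead of relying on leading whitespace, it treats the first non-empty line as the title and splits the next non-empty line on its first comma and the first " - " after it. The `ArticleParser` class, the `parse` method, and the map keys are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; assumes the layout "TITLE" then "PLACE, DATE - text...".
final class ArticleParser {
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        StringBuilder body = new StringBuilder();
        boolean titlePending = true;    // next non-empty line is the title
        boolean datelinePending = true; // line after the title may carry place and date
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (titlePending) {
                fields.put("title", line);
                titlePending = false;
                continue;
            }
            if (datelinePending) {
                datelinePending = false;
                int comma = line.indexOf(',');
                int dash = line.indexOf(" - ", comma + 1);
                if (comma > 0 && dash > comma) {
                    fields.put("place", line.substring(0, comma).trim());
                    fields.put("date", line.substring(comma + 1, dash).trim());
                    line = line.substring(dash + 3).trim(); // rest of the line is content
                }
            }
            if (!line.isEmpty()) {
                if (body.length() > 0) {
                    body.append(' ');
                }
                body.append(line);
            }
        }
        fields.put("content", body.toString());
        return fields;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "COCONUT OIL CONTRACT TO CHANGE - DUTCH TRADERS",
            "ROTTERDAM, March 18 - Contract terms for trade in coconut",
            "oil are to be changed from long tons to tonnes.");
        parse(sample).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

If the second line does not match the expected pattern, the place and date keys are simply left out and the whole line is kept as content, which seems safer than throwing from `substring`.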