Parsing a text file with Java to get a HashMap of fields

Time: 2014-09-19 19:16:08

Tags: java

I am trying to parse multiple files and split each one into a set of fields stored in a HashMap. Here is a sample file.

COCONUT OIL CONTRACT TO CHANGE - DUTCH TRADERS

    ROTTERDAM, March 18 - Contract terms for trade in coconut
oil are to be changed from long tons to tonnes with effect from
the Aug/Sep contract onwards, Dutch vegetable oil traders said.
    Operators have already started to take account of the
expected change and reported at least one trade in tonnes for
Aug/Sept shipment yesterday.

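I need the program to parse this document into the fields of a custom Document class, which contains the key, file name, title, place, date, author, content, and category.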

Here is what I have tried.

public static Document parse(String filename) {

    File f = new File(filename);

    if (f.isFile()) {

        String fileId;
        if (filename.indexOf(".") > 0) {
            fileId = filename.substring(0, filename.lastIndexOf("."));
        }
        String category = f.getParent();

        InputStream in = new FileInputStream(f);

        byte buf[] = new byte[1024];
        int len = in.read(buf);
        while (len > 0) {
            ..........
        }
        in.close();
    }

    return null;
}
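
For the file ID and category, one possible approach is to work on a java.nio.file.Path instead of the raw filename string, so that a dot in a directory name does not end up in the ID. This is only a sketch: the class name FileMeta, the helper names, and the path data/trade/0001.txt are hypothetical, and it assumes the category is the name of the directory that contains the file.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileMeta {

    /** File name without its last extension; assumes the path names a file. */
    public static String fileId(Path path) {
        String name = path.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // dot > 0 keeps names such as ".hidden" intact
        return dot > 0 ? name.substring(0, dot) : name;
    }

    /** Name of the containing directory, or "" when there is none. */
    public static String category(Path path) {
        Path parent = path.getParent();
        Path dir = (parent == null) ? null : parent.getFileName();
        return (dir == null) ? "" : dir.toString();
    }

    public static void main(String[] args) {
        // Hypothetical layout: one directory per category
        Path p = Paths.get("data", "trade", "0001.txt");
        System.out.println("fileId: " + fileId(p));
        System.out.println("category: " + category(p));
    }
}
```

Both values can then be stored next to the parsed fields without opening the file.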

1 answer:

Answer 0 (score: 0)

The following code may help:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ParseSample {
    public static void main(String[] args) {
        // try-with-resources closes the reader even if parsing throws
        try (BufferedReader br = new BufferedReader(new FileReader("myFile.txt"))) {
            StringBuilder contentBuffer = new StringBuilder();
            String line;
            boolean foundTitle = false;
            boolean foundPlaceAndDate = false;
            String date = "";
            while ((line = br.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // skip blank and whitespace-only lines
                }
                if (line.matches("^[a-zA-Z0-9].*") && !foundTitle) {
                    // The first line that starts with a letter or digit is the title
                    System.out.println("Title: " + trimmed);
                    foundTitle = true;
                } else if (line.matches("^[ \\t].*") && !foundPlaceAndDate) {
                    // The first indented line opens the body: "PLACE, DATE - text"
                    System.out.println("Place: " + trimmed.substring(0, trimmed.indexOf(",")));
                    date = trimmed.substring(trimmed.indexOf(",") + 1, trimmed.indexOf("-")).trim();
                    System.out.println("Date: " + date);
                    foundPlaceAndDate = true;
                }
                // Join lines with a space so words at line breaks do not run together
                contentBuffer.append(trimmed).append(' ');
            }

            // The content is everything after the date and the " -" that follows it
            String all = contentBuffer.toString();
            String content = all.substring(all.indexOf(date) + date.length() + 2).trim();
            System.out.println("Content: " + content);
        } catch (IOException | RuntimeException e) {
            // A dateline without a comma or dash surfaces here as a RuntimeException
            System.err.println("Oh no! I got the following error: " + e.getMessage());
        }
    }
}

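For the sample file, the output will look like this:

Title: COCONUT OIL CONTRACT TO CHANGE - DUTCH TRADERS

Place: ROTTERDAM

Date: March 18

Content: Contract terms for trade in coconut oil are to be changed from long tons to tonnes with effect from the Aug/Sep contract onwards, Dutch vegetable oil traders said. Operators have already started to take account of the expected change and reported at least one trade in tonnes for Aug/Sept shipment yesterday.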
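
Since the question asks for a HashMap of fields rather than printed lines, the same title and dateline rules could be wrapped in a method that returns a map. The following is a minimal sketch under stated assumptions, not a definitive implementation: the class name FieldParser, the method parse, and the keys "title", "place", "date", and "content" are all hypothetical, and it assumes every file follows the sample layout (a title line, then a dateline of the form "PLACE, DATE - text").

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldParser {

    /** Parses the lines of one article into a map of field name to value. */
    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        StringBuilder content = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank lines carry no field data
            }
            if (!fields.containsKey("title")) {
                fields.put("title", trimmed); // first non-blank line is the title
            } else if (!fields.containsKey("place")) {
                // Dateline: "PLACE, DATE - first words of the body"
                int comma = trimmed.indexOf(',');
                int dash = trimmed.indexOf(" - ", comma + 1);
                if (comma < 0 || dash < 0) {
                    throw new IllegalArgumentException("Unexpected dateline: " + trimmed);
                }
                fields.put("place", trimmed.substring(0, comma).trim());
                fields.put("date", trimmed.substring(comma + 1, dash).trim());
                content.append(trimmed.substring(dash + 3));
            } else {
                // Remaining lines are body text, joined with single spaces
                content.append(' ').append(trimmed);
            }
        }
        fields.put("content", content.toString().trim());
        return fields;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "COCONUT OIL CONTRACT TO CHANGE - DUTCH TRADERS",
            "",
            "    ROTTERDAM, March 18 - Contract terms for trade in coconut",
            "oil are to be changed from long tons to tonnes with effect from",
            "the Aug/Sep contract onwards, Dutch vegetable oil traders said.");
        parse(sample).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

Taking a list of lines instead of a file name keeps the parsing logic testable without touching the disk; a caller could pass in the result of java.nio.file.Files.readAllLines and copy the map entries into the Document class.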