Parsing text that contains quotes in MapReduce (Java)

Asked: 2015-04-22 06:01:39

Tags: java xml parsing mapreduce

I have a document containing many strings that I need to parse. Each line has the following form:

PageID  Title   Date    <XML Article>

A sample line looks like this:

12146635    Pardulus of Laon    2009-11-01 10:51:01 <?xml version="1.0" encoding="UTF-8"?>\n<articles loadtime="0 sec" rendertime="0.002 sec" totaltime="0.002 sec"><article><paragraph><sentence id="12146635/0"><bold><link synthetic="true"><target>Pardulus of Laon</target></link></bold><extension extension_name="ref">Pardoul, Pardule de Laon, Pardulus Laudunensis.</extension><space/>was<space/><link><target>bishop of Laon</target></link>, France, from 847 to 857.</sentence> <sentence id="12146635/1">He is known for his participation in theological controversy.</sentence> <sentence id="12146635/2">A letter of his to<space/><link><target>Hincmar of Reims</target></link><space/>is known<extension extension_name="ref"><link type="external" href="http://www.forumromanum.org/literature/pardulus_laudunensis/hincmar.html"/>, page in French, online text in Latin.</extension>.</sentence></paragraph><heading level="2">Notes</heading><paragraph><extension extension_name="references"/></paragraph><paragraph><sentence id="12146635/3"><link><target>Category:9th-century bishops</target></link><link><target>Category:Bishops of Laon</target></link></sentence></paragraph><paragraph><sentence id="12146635/4"><link><target>nl:Pardulus van Laon</target></link></sentence></paragraph></article></articles>   Pardulus of Laon was bishop of Laon, France, from 847 to 857. He is known for his participation in theological controversy. A letter of his to Hincmar of Reims is known.\n\n

I am writing a MapReduce program that strips all XML tags from the fourth column and produces output with the following schema:

Title, Text

where Text is the <XML Article> with the XML tags removed and some case folding applied. For the line above, the output would be:

pardulus of laon, pardule de laon pardulus laudunensis bishop of laon france from to he is known for his participation in theological controversy a letter of his to hincmar of reims is known page in french online text in latin notes category th century bishops category bishops of laon nl pardulus van laon pardulus of laon was bishop of laon france from to he is known for his participation in theological controversy a letter of his to hincmar of reims is known
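As a minimal sketch of the transformation described above (not a full MapReduce job), the tag removal and lowercasing could be written as a plain Java helper. The class name TagStripper and the method clean are hypothetical, and treating the \n in the sample as a literal backslash followed by the letter n is an assumption about the data.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper: strips XML tags, drops everything that is not a letter,
// collapses whitespace, and lowercases the result.
final class TagStripper {
    // One tag, declaration, or processing instruction: '<' up to the next '>'.
    // Assumes no '>' appears inside an attribute value.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Any run of characters other than ASCII letters.
    private static final Pattern NON_LETTERS = Pattern.compile("[^a-zA-Z]+");

    static String clean(String raw) {
        // Assumption: "\n" in the input is a literal backslash + 'n', so remove it
        // first; otherwise the stray 'n' would survive the letter filter.
        String noEscapes = raw.replace("\\n", " ");
        // Replace each tag with a space so neighbouring words do not run together.
        String noTags = TAG.matcher(noEscapes).replaceAll(" ");
        return NON_LETTERS.matcher(noTags).replaceAll(" ").trim().toLowerCase(Locale.ROOT);
    }
}
```

Because the whole tag is removed in one match, the quoted attribute values inside it disappear with it. Entity references such as &amp; are not decoded by this sketch.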

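The problem I am running into is the double quotes (" ") inside the article tags themselves. So far I have managed to write the following code: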

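import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class Map extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Columns are tab-separated: PageID, Title, Date, <XML Article>.
        String[] line = value.toString().split("\t");
        // Title: keep letters only, collapse spaces, lowercase.
        String title = line[1].replaceAll("[^a-zA-Z]", " ").trim()
                .replaceAll(" +", " ").toLowerCase();
        // Text: the same cleanup applied to the whole line, so tag and
        // attribute names are not yet stripped.
        String text = value.toString().replaceAll("[^a-zA-Z]", " ").trim()
                .replaceAll(" +", " ").toLowerCase();
        context.write(new Text(title), new Text(text));
    }
}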

I do not know the strings in advance, so I cannot use replaceAll(""", "\""), which is the escape for a double quote.

How can I solve this?
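One possible direction, offered as a sketch rather than a confirmed fix: a double quote is not a regex metacharacter, so it only needs escaping for the Java string literal, and a pattern can match any quoted attribute value without knowing its content. The hypothetical QuoteAwareTags class below removes whole tags while letting quoted values contain a '>' character.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes XML tags with a pattern that treats quoted
// attribute values as opaque, so a '>' inside quotes does not end the tag early.
final class QuoteAwareTags {
    // '<', then any mix of: a "double-quoted" run, a 'single-quoted' run,
    // or a single character that is neither a quote nor '>', then the closing '>'.
    private static final Pattern TAG =
            Pattern.compile("<(?:\"[^\"]*\"|'[^']*'|[^'\">])*>");

    static String stripTags(String xml) {
        // Each tag becomes one space; quotes in ordinary text are left alone.
        return TAG.matcher(xml).replaceAll(" ");
    }
}
```

For example, QuoteAwareTags.stripTags("<link href=\"http://a/b>c\"/>page") removes the entire link tag, whereas the simpler pattern <[^>]*> would stop at the '>' inside the href value. Letter filtering and lowercasing can then be applied to the result as in the existing code.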

0 answers:

No answers