Question

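I have a large XML file in the following format. I could read it line by line and do some string manipulation, since I only need to extract the values of a few fields. But in general, how should files in this format be processed? I found the Mahout XML parser, but I don't think it works for this format.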

<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" AcceptedAnswerId="13" CreationDate="2010-09-13T19:16:26.763" Score="155" ViewCount="160162" Body="&lt;p&gt;This is a common question by those who have just rooted their phones.  What apps, ROMs, benefits, etc. do I get from rooting?  What should I be doing now?&lt;/p&gt;&#xA;" OwnerUserId="10" LastEditorUserId="16575" LastEditDate="2013-04-05T15:50:48.133" LastActivityDate="2013-09-03T05:57:21.440" Title="I've rooted my phone.  Now what?  What do I gain from rooting?" Tags="&lt;rooting&gt;&lt;root&gt;" AnswerCount="2" CommentCount="0" FavoriteCount="107" CommunityOwnedDate="2011-01-25T08:44:10.820" />
  <row Id="2" PostTypeId="1" AcceptedAnswerId="4" CreationDate="2010-09-13T19:17:17.917" Score="10" ViewCount="966" Body="&lt;p&gt;I have a Google Nexus One with Android 2.2. I didn't like the default SMS-application so I installed Handcent-SMS. Now when I get an SMS, I get notified twice. How can I fix this?&lt;/p&gt;&#xA;" OwnerUserId="7" LastEditorUserId="981" LastEditDate="2011-11-01T18:30:32.300" LastActivityDate="2011-11-01T18:30:32.300" Title="I installed another SMS application, now I get notified twice" Tags="&lt;2.2-froyo&gt;&lt;sms&gt;&lt;notifications&gt;&lt;handcent-sms&gt;" AnswerCount="3" FavoriteCount="2" />
</posts>

Answer 1

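The data you posted comes from the SO data dump (I know because I'm currently playing with it on Hadoop). Below is the mapper I wrote to turn it into a tab-delimited file.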

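Basically, you read the file line by line, parse each line with the JAXP API, and extract the information you need. This works because each row element in the dump sits on a single line, so every line is a well-formed XML fragment on its own.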

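import java.io.IOException;
import java.io.StringReader;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Joiner comes from Google Guava
import com.google.common.base.Joiner;

public class StackoverflowDataWranglerMapper extends Mapper<LongWritable, Text, Text, Text>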
{

    private final Text outputKey = new Text();
    private final Text outputValue = new Text();

    private final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    private DocumentBuilder builder;
    private static final Joiner TAG_JOINER = Joiner.on(",").skipNulls();
    // 2008-07-31T21:42:52.667
    private static final DateFormat DATE_PARSER = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
    private static final SimpleDateFormat DATE_BUILDER = new SimpleDateFormat("yyyy-MM-dd");

    @Override
    protected void setup(Context context) throws IOException, InterruptedException
    {
        try
        {
            builder = factory.newDocumentBuilder();
        }
        catch (ParserConfigurationException e)
        {
            // Rethrow so the task fails fast instead of hitting a null builder in map()
            throw new IOException(e);
        }
    }

    @Override
    protected void map(LongWritable inputKey, Text inputValue, Mapper<LongWritable, Text, Text, Text>.Context context)
            throws IOException, InterruptedException
    {
        try
        {
            String entry = inputValue.toString();
            if (entry.contains("<row "))
            {
                Document doc = builder.parse(new InputSource(new StringReader(entry)));
                Element rootElem = doc.getDocumentElement();

                String id = rootElem.getAttribute("Id");
                String postedBy = rootElem.getAttribute("OwnerUserId").trim();
                String viewCount = rootElem.getAttribute("ViewCount");
                String postTypeId = rootElem.getAttribute("PostTypeId");
                String score = rootElem.getAttribute("Score");
                String title = rootElem.getAttribute("Title");
                String tags = rootElem.getAttribute("Tags");
                String answerCount = rootElem.getAttribute("AnswerCount");
                String commentCount = rootElem.getAttribute("CommentCount");
                String favoriteCount = rootElem.getAttribute("FavoriteCount");
                String creationDate = rootElem.getAttribute("CreationDate");

                Date parsedDate = null;
                if (creationDate != null && creationDate.trim().length() > 0)
                {
                    try
                    {
                        parsedDate = DATE_PARSER.parse(creationDate);
                    }
                    catch (ParseException e)
                    {
                        context.getCounter("Bad Record Counters", "Posts missing CreationDate").increment(1);
                    }
                }

                if (postedBy.length() == 0 || postedBy.trim().equals("-1"))
                {
                    context.getCounter("Bad Record Counters", "Posts with either empty UserId or UserId contains '-1'")
                            .increment(1);
                    try
                    {
                        parsedDate = DATE_BUILDER.parse("2100-00-01");
                    }
                    catch (ParseException e)
                    {
                        // ignore
                    }
                }

                tags = tags.trim();
                String[] tagTokens = null;

                if (tags.length() > 1)
                {
                    tagTokens = tags.substring(1, tags.length() - 1).split("><");
                }
                else
                {
                    context.getCounter("Bad Record Counters", "Untagged Posts").increment(1);
                }

                outputKey.clear();
                outputKey.set(id);

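                // parsedDate stays null when CreationDate is absent or unparsable;
                // emit an empty field instead of throwing a NullPointerException.
                String dateField = (parsedDate == null) ? "" : String.valueOf(parsedDate.getTime());

                StringBuilder sb = new StringBuilder(postedBy).append("\t").append(dateField).append("\t")
                        .append(postTypeId).append("\t").append(title).append("\t").append(viewCount).append("\t").append(score)
                        .append("\t");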

                // The tags column is left empty for untagged posts.
                if (tagTokens != null)
                {
                    sb.append(TAG_JOINER.join(tagTokens));
                }
                sb.append("\t").append(answerCount).append("\t").append(commentCount).append("\t").append(favoriteCount);

                outputValue.set(sb.toString());

                context.write(outputKey, outputValue);
            }
        }
        catch (SAXException e)
        {
            context.getCounter("Bad Record Counters", "Unparsable records").increment(1);
        }
        finally
        {
            builder.reset();
        }
    }
}
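
To try the per-line parsing idea outside of Hadoop, here is a minimal standalone sketch that uses only the JDK. The class name RowLineParser and its two helper methods are hypothetical names made up for illustration, not part of the mapper above; the sketch assumes, as the mapper does, that each row element fits on one line.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
public class RowLineParser {

    /** Parses one self-contained row line and returns the requested attributes in order. */
    public static Map<String, String> parseRow(String line, String... names) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(line.trim())));
            Element row = doc.getDocumentElement();
            Map<String, String> fields = new LinkedHashMap<>();
            for (String name : names) {
                // getAttribute returns an empty string (not null) when the attribute is absent
                fields.put(name, row.getAttribute(name));
            }
            return fields;
        } catch (Exception e) {
            // Wrap parser errors so callers can count or skip bad lines
            throw new IllegalArgumentException("Not a well-formed row: " + line, e);
        }
    }

    /** Splits a decoded Tags value such as "<a><b>" into its tag names; empty array if untagged. */
    public static String[] splitTags(String tags) {
        String t = tags.trim();
        if (t.length() <= 2) {
            return new String[0];
        }
        return t.substring(1, t.length() - 1).split("><");
    }

    public static void main(String[] args) {
        String line = "  <row Id=\"1\" PostTypeId=\"1\" Score=\"155\" Tags=\"&lt;rooting&gt;&lt;root&gt;\" />";
        Map<String, String> f = parseRow(line, "Id", "Score", "Tags");
        // The XML parser has already decoded &lt; and &gt; in the Tags attribute
        System.out.println(f.get("Id") + "\t" + f.get("Score") + "\t"
                + String.join(",", splitTags(f.get("Tags"))));
    }
}
```

Creating a new DocumentBuilder per line keeps the sketch short; the mapper above reuses one builder and calls reset() after each record, which is the better choice for large inputs.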

How to process XML files in a Hadoop Mapper
