Question

我正在尝试从RSS源中提取一些XHTML，以便将其放在WebView中。有问题的RSS源有一个名为<content>的标签，内容中的字符是XHTML。（我正在削减的网站是一个博主提要）尝试提取此内容的最佳方法是什么？ <个字符使我的解析器混乱。我已尝试过DOM和SAX，但两者都无法很好地处理。

Here is a sample of the XML as requested.在这种情况下，我希望内容标记中的XHTML基本上是一个字符串。 <content> XHTML </content>

编辑：根据ignyhere的建议，我尝试过XPath，但我仍然遇到同样的问题。 Here is a pastebin sample of my tests.

Answer 1

我会尝试使用XPath攻击它。会这样的吗？

public static String parseAtom (InputStream atomIS) 
   throws Exception { 

   // Below should yield the second content block
   String xpathString = "(//*[starts-with(name(),"content")])[2]";
   // or, String xpathString = "//*[name() = 'content'][2]";
   // remove the '[2]' to get all content tags or get the count, 
   // if needed, and then target specific blocks 
   //String xpathString = "count(//*[starts-with(name(),"content")])"; 
   // note the evaluate expression below returns a glob and not a node set

   XPathFactory xpf              = XPathFactory.newInstance (); 
   XPath xpath                   = xpf.newXPath (); 
   XPathExpression xpathCompiled = xpath.compile (xpathString); 

   // use the first to recast and evaluate as NodeList 
   //Object atomOut = xpathCompiled.evaluate ( 
   //   new InputSource (atomIS), XPathConstants.NODESET); 
   String atomOut = xpathCompiled.evaluate ( 
      new InputSource (atomIS), XPathConstants.STRING); 

   System.out.println (atomOut); 

   return atomOut; 

}

Answer 2

它并不漂亮，但这是我使用XmlPullParser从Blogger解析ATOM提要的（实质）。代码非常icky，但它来自一个真正的应用程序。无论如何，你可能会得到它的一般风味。

    final String TAG_FEED = "feed";

public int parseXml(Reader reader) {
    XmlPullParserFactory factory = null;
    StringBuilder out = new StringBuilder();
    int entries = 0;

    try {
        factory = XmlPullParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XmlPullParser xpp = factory.newPullParser();
        xpp.setInput(reader);

        while (true) {
            int eventType = xpp.next();
            if (eventType == XmlPullParser.END_DOCUMENT) {
                break;
            } else if (eventType == XmlPullParser.START_DOCUMENT) {
                out.append("Start document\n");
            } else if (eventType == XmlPullParser.START_TAG) {
                String tag = xpp.getName();
                // out.append("Start tag " + tag + "\n");
                if (TAG_FEED.equalsIgnoreCase(tag)) {
                    entries = parseFeed(xpp);
                }
            } else if (eventType == XmlPullParser.END_TAG) {
                // out.append("End tag " + xpp.getName() + "\n");
            } else if (eventType == XmlPullParser.TEXT) {
                // out.append("Text " + xpp.getText() + "\n");
            }
        }
        out.append("End document\n");

    } catch (XmlPullParserException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    //        return out.toString();
    return entries;

}

private int parseFeed(XmlPullParser xpp) throws XmlPullParserException, IOException {
    int depth = xpp.getDepth();
    assert (depth == 1);
    int eventType;
    int entries = 0;
    xpp.require(XmlPullParser.START_TAG, null, TAG_FEED);
    while (((eventType = xpp.next()) != XmlPullParser.END_DOCUMENT) && (xpp.getDepth() > depth)) {
        // loop invariant: At this point, the parser is not sitting on
        // end-of-document, and is at a level deeper than where it started.

        if (eventType == XmlPullParser.START_TAG) {
            String tag = xpp.getName();
            // Log.d("parseFeed", "Start tag: " + tag);    // Uncomment to debug
            if (FeedEntry.TAG_ENTRY.equalsIgnoreCase(tag)) {
                FeedEntry feedEntry = new FeedEntry(xpp);
                feedEntry.persist(this);
                entries++;
                // Log.d("FeedEntry", feedEntry.title);    // Uncomment to debug
                // xpp.require(XmlPullParser.END_TAG, null, tag);
            }
        }
    }
    assert (depth == 1);
    return entries;
}

class FeedEntry {
    String id;
    String published;
    String updated;
    // Timestamp lastRead;
    String title;
    String subtitle;
    String authorName;
    int contentType;
    String content;
    String preview;
    String origLink;
    String thumbnailUri;
    // Media media;

    static final String TAG_ENTRY = "entry";
    static final String TAG_ENTRY_ID = "id";
    static final String TAG_TITLE = "title";
    static final String TAG_SUBTITLE = "subtitle";
    static final String TAG_UPDATED = "updated";
    static final String TAG_PUBLISHED = "published";
    static final String TAG_AUTHOR = "author";
    static final String TAG_CONTENT = "content";
    static final String TAG_TYPE = "type";
    static final String TAG_ORIG_LINK = "origLink";
    static final String TAG_THUMBNAIL = "thumbnail";
    static final String ATTRIBUTE_URL = "url";

    /**
    * Create a FeedEntry by pulling its bits out of an XML Pull Parser. Side effect: Advances
    * XmlPullParser.
    * 
    * @param xpp
    */
public FeedEntry(XmlPullParser xpp) {
    int eventType;
    int depth = xpp.getDepth();
    assert (depth == 2);
    try {
        xpp.require(XmlPullParser.START_TAG, null, TAG_ENTRY);
        while (((eventType = xpp.next()) != XmlPullParser.END_DOCUMENT)
        && (xpp.getDepth() > depth)) {

            if (eventType == XmlPullParser.START_TAG) {
                String tag = xpp.getName();
                if (TAG_ENTRY_ID.equalsIgnoreCase(tag)) {
                    id = Util.XmlPullTag(xpp, TAG_ENTRY_ID);
                } else if (TAG_TITLE.equalsIgnoreCase(tag)) {
                    title = Util.XmlPullTag(xpp, TAG_TITLE);
                } else if (TAG_SUBTITLE.equalsIgnoreCase(tag)) {
                    subtitle = Util.XmlPullTag(xpp, TAG_SUBTITLE);
                } else if (TAG_UPDATED.equalsIgnoreCase(tag)) {
                    updated = Util.XmlPullTag(xpp, TAG_UPDATED);
                } else if (TAG_PUBLISHED.equalsIgnoreCase(tag)) {
                    published = Util.XmlPullTag(xpp, TAG_PUBLISHED);
                } else if (TAG_CONTENT.equalsIgnoreCase(tag)) {
                    int attributeCount = xpp.getAttributeCount();
                    for (int i = 0; i < attributeCount; i++) {
                        String attributeName = xpp.getAttributeName(i);
                        if (attributeName.equalsIgnoreCase(TAG_TYPE)) {
                            String attributeValue = xpp.getAttributeValue(i);
                            if (attributeValue
                            .equalsIgnoreCase(FeedReaderContract.FeedEntry.ATTRIBUTE_NAME_HTML)) {
                                contentType = FeedReaderContract.FeedEntry.CONTENT_TYPE_HTML;
                                } else if (attributeValue
                                .equalsIgnoreCase(FeedReaderContract.FeedEntry.ATTRIBUTE_NAME_XHTML)) {
                                    contentType = FeedReaderContract.FeedEntry.CONTENT_TYPE_XHTML;
                                } else {
                                    contentType = FeedReaderContract.FeedEntry.CONTENT_TYPE_TEXT;
                                }
                                break;
                            }
                        }
                        content = Util.XmlPullTag(xpp, TAG_CONTENT);
                        extractPreview();
                    } else if (TAG_AUTHOR.equalsIgnoreCase(tag)) {
                        // Skip author for now -- it is complicated
                        int authorDepth = xpp.getDepth();
                        assert (authorDepth == 3);
                        xpp.require(XmlPullParser.START_TAG, null, TAG_AUTHOR);
                        while (((eventType = xpp.next()) != XmlPullParser.END_DOCUMENT)
                        && (xpp.getDepth() > authorDepth)) {
                        }
                        assert (xpp.getDepth() == 3);
                        xpp.require(XmlPullParser.END_TAG, null, TAG_AUTHOR);

                    } else if (TAG_ORIG_LINK.equalsIgnoreCase(tag)) {
                        origLink = Util.XmlPullTag(xpp, TAG_ORIG_LINK);
                    } else if (TAG_THUMBNAIL.equalsIgnoreCase(tag)) {
                        thumbnailUri = Util.XmlPullAttribute(xpp, tag, null, ATTRIBUTE_URL);
                    } else {
                        @SuppressWarnings("unused")
                            String throwAway = Util.XmlPullTag(xpp, tag);
                    }
                }
            } // while
        } catch (XmlPullParserException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        assert (xpp.getDepth() == 2);
    }
}

public static String XmlPullTag(XmlPullParser xpp, String tag) 
    throws XmlPullParserException, IOException {
    xpp.require(XmlPullParser.START_TAG, null, tag);
    String itemText = xpp.nextText();
    if (xpp.getEventType() != XmlPullParser.END_TAG) {
        xpp.nextTag();
    }
    xpp.require(XmlPullParser.END_TAG, null, tag);
    return itemText;
}

public static String XmlPullAttribute(XmlPullParser xpp, 
    String tag, String namespace, String name)
throws XmlPullParserException, IOException {
    assert (!TextUtils.isEmpty(tag));
    assert (!TextUtils.isEmpty(name));
    xpp.require(XmlPullParser.START_TAG, null, tag);
    String itemText = xpp.getAttributeValue(namespace, name);
    if (xpp.getEventType() != XmlPullParser.END_TAG) {
        xpp.nextTag();
    }
    xpp.require(XmlPullParser.END_TAG, null, tag);
    return itemText;
}

我会给你一个提示：没有一个返回值很重要。数据通过此行调用的方法（未显示）保存到数据库中：

feedEntry.persist(this);

Answer 3

我可以在这里看到您的问题，这些解析器没有产生正确结果的原因是因为<content>标记的内容未包含在<![CDATA[ ]]>中，我会做什么直到找到更多适当的解决方案我会使用快速而肮脏的技巧：

private void parseFile(String fileName) throws IOException {
        String line;
        BufferedReader br = new BufferedReader(new FileReader(new File(fileName)));
        StringBuilder sb = new StringBuilder();
        boolean match = false;

        while ((line = br.readLine()) != null) {
            if(line.contains("<content")){
                sb.append(line);
                sb.append("\n");
                match = true;
                continue;
            }

            if(match){
                sb.append(line);
                sb.append("\n");
                match = false;
            }

            if(line.contains("</content")){
                sb.append(line);
                sb.append("\n");
            }
        }

        System.out.println(sb.toString());
    }

这将为您提供String中的所有内容。您可以通过略微修改此方法来选择性地分隔它们，或者如果您不需要实际的<content>，也可以将其过滤掉。

如何使用Java将XHTML从ATOM提取中提取出来？

3 个答案: