Question

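I am going to parse some web pages with a Java program. For this I wrote a small piece of code that parses page content using XPath as the selector. To parse different sites, you need to find the appropriate XPath for each site. The problem is that this requires a human operator to find and write the XPath for you (for example, using the FirePath add-on for Firefox). Suppose you do not know in advance which pages you will have to parse, or the number of sites grows too large for an operator to find the right XPath for each one. In that case you need a way to parse pages without using any selectors (the same scenario applies to CSS selectors), or there should be a way to find the XPath automatically.

What methods are there for parsing web pages in this way? Below is the small piece of code I wrote for this purpose; please feel free to extend it when presenting your solutions.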

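import java.io.IOException;
import java.net.URL;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.PrettyXmlSerializer;
import org.htmlcleaner.TagNode;

// Downloads the page at the given URL, cleans it into well-formed XML
// and writes the result to c:\TEMP\clean.xml
public void downloadHTML(String url) throws IOException {
    CleanerProperties props = new CleanerProperties();

    // set some properties to non-default values
    props.setTranslateSpecialEntities(true);
    props.setTransResCharsToNCR(true);
    props.setOmitComments(true);

    // do parsing
    TagNode tagNode = new HtmlCleaner(props).clean(
        new URL(url)
    );

    // serialize to xml file
    new PrettyXmlSerializer(props).writeToFile(
        tagNode, "c:\\TEMP\\clean.xml", "utf-8"
    );
}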


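import java.io.FileInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public static void testJavaxXpath(String pattern)
        throws ParserConfigurationException, SAXException, IOException,
        XPathExpressionException {

    DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    org.w3c.dom.Document doc;
    // Close the file stream even if parsing fails
    try (FileInputStream in = new FileInputStream("c:\\TEMP\\clean.xml")) {
        doc = b.parse(in);
    }

    // Evaluate the XPath against the document's root element
    javax.xml.xpath.XPath xPath = XPathFactory.newInstance().newXPath();
    NodeList nodes = (NodeList) xPath.evaluate(pattern,
            doc.getDocumentElement(), XPathConstants.NODESET);
    for (int i = 0; i < nodes.getLength(); ++i) {
        // Matches may be attributes or text nodes, so avoid casting to Element
        Node first = nodes.item(i).getFirstChild();
        if (first != null) {
            System.out.println(first.getTextContent());
        }
    }
}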
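
To make the "find the XPath automatically" idea concrete, here is a rough sketch of one possible direction, not an established method. It assumes the page has already been cleaned into well-formed XML (as downloadHTML does), and it uses a deliberately naive heuristic: the element that directly holds the most text is guessed to be the main content. The class name ContentGuesser and the helpers parse, densest and xpathOf are hypothetical names invented for this illustration; only the JDK's DOM and XPath classes are used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: guesses a content element and derives its XPath.
class ContentGuesser {

    // Parse well-formed XML from a string; checked exceptions are wrapped.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Length of the text held directly by this element, ignoring descendants.
    static int ownTextLength(Element e) {
        int len = 0;
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                len += c.getNodeValue().trim().length();
            }
        }
        return len;
    }

    // Naive heuristic (an assumption): the element with the most direct text wins.
    static Element densest(Element root) {
        Element best = root;
        int bestLen = ownTextLength(root);
        NodeList all = root.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            String tag = e.getNodeName().toLowerCase();
            if (tag.equals("script") || tag.equals("style")) {
                continue; // skip code and styling, which are not page content
            }
            int len = ownTextLength(e);
            if (len > bestLen) {
                best = e;
                bestLen = len;
            }
        }
        return best;
    }

    // Build an absolute, position-based XPath such as /html[1]/body[1]/div[2].
    static String xpathOf(Node node) {
        StringBuilder path = new StringBuilder();
        for (Node n = node; n != null && n.getNodeType() == Node.ELEMENT_NODE;
                n = n.getParentNode()) {
            int index = 1; // XPath positions are 1-based
            for (Node s = n.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s.getNodeType() == Node.ELEMENT_NODE
                        && s.getNodeName().equals(n.getNodeName())) {
                    index++;
                }
            }
            path.insert(0, "/" + n.getNodeName() + "[" + index + "]");
        }
        return path.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><body><div>Home | About</div>"
                + "<div><p>Short intro.</p>"
                + "<p>This is the long main paragraph of the article body.</p></div>"
                + "</body></html>";
        Document doc = parse(xml);
        String path = xpathOf(densest(doc.getDocumentElement()));
        System.out.println(path);
        // Round trip: the generated XPath selects the same element again
        System.out.println(XPathFactory.newInstance().newXPath().evaluate(path, doc));
    }
}
```

For the sample document in main, the generated path is /html[1]/body[1]/div[2]/p[2], which can then be passed to testJavaxXpath like any hand-written pattern. A position-based path like this breaks as soon as the page layout changes, and "most direct text" will often pick the wrong element on real pages, so treat this only as a starting point for experimenting with richer scoring.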

Parsing the content of an HTML page without using selectors

0 answers: