Question

My first post!

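I'm using XPath and TagSoup to parse web pages and read in the data. Because these are news articles, they sometimes have links embedded in the content, and those links are what is tripping up my program.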

The XPath I'm using is storyPath = "//html:article//html:p//text()"; where the page has this structure:

<article ...>
   <p>Some text from the story.</p>
   <p>More of the story, which proves <a href="">what a great story this is</a>!</p>
   <p>More of the story without links!</p>
</article>

The code that evaluates the XPath looks like this:

NodeList nL = XPathAPI.selectNodeList(doc,storyPath);

LinkedList<String> story = new LinkedList<String>();
    for (int i=0; i<nL.getLength(); i++) {
        Node n = nL.item(i);

        String tmp = n.toString();
        tmp = tmp.replace("[#text:", "");
        tmp = tmp.replace("]", "");
        tmp = tmp.replaceAll("‚Äô", "'");
        tmp = tmp.replaceAll("‚Äò", "'");
        tmp = tmp.replaceAll("‚Äì", "-");
        tmp = tmp.replaceAll("¬", "");
        tmp = tmp.trim();

        story.add(tmp);
    }

this.setStory(story);
...

private void setStory(LinkedList<String> story) {
    String tmp = "";
    for (String p : story) {
        tmp = tmp + p + "\n\n";
    }

    this.story = tmp.trim();
}

This gives me the following output:

Some text from the story.

More of the story, which proves 

what a great story this is

!

More of the story without links!

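Does anyone know how I can eliminate this error? Am I taking the wrong approach somewhere? (I realise the problem could well be in the setStory code, but I can't see another way to do it.)

Without the tmp.replace() calls, every result shows up like [#text: what a great story this is] and so on.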

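EDIT:

I'm still having trouble, though possibly of a different kind. What's killing me here is again a link, but because of the way the BBC lays out its pages, the link sits on a separate line in the source, so the text still reads in with the same problem as described before (note that the problem is fixed for the example given above). The relevant section of the BBC page is: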

    <p>    Former Queens Park Rangers trainee Sterling, who 

    <a  href="http://news.bbc.co.uk/sport1/hi/football/teams/l/liverpool/8541174.stm" >moved to the Merseyside club in February 2010 aged 15,</a> 

    had not started a senior match for the Reds before this season.
    </p>

which appears in my output as:

    Former Queens Park Rangers trainee Sterling, who 

    moved to the Merseyside club in February 2010 aged 15, 

         had not started a senior match for the Reds before this season.

Answer 1

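First select the paragraphs with storyPath = "//html:article//html:p". Then, for each paragraph, run a second XPath query to get all of its text nodes, concatenate them without inserting new lines, and add the two new lines only at the end of the paragraph.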
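The two-step query described above could be sketched roughly as follows. This is only an illustration under some assumptions: it uses the standard javax.xml.xpath API rather than the XPathAPI class from the question, it parses a small namespace-free sample so the path is //article//p without the html: prefix, and the class and method names (ParagraphJoiner, paragraphs, parse) are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original question's code.
class ParagraphJoiner {

    // Returns one string per <p>; text inside nested elements such as <a> stays inline.
    static List<String> paragraphs(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Step 1: select the paragraph elements themselves.
        NodeList ps = (NodeList) xpath.evaluate("//article//p", doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < ps.getLength(); i++) {
            Node p = ps.item(i);
            // Step 2: select every text node below this paragraph, in document order.
            NodeList texts = (NodeList) xpath.evaluate(".//text()", p, XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < texts.getLength(); j++) {
                sb.append(texts.item(j).getNodeValue()); // no separator between text nodes
            }
            result.add(sb.toString().trim());
        }
        return result;
    }

    // Parses a well-formed XML string (assumption: the real input comes from TagSoup instead).
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<article>"
                + "<p>Some text from the story.</p>"
                + "<p>More of the story, which proves <a href=\"\">what a great story this is</a>!</p>"
                + "<p>More of the story without links!</p>"
                + "</article>";
        // Paragraphs are separated by a blank line, as in setStory.
        System.out.println(String.join("\n\n", paragraphs(parse(xml))));
    }
}
```

With the sample from the question, the second paragraph should come out as a single line, because the blank-line separator is only added between paragraphs and never between text nodes.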

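On another note, you shouldn't have to call replaceAll("‚Äô", "'"). That is a sure sign that you are opening the file incorrectly. When you open the file, you need to pass a Reader to TagSoup, and you should initialize the Reader like this: Reader r = new BufferedReader(new InputStreamReader(new FileInputStream("myfilename.html"),"Cp1252")); where you specify the correct character set for the file. A list of character sets is here: http://docs.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html My guess is that it is Windows Latin 1.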

Answer 2

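[#text: is just the toString() representation of a DOM Text node. The toString() method is meant for when you want a string representation of the node for debugging. Instead of toString(), use getTextContent(), which returns the actual text.

If you don't want the link content to appear on separate lines, you can remove //text() from your XPath and take the textContent of the element nodes directly (for an element, getTextContent() returns the concatenation of all its descendant text nodes):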

String storyPath = "//html:article//html:p";
NodeList nL = XPathAPI.selectNodeList(doc,storyPath);

LinkedList<String> story = new LinkedList<String>();
for (int i=0; i<nL.getLength(); i++) {
    Node n = nL.item(i);
    story.add(n.getTextContent().trim());
}

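The fact that you have to fix up things like "‚Äô" by hand suggests that your HTML is actually encoded in UTF-8 but you are reading it with a single-byte character set (such as Windows-1252). Rather than trying to repair the text after the fact, you should first work out how to read the data in the correct encoding.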
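To illustrate the encoding point, here is a minimal sketch that decodes the same UTF-8 bytes twice, once correctly and once with a single-byte charset. It is an assumption-laden demo rather than the asker's setup: the bytes come from an in-memory array instead of a downloaded page, ISO-8859-1 stands in for whichever single-byte charset was actually used, and the names EncodingDemo and readAll are invented for the example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original thread.
class EncodingDemo {

    // Decodes the whole stream using the charset passed in explicitly.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new BufferedReader(new InputStreamReader(in, charset))) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "It\u2019s" contains a right single quotation mark; in UTF-8 that mark takes three bytes.
        byte[] utf8 = "It\u2019s".getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        // ISO-8859-1 is an assumed stand-in for the wrong single-byte charset.
        String wrong = readAll(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 4: the quotation mark is one character
        System.out.println(wrong.length()); // 6: each of its three bytes became its own character
    }
}
```

The same Reader, created with the page's real charset, is what would be handed to the parser, which makes the replaceAll clean-up calls unnecessary.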

Answer 3

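For the edited question: the new lines in the HTML source end up in the extracted text, so you need to remove them before printing. Instead of System.out.print(text.trim()); do System.out.println(text.trim().replaceAll("[ \t\r\n]+", " "));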
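As a small sketch of that whitespace clean-up, the snippet below applies the same replaceAll call to a string shaped like the BBC paragraph from the question. The class and method names (WhitespaceDemo, collapse) are hypothetical, and the input is typed in by hand rather than taken from a parsed page.

```java
// Hypothetical demo class, not part of the original thread.
class WhitespaceDemo {

    // Trims the ends, then collapses every run of spaces, tabs and line breaks into one space.
    static String collapse(String text) {
        return text.trim().replaceAll("[ \t\r\n]+", " ");
    }

    public static void main(String[] args) {
        // Text content of a <p> whose source spans several lines, as in the BBC example.
        String raw = "    Former Queens Park Rangers trainee Sterling, who \n\n"
                + "    moved to the Merseyside club in February 2010 aged 15, \n\n"
                + "    had not started a senior match for the Reds before this season.\n    ";
        System.out.println(collapse(raw));
    }
}
```

Applied to the result of getTextContent() for each paragraph, this should give one line per paragraph regardless of how the source HTML was wrapped.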
