Question

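I am playing around with Nutch. I am trying to write something that, among other things, detects specific nodes in the DOM structure and extracts text data from around those nodes, for example text from parent nodes, sibling nodes, and so on. I researched and read some examples, and then tried writing a plugin that does this for image nodes. Some code: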

    if("img".equalsIgnoreCase(nodeName) && nodeType == Node.ELEMENT_NODE){
            String imageUrl = "No Url"; 
            String altText = "No Text";
            String imageName = "No Image Name"; //For the sake of simpler code, default values set to
                                                //avoid nullpointerException in findMatches method

            NamedNodeMap attributes = currentNode.getAttributes();
            List<String>ParentNodesText = new ArrayList<String>();
            ParentNodesText = getSurroundingText(currentNode);

            //Analyze the attributes values inside the img node. <img src="xxx" alt="myPic"> 
            for(int i = 0; i < attributes.getLength(); i++){
                Attr attr = (Attr)attributes.item(i);   
                if("src".equalsIgnoreCase(attr.getName())){
                    imageUrl = getImageUrl(base, attr);
                    imageName = getImageName(imageUrl);
                }
                else if("alt".equalsIgnoreCase(attr.getName())){
                    altText = attr.getValue().toLowerCase();
                }
            }

    private List<String> getSurroundingText(Node currentNode){

        List<String> SurroundingText = new ArrayList<String>();
        while(currentNode != null){
            if(currentNode.getNodeType() == Node.TEXT_NODE){
                String text = currentNode.getNodeValue().trim();
                SurroundingText.add(text.toLowerCase());
            }

            if(currentNode.getPreviousSibling() != null && currentNode.getPreviousSibling().getNodeType() == Node.TEXT_NODE){
                String text = currentNode.getPreviousSibling().getNodeValue().trim();
                SurroundingText.add(text.toLowerCase());
            }
            currentNode = currentNode.getParentNode();
        }
        return SurroundingText;
    }

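This does not seem to work properly. The img tag is detected and the image name and URL are retrieved, but nothing more. The getSurroundingText method looks ugly; I tried but could not improve it. I am not clear on where and how to extract text that may be related to the image. Any help?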

Answer 1

You are on the right track. That said, take a look at this sample HTML:

    <div>
       <span>test1</span>
       <img src="http://example.com" alt="test image" title="awesome title">
       <span>test2</span>
    </div>

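In your case I think the problem is with the siblings of the img node. You are looking for the direct siblings, and you might expect those to be the span nodes in the example above. In this case, however, the direct siblings are "dummy" text nodes that hold only the whitespace between the tags, so when you ask for the sibling of img you get an empty node with no actual text.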

If we rewrite the previous HTML as <div><span>test1</span><img src="http://example.com" alt="test image" title="awesome title"><span>test2</span></div>, then the siblings of the img node would be the span nodes you want.
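To see the difference for yourself, here is a minimal sketch that parses both layouts and reports what sits directly before the img node. It assumes the JDK's built-in XML parser rather than the HTML parser Nutch uses, so the img tag is written self-closed; the class and method names (SiblingDemo, describePreviousSiblingOfImg) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Nutch.
class SiblingDemo {

    // Parses well-formed markup and describes the node directly before the first <img>.
    static String describePreviousSiblingOfImg(String markup) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
            Node img = doc.getElementsByTagName("img").item(0);
            Node previous = img.getPreviousSibling();
            if (previous == null) {
                return "none";
            }
            if (previous.getNodeType() == Node.TEXT_NODE) {
                // Whitespace between tags becomes a text node; trim() leaves it empty.
                return "TEXT[" + previous.getNodeValue().trim() + "]";
            }
            return "ELEMENT<" + previous.getNodeName() + ">";
        } catch (Exception e) {
            throw new IllegalArgumentException("could not inspect markup", e);
        }
    }

    public static void main(String[] args) {
        String pretty = "<div>\n"
                + "   <span>test1</span>\n"
                + "   <img src=\"http://example.com\" alt=\"test image\"/>\n"
                + "   <span>test2</span>\n"
                + "</div>";
        String compact = "<div><span>test1</span>"
                + "<img src=\"http://example.com\" alt=\"test image\"/>"
                + "<span>test2</span></div>";

        System.out.println(describePreviousSiblingOfImg(pretty));  // TEXT[]
        System.out.println(describePreviousSiblingOfImg(compact)); // ELEMENT<span>
    }
}
```

The indented layout yields an empty text node as the previous sibling, while the single-line layout yields the span element directly.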

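I assume that in the previous example you want to get both "test1" and "test2". In that case you need to keep moving across the siblings until you find a Node.ELEMENT_NODE, and then get the text nodes inside it. A good practice is not to grab everything you find, but to limit the scope to p, span, and div elements to improve accuracy.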
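A rough sketch of that idea follows. It walks backwards and then forwards from the given node, skips the whitespace text nodes, and takes the text of the nearest p, span, or div sibling on each side. The class name, the parse helper, and the TEXT_CONTAINERS set are my own naming, and the sample markup is parsed with the JDK XML parser (hence the self-closed img) instead of the DOM that Nutch hands to a plugin, so treat it as a starting point rather than a drop-in replacement.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of Nutch.
class SurroundingText {

    // Only these elements are treated as carriers of relevant text (an assumption).
    private static final Set<String> TEXT_CONTAINERS =
            new HashSet<>(Arrays.asList("p", "span", "div"));

    // Returns the text of the nearest matching sibling before and after the node.
    static List<String> getSurroundingText(Node node) {
        List<String> result = new ArrayList<>();
        addNearestSiblingText(node, false, result);
        addNearestSiblingText(node, true, result);
        return result;
    }

    // Moves sibling by sibling in one direction until a non-empty p/span/div is found.
    private static void addNearestSiblingText(Node start, boolean forward, List<String> out) {
        Node sibling = forward ? start.getNextSibling() : start.getPreviousSibling();
        while (sibling != null) {
            if (sibling.getNodeType() == Node.ELEMENT_NODE
                    && TEXT_CONTAINERS.contains(sibling.getNodeName().toLowerCase(Locale.ROOT))) {
                String text = sibling.getTextContent().trim();
                if (!text.isEmpty()) {
                    out.add(text.toLowerCase(Locale.ROOT));
                    return;
                }
            }
            // Whitespace text nodes and other elements are skipped.
            sibling = forward ? sibling.getNextSibling() : sibling.getPreviousSibling();
        }
    }

    // Test helper: parses well-formed markup with the JDK XML parser.
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<div>\n"
                + "   <span>test1</span>\n"
                + "   <img src=\"http://example.com\" alt=\"test image\"/>\n"
                + "   <span>test2</span>\n"
                + "</div>");
        Node img = doc.getElementsByTagName("img").item(0);
        System.out.println(getSurroundingText(img)); // [test1, test2]
    }
}
```

If no suitable sibling exists on either side, you could call the same method on the parent node to widen the search, which is roughly what the loop in your getSurroundingText was trying to do.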

How to get the surrounding text of a node?
