Using XPath with libxml2 on iOS to extract text that spans multiple tags

Time: 2013-04-06 12:44:01

Tags: ios xpath libxml2

In an iOS app that uses libxml2, I am parsing this HTML fragment (it is part of a larger page):

...
<span class="ingredient">
    <span class="amount">
        <span class="value">500 </span> 
        <span class="type">g</span>
    </span>    
    <a href="...">bread flour</a> 
    or 
    <span class="ingredient">
        <span class="amount">
            <span class="value">500 </span> 
            <span class="type">g</span>
        </span>  
        <span class="name">
            <a href="...">all-purpose flour</a>
        </span>
    </span>
</span>
...

I just need to extract the text: "500 g bread flour or 500 g all-purpose flour".

This is the parsed NSDictionary result returned for the XPath query //span[@class="ingredient"]:

{
    nodeAttributeArray =     (
                {
            attributeName = class;
            nodeContent = ingredient;
        }
    );
    nodeChildArray =     (
                {
            nodeAttributeArray =             (
                                {
                    attributeName = class;
                    nodeContent = amount;
                }
            );
            nodeChildArray =             (
                                {
                    nodeAttributeArray =                     (
                                                {
                            attributeName = class;
                            nodeContent = value;
                        }
                    );
                    nodeContent = 500;
                    nodeName = span;
                },
                                {
                    nodeAttributeArray =                     (
                                                {
                            attributeName = class;
                            nodeContent = type;
                        }
                    );
                    nodeContent = g;
                    nodeName = span;
                }
            );
            nodeContent = "";
            nodeName = span;
        },
                {
            nodeAttributeArray =             (
                                {
                    attributeName = href;
                    nodeContent = "http://www.food.com/library/flour-64";
                }
            );
            nodeContent = "bread flour";
            nodeName = a;
        },
                {
            nodeAttributeArray =             (
                                {
                    attributeName = class;
                    nodeContent = ingredient;
                }
            );
            nodeChildArray =             (
                                {
                    nodeAttributeArray =                     (
                                                {
                            attributeName = class;
                            nodeContent = amount;
                        }
                    );
                    nodeChildArray =                     (
                                                {
                            nodeAttributeArray =                             (
                                                                {
                                    attributeName = class;
                                    nodeContent = value;
                                }
                            );
                            nodeContent = 500;
                            nodeName = span;
                        },
                                                {
                            nodeAttributeArray =                             (
                                                                {
                                    attributeName = class;
                                    nodeContent = type;
                                }
                            );
                            nodeContent = g;
                            nodeName = span;
                        }
                    );
                    nodeContent = "";
                    nodeName = span;
                },
                                {
                    nodeAttributeArray =                     (
                                                {
                            attributeName = class;
                            nodeContent = name;
                        }
                    );
                    nodeChildArray =                     (
                                                {
                            nodeAttributeArray =                             (
                                                                {
                                    attributeName = href;
                                    nodeContent = "http://www.food.com/library/flour-64";
                                }
                            );
                            nodeContent = "all-purpose flour";
                            nodeName = a;
                        }
                    );
                    nodeContent = "";
                    nodeName = span;
                }
            );
            nodeContent = "";
            nodeName = span;
        }
    );
    nodeContent = or;
    nodeName = span;
}

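The problem is that the "nodeContent" of the dictionary root is the text "or", while all the tags are stored as children of the root node, so the order of the fragments is lost. I cannot tell that the "or" actually sits in the middle of the text, and when I concatenate everything I get the following string: "or 500 g bread flour 500 g all-purpose flour".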
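To make the ordering problem concrete, here is a minimal sketch that reproduces it outside iOS. It is written in Python with the standard library's xml.etree.ElementTree rather than libxml2, on a simplified fragment. The helpers `flatten` and `naive_text` are hypothetical names that only imitate the dictionary shape shown above; they are not the code that produced it.

```python
import xml.etree.ElementTree as ET

# Simplified mixed-content fragment: the text " or " sits between two child elements.
FRAGMENT = (
    '<span class="ingredient">'
    '<a href="...">bread flour</a> or '
    '<span class="name">all-purpose flour</span>'
    '</span>'
)

def flatten(element):
    """Imitate the dictionary shape from the question (hypothetical re-creation).

    All text that belongs directly to the element is merged into one
    "nodeContent" string, so its position relative to the children is lost.
    """
    direct = [element.text or ""] + [child.tail or "" for child in element]
    return {
        "nodeName": element.tag,
        "nodeContent": "".join(direct).strip(),
        "nodeChildArray": [flatten(child) for child in element],
    }

def naive_text(node):
    """Concatenate a node's own content first, then its children's content."""
    parts = [node["nodeContent"]]
    parts += [naive_text(child) for child in node["nodeChildArray"]]
    return " ".join(p for p in parts if p)

tree = flatten(ET.fromstring(FRAGMENT))
print(tree["nodeContent"])  # or
print(naive_text(tree))     # or bread flour all-purpose flour
```

Once the direct text has been merged into a single field, nothing in the structure records that "or" came after the link and before the nested span.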

Can anyone find a way to extract the plain text in a single XPath query, or a way to use the XPath engine to read an ordered list of elements?

1 Answer:

Answer 0: (score: 0)

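Since you need all the text nodes, you can get them easily with

//text()

which returns every text node. Your content also contains whitespace-only text nodes, which you can omit with

//text()[not(matches(., '^\s+$'))]

Note that matches() is an XPath 2.0 function, while libxml2 implements XPath 1.0 only. With libxml2, use the XPath 1.0 equivalent //text()[normalize-space()] instead.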

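After that, you will still need to do some trimming in Objective-C (for example "g"), but you should get an ordered result set of all text nodes that contain printable characters.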
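The same idea (walk the text nodes in document order, skip the whitespace-only ones, trim the rest) can be sketched as follows. This is an illustration only, in Python with the standard library's xml.etree.ElementTree standing in for libxml2. The helper name `ordered_text` and the `<body>` wrapper are assumptions added for the example, and the elided href values are kept as "..." from the question.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, wrapped in a <body> element for illustration.
PAGE = """
<body>
<span class="ingredient">
    <span class="amount">
        <span class="value">500 </span>
        <span class="type">g</span>
    </span>
    <a href="...">bread flour</a>
    or
    <span class="ingredient">
        <span class="amount">
            <span class="value">500 </span>
            <span class="type">g</span>
        </span>
        <span class="name">
            <a href="...">all-purpose flour</a>
        </span>
    </span>
</span>
</body>
"""

def ordered_text(element):
    """Join the element's text nodes in document order (hypothetical helper)."""
    pieces = []
    for text in element.itertext():  # yields text nodes in document order
        trimmed = text.strip()       # trim, e.g. "500 " becomes "500"
        if trimmed:                  # drop whitespace-only nodes
            pieces.append(trimmed)
    return " ".join(pieces)

body = ET.fromstring(PAGE.strip())
# find() returns the first match in document order, i.e. the outer span.
outer = body.find(".//span[@class='ingredient']")
print(ordered_text(outer))  # 500 g bread flour or 500 g all-purpose flour
```

In an XPath 1.0 engine, scoping the query to the ingredient, for example //span[@class="ingredient"][1]//text()[normalize-space()], should play the same role as the filtering in the loop; the trimming and joining would still happen in Objective-C.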