Question

I have the following HTML code:

    <td class="role" style=""><a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Microsoft">Microsoft</a><br />
    <a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Corbis">Corbis</a><br />
    Co-Chair of the <a href="/wiki/Bill_%26_Melinda_Gates_Foundation">Bill &amp; Melinda   Gates Foundation</a><br />
    <a href="/wiki/Creative_Director" title="Creative Director" class="mw- redirect">Director</a> of <a href="/wiki/Berkshire_Hathaway">Berkshire Hathaway</a><br/>
    <a href="/wiki/CEO" class="mw-redirect" title="CEO">CEO</a> of <a  href="/wiki/Cascade_Investment">Cascade Investment</a></td>

The td element above semantically contains five lines, separated by "<br/>", and I want to get these five lines:

Chairman of Microsoft

Chairman of Corbis

Co-Chair of the Bill & Melinda Gates Foundation

Director of Berkshire Hathaway

CEO of Cascade Investment

Currently, my solution is to first get all the br elements inside this td, like so:

    br_value = td_node.select('.//br')

Then, for each item in br_value, I use the following code to get all the text:

    for br_item in br_value:
        one_item = br_item.select('.//preceding-sibling::*/text()').extract()

With this approach, I get the following lines:

Chairman Microsoft

Chairman Corbis

Bill & Melinda Gates Foundation

Director Berkshire Hathaway

CEO Cascade Investment

Compared with the text I want, these lines are missing the "of" between the links, along with some other text such as "Co-Chair of the".

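The reason is that "preceding-sibling::*" only returns sibling elements; it cannot return the text nodes that belong directly to the parent node, such as "of" in this case.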
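The difference between sibling elements and bare text nodes can be seen in a minimal sketch. It uses Python's standard-library `xml.etree.ElementTree` as a stand-in for the Scrapy selector (an assumption made only to keep the example self-contained), and a shortened, hypothetical fragment of the markup above:

```python
import xml.etree.ElementTree as ET

# Shortened fragment of the original <td>, assumed to be well-formed XML.
fragment = (
    '<td><a href="/wiki/Chairman">Chairman</a> of '
    '<a href="/wiki/Microsoft">Microsoft</a><br/></td>'
)
td = ET.fromstring(fragment)

# Text reachable through the element siblings of <br> only:
# this corresponds to what preceding-sibling::*/text() can see.
element_texts = [child.text for child in td if child.tag != "br"]

# Every text node inside the <td>, in document order,
# including the bare " of " that sits between the two links.
all_texts = list(td.itertext())

print(element_texts)
print(all_texts)
```

Here the " of " is not inside any `<a>` element, so walking the element siblings never reaches it, while iterating over all text nodes does.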

Does anyone here know how to extract the complete text of each line separated by the br tags?

Thanks

Answer 1

Use this XPath query:

    //div[@id='???']/descendant-or-self::*[not(ancestor-or-self::script or ancestor-or-self::noscript or ancestor-or-self::style)]/text()

That is, to select only the text from the current node and all of its descendant nodes, use a query of this kind: ./descendant-or-self::*/text()

Or, shorter (thanks to Empo): .//text()
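Selecting every text node returns the pieces in document order, but they still have to be grouped by the br elements. A minimal sketch of that grouping, using Python's standard-library `html.parser` rather than Scrapy (an assumption to keep it self-contained; the names `BrLineCollector` and `lines_between_br` are hypothetical):

```python
from html.parser import HTMLParser

SAMPLE_HTML = """<td class="role" style=""><a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Microsoft">Microsoft</a><br />
<a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Corbis">Corbis</a><br />
Co-Chair of the <a href="/wiki/Bill_%26_Melinda_Gates_Foundation">Bill &amp; Melinda   Gates Foundation</a><br />
<a href="/wiki/Creative_Director" title="Creative Director" class="mw- redirect">Director</a> of <a href="/wiki/Berkshire_Hathaway">Berkshire Hathaway</a><br/>
<a href="/wiki/CEO" class="mw-redirect" title="CEO">CEO</a> of <a  href="/wiki/Cascade_Investment">Cascade Investment</a></td>"""


class BrLineCollector(HTMLParser):
    """Collect all text in a fragment, starting a new line at every <br>."""

    def __init__(self):
        super().__init__()  # character references such as &amp; are decoded
        self.lines = []
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <br/> and <br />.
        if tag == "br":
            self._flush()

    def handle_data(self, data):
        # Receives every text node, whether inside a link or not.
        self._parts.append(data)

    def close(self):
        super().close()
        self._flush()  # emit the text after the last <br>

    def _flush(self):
        # Join the collected pieces and collapse runs of whitespace.
        line = " ".join("".join(self._parts).split())
        if line:
            self.lines.append(line)
        self._parts = []


def lines_between_br(html_text):
    parser = BrLineCollector()
    parser.feed(html_text)
    parser.close()
    return parser.lines


for line in lines_between_br(SAMPLE_HTML):
    print(line)
```

With the markup from the question this prints the five lines, including the "of" text between the links. In Scrapy, the same idea would presumably be applied to the result of `.//text()` on the td node together with the positions of the br elements.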

Answer 2

I wrote this small function:

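    function getCleanLines($rawContent)
    {
        $cleanLines = array();
        // Capture the inner HTML of the <td class="role"> cell.
        $regEx = '/<td\sclass="role"[^>]*>(?<CONTENT>.*?)<\/td>/ms';
        preg_match_all($regEx, $rawContent, $matches);

        if(isset($matches['CONTENT'][0]))
        {
            $content = $matches['CONTENT'][0];
            // Split the cell into chunks that end at a <br/> or at the end of the string.
            $regEx = '/(?<DATA>.*?)(?:<br\s*\/>|\z)/ms';
            preg_match_all($regEx, $content, $matchedLines);

            if(isset($matchedLines['DATA']))
            {
                foreach($matchedLines['DATA'] as $singleLine)
                {
                    // Remove the <a ...> and </a> tags but keep the link text.
                    $regEx = '#(<a[^>]*>)|(</a>)#';
                    $cleanLine = preg_replace($regEx,'',$singleLine);
                    // Collapse whitespace runs, trim the ends, and decode entities such as &amp;.
                    $cleanLine = html_entity_decode(trim(preg_replace('/\s+/', ' ', $cleanLine)));
                    if($cleanLine !== '')
                    {
                        $cleanLines[] = $cleanLine;
                    }
                }
            }
        }
        return $cleanLines;
    }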

Use it like this:

    $input = 'HERE PUT YOUR HTML FROM PREVIOUS POST';
    print_r(getCleanLines($input));
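Since the question itself is written in Python, the same regex-based idea can be sketched there as well. This is only a rough equivalent under the same assumptions as the PHP function (a single `<td class="role">` cell, no nested tables); the name `get_clean_lines` and the shortened sample are hypothetical:

```python
import html
import re

SAMPLE = (
    '<td class="role" style=""><a href="/wiki/Chairman">Chairman</a> of '
    '<a href="/wiki/Microsoft">Microsoft</a><br />\n'
    'Co-Chair of the <a href="/wiki/Bill_%26_Melinda_Gates_Foundation">'
    'Bill &amp; Melinda   Gates Foundation</a></td>'
)


def get_clean_lines(raw_content):
    """Split the role cell at each <br> and strip the remaining tags."""
    cell = re.search(r'<td\s+class="role"[^>]*>(.*?)</td>', raw_content, re.S)
    if cell is None:
        return []
    lines = []
    for chunk in re.split(r'<br\s*/?>', cell.group(1)):
        text = re.sub(r'<[^>]+>', '', chunk)  # drop the <a ...> and </a> tags
        text = html.unescape(text)            # turn &amp; back into &
        text = " ".join(text.split())         # collapse runs of whitespace
        if text:
            lines.append(text)
    return lines


print(get_clean_lines(SAMPLE))
```

Regular expressions are fragile on arbitrary HTML, so this is best treated as a quick fix for markup of this exact shape.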

How to parse the following HTML code to get all the text before a "br" tag
