Question

The title says it all. How can I get the text between HTML nodes using PHP? Any ideas? Below is my HTML structure.

<html>
<head>
    <title>Test Page</title>
</head>
<body>
    <div id="outer">
        <div id="first">
            <p class="this">Hello</p>
            <p class="this">Community</p>
        </div>
        <div id="second">
            <p class="that">Stack</p>
            <p class="that">Overflow</p>
        </div>
    </div>
</body>

Expected output:

HelloStackOverflowCommunity

Answer 1

This is simple. Get the PHP Simple HTML DOM Parser here: http://sourceforge.net/projects/simplehtmldom/files/

Then use the following code:

/* include simpledom*/
include('simple_html_dom.php');

/* load html string */
$html_string = <<<HTML
<html>
<head>
    <title>Test Page</title>
</head>
<body>
    <div id="outer">
        <div id="first">
            <p class="this">Hello</p>
            <p class="this">Community</p>
        </div>
        <div id="second">
            <p class="that">Stack</p>
            <p class="that">Overflow</p>
        </div>
    </div>
</body>
</html>
HTML;

/* create simple dom object from html */
$html = str_get_html($html_string);

/* find all paragraph elements */
$paragraph = $html->find('div[id=outer] div p');

/* loop through all elements and get inner text */
foreach($paragraph as $p){
    echo $p->innertext;
}

Cheers,

Roy

Answer 2

You can try:

$text = strip_tags($html);

http://www.php.net/manual/en/function.strip-tags.php

This gets you most of the way there. It leaves spaces and line breaks behind, but those are easy to remove.

$clean = str_replace(array(' ',"\n","\r"),'',$text);

http://www.php.net/manual/en/function.str-replace.php

Using it on your example gives:

TestPageHelloCommunityStackOverflow

If you want to keep some of the spaces, you can try:

$clean = trim(implode('',explode("\n",$text)));

Which gives the following (as displayed in a browser, where runs of spaces collapse):

Test Page Hello Community Stack Overflow

Many variations are possible.
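One such variation, as a rough sketch: collapse every run of whitespace into a single space with preg_replace. The variable names and the shortened sample markup are assumptions made for illustration, not part of the answer above.

```php
<?php
// Shortened sample markup (an assumption made for this sketch).
$html = <<<HTML
<div id="first">
    <p class="this">Hello</p>
    <p class="this">Community</p>
</div>
HTML;

// Remove the tags; the indentation and newlines are left behind.
$text = strip_tags($html);

// Collapse each run of whitespace into one space, then trim both ends.
$clean = trim(preg_replace('/\s+/', ' ', $text));

echo $clean, "\n"; // Hello Community
```

This keeps exactly one space between words, regardless of how the markup was indented.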

Answer 3

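Using regular expressions to parse HTML is strongly discouraged. Use the Simple HTML DOM library: http://sourceforge.net/projects/simplehtmldom/files/simplehtmldom/
Include it: include 'simple_html_dom.php';
Get the tags you need (with $html loaded via str_get_html, as in Answer 1): $tags = $html->find('p');
Build an array: $a = array(); foreach ($tags as $tag) $a[] = $tag->innertext;
Build the string: $string = $a[0] . $a[2] . $a[3] . $a[1];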

Answer 4

I suggest using PHP's built-in DOMDocument rather than a third-party class such as simplehtmldom.

On large HTML files those classes are very slow (I have used them).

<?php
$html ='
<html>
<head>
    <title>Test Page</title>
</head>
<body>
    <div id="outer">
        <div id="first">
            <p class="this">Hello</p>
            <p class="this">Community</p>
        </div>
        <div id="second">
            <p class="that">Stack</p>
            <p class="that">Overflow</p>
        </div>
    </div>
</body>
';

// a new DOM object
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;

// load the HTML into the object
$dom->loadHTML($html);
// get the body tag
$body = $dom->getElementsByTagName('body')->item(0);
// loop through all elements inside the body
foreach ($body->getElementsByTagName('*') as $element) {
    // print the text of the element's first child node, if it has one
    if ($element->firstChild !== null) {
        print $element->firstChild->textContent;
    }
}

The output is HelloCommunity StackOverflow
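If only the paragraph text is wanted, in the order given in the question, one possible sketch uses DOMXPath on top of DOMDocument. The XPath expression, the variable names, and the explicit index order are assumptions chosen for illustration; the reordering simply mirrors the expected output from the question.

```php
<?php
$html = <<<HTML
<html>
<body>
    <div id="outer">
        <div id="first">
            <p class="this">Hello</p>
            <p class="this">Community</p>
        </div>
        <div id="second">
            <p class="that">Stack</p>
            <p class="that">Overflow</p>
        </div>
    </div>
</body>
</html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// Collect the trimmed text of every <p> inside div#outer, in document order.
$texts = [];
foreach ($xpath->query('//div[@id="outer"]//p') as $p) {
    $texts[] = trim($p->textContent);
}

// Document order is Hello, Community, Stack, Overflow; pick the order wanted.
$result = $texts[0] . $texts[2] . $texts[3] . $texts[1];

echo $result, "\n"; // HelloStackOverflowCommunity
```

Querying only the p elements avoids printing the text of wrapper elements, so no stray whitespace ends up in the result.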

Answer 5

Try this:

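function getTextBetweenTags($string, $tagname)
{
    // [^>]* allows attributes on the opening tag, e.g. <p class="this">
    $pattern = "/<$tagname\b[^>]*>(.*?)<\/$tagname>/s";
    // preg_match() captures only the first occurrence
    if (preg_match($pattern, $string, $matches)) {
        return $matches[1];
    }
    return null;
}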

You will have to loop over the $matches array...
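A sketch of that loop, assuming preg_match_all is used to collect every occurrence. The function name getAllTextBetweenTags and the attribute-tolerant pattern are hypothetical, and the approach is only reasonable for small, predictable markup without nested tags of the same name.

```php
<?php
// Hypothetical helper: returns the inner text of every <$tagname> element.
function getAllTextBetweenTags(string $string, string $tagname): array
{
    // Escape the tag name so it is safe to embed in the pattern.
    $tag = preg_quote($tagname, '/');
    // [^>]* tolerates attributes; the s flag lets . match newlines.
    $pattern = "/<{$tag}\\b[^>]*>(.*?)<\\/{$tag}>/s";
    preg_match_all($pattern, $string, $matches);
    // $matches[1] holds the first capture group of every match.
    return $matches[1];
}

$html = '<p class="this">Hello</p><p class="that">Stack</p>';

foreach (getAllTextBetweenTags($html, 'p') as $text) {
    echo $text, "\n";
}
```

For anything beyond simple fragments like this one, a DOM parser remains the more robust choice.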

Get text between HTML nodes with PHP

5 answers: