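How to extract data from this string

Time: 2016-03-18 08:50:06

Tags: php regex

I am not good at writing patterns to extract data. I have a very long document, and below is the specific string I need to extract values from.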

<p><span id="minPrice">XXXX<a href="YYYYY" target="_blank"><span>&yen;ZZZZZ</span></a></span>

I want to extract the XXXX, YYYYY, and ZZZZZ values.

My first step is to get XXXX<a href="YYYYY" target="_blank"><span>&yen;ZZZZZ

$pattern = '/<p><span id="minPrice">^</span></a></span>/';
preg_match($pattern, $data, $matches);
echo ($matches[1]);

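But it does not work. So how can I extract XXXX, YYYYY, and ZZZZZ? :(

My document is full of incorrectly encoded characters, so I cannot use loadHTML. It just returns errors.

Update 1: So I was able to do this: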

        var_dump(libxml_use_internal_errors(true));
        $DOM = new DOMDocument;
        $DOM->loadHTML($data);
        $items = $DOM->getElementById('minPrice');

$items is

 DOMElement Object
(
    [tagName] => span
    [schemaTypeInfo] => 
    [nodeName] => span
    [nodeValue] => 最安価格(税込):¥131,649
    [nodeType] => 1
    [parentNode] => (object value omitted)
    [childNodes] => (object value omitted)
    [firstChild] => (object value omitted)
    [lastChild] => (object value omitted)
    [previousSibling] => 
    [nextSibling] => (object value omitted)
    [attributes] => (object value omitted)
    [ownerDocument] => (object value omitted)
    [namespaceURI] => 
    [prefix] => 
    [localName] => span
    [baseURI] => 
    [textContent] => 最安価格(税込):¥131,649
)

The HTML is

<span id="minPrice">
    �ň����i(�ō�)�F
    <a href="http://kakaku.com/shop/1115/?pdid=K0000693648&lid=shop_itemview_saiyasukakaku" target="_blank">
        <span>&yen;131,649</span>
    </a>
</span>

How can I extract http://kakaku.com/shop/1115/?pdid=K0000693648&lid=shop_itemview_saiyasukakaku and 131,649?

4 answers:

Answer 0: (score: 1)

You can enable internal error handling for the DOM parser with the following line of code:

libxml_use_internal_errors(true);
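As a rough sketch of what that switch does: with internal errors enabled, loadHTML() no longer prints a warning for each piece of bad markup. The problems are collected instead and can be read back with libxml_get_errors(). The sample markup below (including the stray closing tag) is made up for illustration; which messages libxml reports for it depends on the libxml version.

```php
<?php
// Collect parser problems internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Illustrative, deliberately sloppy markup (an assumption, not the real page).
$sample = '<p><span id="minPrice">XXXX<a href="YYYYY"><span>&yen;ZZZZZ</span></a></span></div>';

$dom = new DOMDocument();
$dom->loadHTML($sample);

// Inspect whatever libxml complained about, then empty the error buffer.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors();

// The document is still usable despite the sloppy input.
$found = $dom->getElementById('minPrice') !== null;
var_dump($found);
```

Clearing the buffer with libxml_clear_errors() keeps it from growing when many documents are parsed in one run.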

Then you can access the data you need with this sample code:

$html = <<<DATA
<p><span id="minPrice">最安価格(税込):<a href="http://kakaku.com/shop/1115/?pdid=K0000693648&lid=shop_itemview_saiyasukakaku" target="_blank"><span>&yen;131,649</span></a></span>
DATA;

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);
$spans = $xpath->query('//span[@id="minPrice"]');   // Get all spans with ID=minPrice
$a = array();
foreach ($spans as $span) {
    foreach ($span->childNodes as $child) {          // Check the child nodes
        if ($child->nodeName == "a") {
            array_push($a, $child->getAttribute("href"));
            // Read the price from the <a> node itself; after the loop, $child would be whichever node happened to come last
            array_push($a, preg_replace('~^.*?(\d+(?:,\d+)*)$~u', '$1', trim($child->nodeValue)));
        }
    }
}

print_r($a);

Result:

Array
(
    [0] => http://kakaku.com/shop/1115/?pdid=K0000693648&lid=shop_itemview_saiyasukakaku
    [1] => 131,649
)

I used a regex to extract the number at the end of the string, but you could also use explode with the yen sign:

$num = explode(html_entity_decode("&yen;"), $child->nodeValue)[1];
array_push($a, $num);

See another demo.
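The child-node loop can also be replaced by two XPath expressions that address the link and the inner span directly. This is only a sketch under assumptions: extractMinPrice() is a hypothetical helper name, and it assumes the first span with id="minPrice" is the one you want.

```php
<?php
// Hypothetical helper: returns the link target and the price for span#minPrice.
function extractMinPrice(string $html): array
{
    libxml_use_internal_errors(true);   // keep parse warnings out of the output
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // string(...) yields the string value of the first matching node, or '' if none.
    $href = $xpath->evaluate('string(//span[@id="minPrice"]/a/@href)');
    $text = $xpath->evaluate('string(//span[@id="minPrice"]/a/span)');

    // Drop everything except digits and thousands separators (removes the yen sign).
    $price = preg_replace('~[^\d,]~u', '', $text);

    return ['href' => $href, 'price' => $price];
}

$html = '<p><span id="minPrice">XXXX<a href="YYYYY" target="_blank"><span>&yen;131,649</span></a></span>';
print_r(extractMinPrice($html));
```

Because the expressions do not depend on the order of child nodes, whitespace text nodes around the link do not affect the result.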

Answer 1: (score: 0)

Use this regexp:

/<p><span.*id=\"minPrice\">(.*)<a.*href="(.*?)".*>.*<span>.*;(.*?)<\/span>.*/

Result:

  1. XXXX
  2. YYYYY
  3. ZZZZZ
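For completeness, here is a minimal sketch of how that pattern could be applied with preg_match. The $data string is the sample line from the question, and the variable names $label, $href and $price are illustrative only.

```php
<?php
// Sample line from the question.
$data = '<p><span id="minPrice">XXXX<a href="YYYYY" target="_blank"><span>&yen;ZZZZZ</span></a></span>';

// The pattern from this answer, unchanged.
$pattern = '/<p><span.*id=\"minPrice\">(.*)<a.*href="(.*?)".*>.*<span>.*;(.*?)<\/span>.*/';

if (preg_match($pattern, $data, $matches)) {
    // $matches[0] is the whole match; groups 1 to 3 are the captured values.
    [, $label, $href, $price] = $matches;
    echo $label, "\n";   // XXXX
    echo $href, "\n";    // YYYYY
    echo $price, "\n";   // ZZZZZ
} else {
    echo "no match\n";
}
```

Note that the greedy `.*` parts assume one such element per line; with several on the same line the groups can swallow more than intended.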

Answer 2: (score: 0)

This can be done with a regular expression that matches the string exactly:

$regex = "/<p><span id=\"minPrice\">(.*?)<a href=\"(.*?)\" target=\"_blank\"><span>&yen;(.*)<\/span><\/a>/";
preg_match($regex, $data, $matches);

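However, as mentioned in the comments, a regular expression is not the right tool for this task. If the document is long and these patterns are nested (i.e. if XXXX is itself another one of these paragraphs), this regex may fail. You should probably look into fixing the document so that it is proper XHTML, and then use a proper XML parser. You can mitigate the problem by running the regex on each line of input (assuming the input is correctly split into lines), but that is still not ideal.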
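If you do stay with a regex, a sketch like the following handles several occurrences by making every group lazy and collecting the matches with preg_match_all. It assumes the markup always has exactly this attribute order and spacing; the two-line sample and the array keys are made up for illustration.

```php
<?php
// Two sample paragraphs, one per line (illustrative data).
$data = <<<'HTML'
<p><span id="minPrice">XXXX<a href="YYYYY" target="_blank"><span>&yen;ZZZZZ</span></a></span>
<p><span id="minPrice">XXXX2<a href="YYYYY2" target="_blank"><span>&yen;ZZZZZ2</span></a></span>
HTML;

// Same idea as the regex above, but with "~" as delimiter and all groups lazy,
// so one match cannot run across into the next occurrence.
$regex = '~<p><span id="minPrice">(.*?)<a href="(.*?)" target="_blank"><span>&yen;(.*?)</span></a>~';

preg_match_all($regex, $data, $matches, PREG_SET_ORDER);

$rows = [];
foreach ($matches as $m) {
    $rows[] = ['label' => $m[1], 'href' => $m[2], 'price' => $m[3]];
}

print_r($rows);
```

This still breaks as soon as an attribute is added or reordered, which is the main reason a parser is the safer choice.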

Answer 3: (score: 0)

Man, use this, and sorry for my bad English! PHP Simple HTML DOM Parser (download the library) is one option for this. Code:

// Adjust the path to wherever the library file lives
require_once 'simple_html_dom.php';

// Pass in the page content, a block of it, or a DOM string
$html = str_get_html('<p><span id="minPrice">最安価格(税込)<a href="YYYYY" target="_blank"><span>&yen;ZZZZZ</span></a></span>');
// Or load it from a file or URL if needed:
// $html = file_get_html('example.html');

// Select the links and take the first one
$link = $html->find('p span#minPrice a', 0);
// Get the URL
$href = $link->href;
// Get the text of the span inside the link
$span_in_link = $link->find('span', 0)->plaintext;
// Remove the <a>...</a> element
$link->outertext = '';
// Get the remaining text of span#minPrice
$span_id_minPrice = $html->find('p span#minPrice', 0)->plaintext;
// Strip the &yen; entity
$span_in_link = str_replace('&yen;', '', $span_in_link);
// Result
echo $span_id_minPrice.'<br>'; // 最安価格(税込)
echo $href.'<br>';             // YYYYY
echo $span_in_link.'<br>';     // ZZZZZ

If you have more than one of these, use this:

// Select all matching spans
$html = str_get_html('
    <p><span id="minPrice">XXXX<a href="YYYYY" target="_blank"><span>&yen;ZZZZZ</span></a></span>
    <p><span id="minPrice">XXXX2<a href="YYYYY2" target="_blank"><span>&yen;ZZZZZ2</span></a></span>
');
$all_span = $html->find('p span#minPrice');
$data = array();
foreach ($all_span as $element) {
    $array = array();
    // Select the links and take the first one
    $link = $element->find('a', 0);
    // Get the URL
    $href = $link->href;
    // Get the text inside the link
    $span_in_link = $link->plaintext;
    // Empty the link so its text is not included below
    $link->innertext = '';
    // Get the remaining text of span#minPrice
    $span_id_minPrice = $element->plaintext;
    // Strip the &yen; entity
    $span_in_link = str_replace('&yen;', '', $span_in_link);

    $array['span#minPrice'] = $span_id_minPrice;
    $array['href'] = $href;
    $array['span_in_link'] = $span_in_link;

    $data[] = $array;
}

echo '<pre>';
print_r($data);

Result:

Array
(

[0] => Array
    (
        [span#minPrice] => XXXX 
        [href] => YYYYY
        [span_in_link] => ZZZZZ 
    )

[1] => Array
    (
        [span#minPrice] => XXXX2 
        [href] => YYYYY2
        [span_in_link] => ZZZZZ2 
    )