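I'm not good at writing patterns to extract data. I have a very long document, and below is the specific string I need to extract values from.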
<p><span id="minPrice">XXXX<a href="YYYYY" target="_blank"><span>¥ZZZZZ</span></a></span>
I want to extract the XXXX, YYYYY, and ZZZZZ values.
My first step was to get XXXX<a href="YYYYY" target="_blank"><span>¥ZZZZZ
$pattern = '/<p><span id="minPrice">^</span></a></span>/';
preg_match($pattern, $data, $matches);
echo ($matches[1]);
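But it doesn't work.
So how can I extract XXXX, YYYYY, and ZZZZZ? :(
My document is full of badly encoded characters, so I can't use loadHTML. It just returns errors.
Update 1: I was able to do this: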
var_dump(libxml_use_internal_errors(true));
$DOM = new DOMDocument;
$DOM->loadHTML($data);
$items = $DOM->getElementById('minPrice');
$items is
DOMElement Object
(
[tagName] => span
[schemaTypeInfo] =>
[nodeName] => span
[nodeValue] => 最安価格(税込):¥131,649
[nodeType] => 1
[parentNode] => (object value omitted)
[childNodes] => (object value omitted)
[firstChild] => (object value omitted)
[lastChild] => (object value omitted)
[previousSibling] =>
[nextSibling] => (object value omitted)
[attributes] => (object value omitted)
[ownerDocument] => (object value omitted)
[namespaceURI] =>
[prefix] =>
[localName] => span
[baseURI] =>
[textContent] => 最安価格(税込):¥131,649
)
The HTML is
<span id="minPrice">
�ň����i(�ō�)�F
<a href="http://kakaku.com/shop/1115/?pdid=K0000693648&lid=shop_itemview_saiyasukakaku" target="_blank">
<span>¥131,649</span>
</a>
</span>
How can I extract http://kakaku.com/shop/1115/?pdid=K0000693648&lid=shop_itemview_saiyasukakaku and 131,649?
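Answer 0 (score: 1)
You can enable internal error handling for the DOM parser with the following line, so that malformed markup no longer raises warnings: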
libxml_use_internal_errors(true);
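As a minimal sketch of what that switch does: with internal errors enabled, loadHTML stops emitting PHP warnings for malformed markup, and any problems are queued so they can be inspected with libxml_get_errors() and discarded with libxml_clear_errors(). The sample markup and the variable names below are illustrative assumptions, not the asker's real document.

```php
<?php
// Deliberately sloppy markup: an unclosed <p> and a raw "&" in the URL.
$html = '<p><span id="minPrice">XXXX<a href="u?a=1&b=2" target="_blank"><span>ZZZZZ</span></a></span>';

// Queue parse problems instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Look at what libxml recorded, then empty the queue.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore whatever error-handling mode was active before.
libxml_use_internal_errors($previous);

printf("loaded: %s, problems recorded: %d\n", $loaded ? 'yes' : 'no', count($errors));
```

Clearing the queue matters in long-running scripts, because the recorded errors otherwise accumulate across documents.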
Then you can access the data you need with this sample code:
$html = <<<DATA
<p><span id="minPrice">最安価格(税込):<a href="http://kakaku.com/shop/1115/?pdid=K0000693648&lid=shop_itemview_saiyasukakaku" target="_blank"><span>¥131,649</span></a></span>
DATA;
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$spans = $xpath->query('//span[@id="minPrice"]'); // Get all spans with ID=minPrice
$a = array();
foreach ($spans as $span) {
    foreach ($span->childNodes as $child) { // Check the child nodes
        if ($child->nodeName == "a") {
            array_push($a, $child->getAttribute("href"));
        }
    }
    // Read the span's own text instead of relying on the leftover $child loop variable
    array_push($a, preg_replace('~^.*?(\d+(?:,\d+)*)$~u', '$1', $span->nodeValue));
}
print_r($a);
Result:
Array
(
[0] => http://kakaku.com/shop/1115/?pdid=K0000693648&lid=shop_itemview_saiyasukakaku
[1] => 131,649
)
I used a regex to extract the number at the end of the string, but you could also use explode with the yen sign as the delimiter.
$num = explode(html_entity_decode("&yen;"), $span->nodeValue)[1];
array_push($a, $num);
See another demo.
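The markup in the question's update spreads the span over several lines, so the text nodes contain newlines. As a hedged sketch (not the answer's own code), the link and the price could also be read directly with XPath, taking the digits from the link's text with preg_match. The "label:" placeholder text and the escaped &amp; / &yen; entities in the sample are assumptions made to keep the example self-contained.

```php
<?php
$html = <<<HTML
<span id="minPrice">
label:
<a href="http://kakaku.com/shop/1115/?pdid=K0000693648&amp;lid=shop_itemview_saiyasukakaku" target="_blank">
<span>&yen;131,649</span>
</a>
</span>
HTML;

libxml_use_internal_errors(true); // tolerate malformed input
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// First <a> that is a direct child of the span with id="minPrice".
$link = $xpath->query('//span[@id="minPrice"]/a')->item(0);

$href = null;
$price = null;
if ($link instanceof DOMElement) {
    $href = $link->getAttribute('href');
    // Keep only the comma-separated digit groups from the link text.
    if (preg_match('/\d+(?:,\d+)*/', $link->textContent, $m)) {
        $price = $m[0];
    }
}
print_r(array($href, $price));
```

Searching for the digits anywhere in the text, rather than anchoring them to the end of the string, keeps the extraction working when the node text ends in whitespace.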
Answer 1 (score: 0)
Use this regexp (the yen sign may appear literally or as the &yen; entity) -
/<p><span.*id=\"minPrice\">(.*)<a.*href="(.*?)".*>.*<span>(?:¥|&yen;)(.*?)<\/span>.*/
Result -
XXXX
YYYYY
ZZZZZ
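For context, here is a minimal sketch of how a pattern of this kind could be applied with preg_match. The pattern is a tightened variant written for this example (non-greedy groups and [^>]* instead of .*), not the answer's exact regex, and $label, $href and $price are hypothetical names. The u modifier is left out on purpose: with it, preg_match fails on input that is not valid UTF-8, which the question says is the case for the real document.

```php
<?php
$data = '<p><span id="minPrice">XXXX<a href="YYYYY" target="_blank"><span>¥ZZZZZ</span></a></span>';

// The yen sign is matched either literally or as the &yen; entity.
$pattern = '~<span[^>]*id="minPrice"[^>]*>(.*?)<a[^>]*href="(.*?)"[^>]*>\s*<span>(?:¥|&yen;)(.*?)</span>~s';

if (preg_match($pattern, $data, $matches)) {
    // $matches[0] holds the whole match; the captures start at index 1.
    list(, $label, $href, $price) = $matches;
    echo $label, "\n", $href, "\n", $price, "\n";
} else {
    echo "no match\n";
}
```

Note that the question's own attempt printed $matches[1] although its pattern contained no capture group, so there was nothing at that index to print.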
Answer 2 (score: 0)
This can be done with a regular expression that matches the exact markup:
$regex = "/<p><span id=\"minPrice\">(.*?)<a href=\"(.*?)\" target=\"_blank\"><span>¥(.*)<\/span><\/a>/";
preg_match($regex, $data, $matches);
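However, as mentioned in the comments, a regex is not the right tool for this task. If the document is long and these patterns are nested (i.e. if XXXX is itself another one of these paragraphs), this regex may fail. You should probably look into repairing the document so that it is valid XHTML and then use a proper XML parser. You can mitigate the problem by running the regex on each line of input (assuming it is properly split into lines), but that is still not ideal.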
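The line-by-line mitigation mentioned above could look roughly like the following sketch. It assumes each paragraph sits on its own line, which is not guaranteed for the real document; the sample input and the array keys are made up for illustration, and the last capture group is made non-greedy here.

```php
<?php
$data = '<p><span id="minPrice">XXXX<a href="YYYYY" target="_blank"><span>¥ZZZZZ</span></a></span>' . "\n"
      . '<p>unrelated line</p>' . "\n"
      . '<p><span id="minPrice">XXXX2<a href="YYYYY2" target="_blank"><span>¥ZZZZZ2</span></a></span>' . "\n";

$regex = '/<p><span id="minPrice">(.*?)<a href="(.*?)" target="_blank"><span>¥(.*?)<\/span><\/a>/';

$rows = array();
// Split on any newline style and test each line on its own,
// so one match can never run into the next paragraph.
foreach (preg_split('/\R/', $data) as $line) {
    if (preg_match($regex, $line, $m)) {
        $rows[] = array('label' => $m[1], 'href' => $m[2], 'price' => $m[3]);
    }
}
print_r($rows);
```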
Answer 3 (score: 0)
Use this, and sorry for my bad English! One option is PHP Simple HTML DOM Parser; download the library first. Code:
require_once '/simple_html_dom.php';
//here put content or block or DOM
$html = str_get_html('<p><span id="minPrice">最安価格(税込)<a href="YYYYY" target="_blank"><span>¥ZZZZZ</span></a></span>');
//OR
//use file_get_html to load from a file if needed
//$html = file_get_html('example.html');
//select links, and use first element
$link = $html->find('p span#minPrice a',0);//select links, and use first element
//get url
$href = $link->href;
//get text in span
$span_in_link = $link->find('span',0)->plaintext;
//delete <a></a>
$link->outertext = '';
//get text in span
$span_id_minPrice = $html->find('p span#minPrice',0)->plaintext;
//delete ¥
$span_in_link = str_replace('¥','',$span_in_link);
//result
echo $span_id_minPrice.'<br>';//最安価格(税込)
echo $href.'<br>';//YYYYY
echo $span_in_link.'<br>';//ZZZZZ
If you have more than one of these, use this:
//select all span
$html = str_get_html('
<p><span id="minPrice">XXXX<a href="YYYYY" target="_blank"><span>¥ZZZZZ</span></a></span>
<p><span id="minPrice">XXXX2<a href="YYYYY2" target="_blank"><span>¥ZZZZZ2</span></a></span>
');
$all_span = $html->find('p span#minPrice');
$data = array();
foreach($all_span as $element)
{
$array = array();
$link = $element->find('a',0);//select links, and use first element
//get url
$href = $link->href;
//get text in span
$span_in_link = $link->plaintext;
//delete a
$link->innertext = '';
//get text in span
$span_id_minPrice = $element->plaintext;
//delete ¥
$span_in_link = str_replace('¥','',$span_in_link);
$array['span#minPrice'] = $span_id_minPrice ;
$array['href'] = $href;
$array['span_in_link'] = $span_in_link;
$data [] = $array;
}
echo '<pre>';
print_r($data);
Result:
Array (
[0] => Array
(
[span#minPrice] => XXXX
[href] => YYYYY
[span_in_link] => ZZZZZ
)
[1] => Array
(
[span#minPrice] => XXXX2
[href] => YYYYY2
[span_in_link] => ZZZZZ2
)
)