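I have a database table full of articles. In some cases, the bottom of an article contains a block that I want to parse for information. For example, here are two possible values from the articles table: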
<p>Test test <blockquote class="pull">text quote</blockquote></p>
<p> </p>
<p><span class="italic">italic text</span></p>
<div class="bottom-block"><div class="picture" style="background-image:url('/generator?f=somepicture.jpg');"></div><div class="blurb">Blurb about person<a href="http://website.com">http://website.com</a></div></div>
And another example:
<p>Some content</p>
<div class="bottom-block"><img alt="John Doe" class="picture" src="/assets/images/JOHN_DOE_1.jpg"><div class="blurb"><p>John Doe is a guy from Texas. <a href="http://johnswebsite.com" target="_blank">John's Website</a> and has a large following.</p></div></div>
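The above are two sample values from the database. Now I want to be able to extract certain pieces of information. More precisely, I want to extract Name, Url, ImageName and Blurb.
For the first example, after running a query against that value, I would expect to see:
Name:
Url: http://website.com
ImageName: somepicture.jpg
Blurb: Blurb about person<a href="http://website.com">http://website.com</a>
For the second example:
Name: John Doe
Url: http://johnswebsite.com
ImageName: JOHN_DOE_1.jpg
Blurb: <p>John Doe is a guy from Texas. <a href="http://johnswebsite.com" target="_blank">John's Website</a> and has a large following.</p>
I have been playing with a SQL query that does a decent job, but the results are still very inconsistent.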
SELECT id, url, content,
  TRIM(BOTH '\n' FROM TRIM(TRAILING '</div>\n</div>' FROM TRIM(TRAILING '</div></div>' FROM TRIM(SUBSTRING(content, LOCATE('class="bottom-block"',content)+18))))) as block_extract,
  TRIM(BOTH '\n' FROM TRIM(TRAILING '</div>\n</div>' FROM TRIM(TRAILING '</div></div>' FROM TRIM(SUBSTRING(content, LOCATE('class="blurb"',content)+12))))) as blurb
FROM articles
WHERE content LIKE '%bottom-block%'
GROUP BY block_extract;
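Answer 0 (score: 1)
OK, so I don't know how to do this with a SQL query, but here is how I would do it in PHP. The basic idea is to run five separate regex matches and then print the results. The matches are: the bottom block itself, then the image file names, URLs, blurbs and names inside it.
Here is some code to demonstrate.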
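// GET THE BOTTOM BLOCK CONTENT ($mysql_row MUST BE THE ARTICLE'S HTML AS A STRING)
preg_match('~(?<=<div class="bottom-block">).*?(?=</div>$)~ims', $mysql_row, $bottom_block_array);
// FALL BACK TO AN EMPTY STRING WHEN THE ARTICLE HAS NO BOTTOM BLOCK
$string = isset($bottom_block_array[0]) ? $bottom_block_array[0] : '';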
// GRAB THE IMAGES
preg_match_all('~[A-Z0-9_]+\.(?:jpg|jpeg|gif|png)(?=\'|")~i', $string, $images);
$images = $images[0];
// GRAB THE URLS
preg_match_all('~(?<=href=").*?(?=")~ims', $string, $urls);
$urls = $urls[0];
// GRAB THE BLURBS
preg_match_all('~(?<=<div class="blurb">).*?(?=</div>)~ims', $string, $blurbs);
$blurbs = $blurbs[0];
// GRAB THE NAMES
preg_match_all('~(?<=alt=").*?(?=")~ims', $string, $names);
$names = $names[0];
// LOOP THROUGH AND PRINT OUT ALL OF THE NAMES (OR ONLY ONE, IF DESIRED)
if ($names) {
    foreach ($names AS $name) {print "\nName: ".$name;} // USE THIS IF YOU WANT ALL OF THE NAMES
    // print "\nName: ".$names[0]; // USE THIS IF YOU ONLY WANT ONE POSSIBLE NAME TO SHOW UP
}
else {print "\nName:";}
if ($urls) {
    foreach ($urls AS $url) {print "\nUrl: ".$url;} // PRINT OUT ALL URLS
    // print "\nUrl: ".$urls[0]; // PRINT OUT ONLY ONE URL
}
else {print "\nUrl:";}
if ($images) {
    foreach ($images AS $image) {print "\nImageName: ".$image;} // PRINT OUT ALL THE IMAGES
    // print "\nImageName: ".$images[0]; // PRINT OUT ONLY ONE IMAGE
}
else {print "\nImageName:";}
if ($blurbs) {
    foreach ($blurbs AS $blurb) {print "\nBlurb: ".$blurb;} // PRINT OUT ALL OF THE BLURBS
    // print "\nBlurb: ".$blurbs[0]; // PRINT OUT ONLY ONE BLURB
}
else {print "\nBlurb:";}
print "\n\n\n\n\n";
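The code above prints every match. If only one value per field is wanted, as in the expected output in the question, the same patterns can be wrapped in a function. This is a minimal sketch rather than part of the original answer: the function name `extract_bottom_block` and the shape of the returned array are assumptions made for illustration, and it relies on the bottom block being the last thing in the article, just like the original pattern.

```php
<?php
// Hypothetical helper (name and return shape are assumptions, not from the answer).
// Returns the first match for each field, or '' when the field is absent.
function extract_bottom_block(string $content): array
{
    $fields = ['Name' => '', 'Url' => '', 'ImageName' => '', 'Blurb' => ''];

    // Isolate the inside of the bottom block, up to the last </div> on a line.
    if (!preg_match('~(?<=<div class="bottom-block">).*?(?=</div>$)~ims', $content, $m)) {
        return $fields; // this article has no bottom block
    }
    $block = $m[0];

    // Same patterns as above, one per field.
    $patterns = [
        'Name'      => '~(?<=alt=").*?(?=")~is',
        'Url'       => '~(?<=href=").*?(?=")~is',
        'ImageName' => '~[A-Z0-9_]+\.(?:jpg|jpeg|gif|png)(?=\'|")~i',
        'Blurb'     => '~(?<=<div class="blurb">).*?(?=</div>)~is',
    ];
    foreach ($patterns as $field => $pattern) {
        if (preg_match($pattern, $block, $match)) {
            $fields[$field] = $match[0];
        }
    }
    return $fields;
}
```

Calling it on the second sample value from the question should give `John Doe`, `http://johnswebsite.com`, `JOHN_DOE_1.jpg` and the `<p>...</p>` blurb. On the first sample, `Name` stays empty because there is no `alt` attribute.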
Answer 1 (score: 1)
Here is a DOM-based approach:
$results = array();
$fields = array('name', 'img', 'url', 'blurb');
$queries = array('name' => '//img/@alt',
'img' => '//img[@class = "picture"]/@style |
//img/@src |
//div[@class = "picture"]/@style',
'url' => '//div[@class = "blurb"]//a/@href',
'blurb' => '//div[@class = "blurb"]');
$imgPattern = <<<'EOD'
~
(?|
.*? background-image:url\( [^)]*? ([^?="\')/]+ \.(?:png|jpe?g|gif) ).*
|
.*? ([^=;/]+)$
)
~ix
EOD;
// $data is expected to be an array of HTML strings, one per article
foreach ($data as $html) {
    // reset per-article state so nothing leaks over from the previous article
    $tmp = array('other' => '');
    $bbnode = null;

    $srcDom = new DOMDocument();
    @$srcDom->loadHTML($html); // @ silences warnings about imperfect markup
    $elts = $srcDom->getElementsByTagName("body")->item(0)->childNodes;

    // separate the bottom block from the rest of the article
    foreach ($elts as $elt) {
        if ( $elt->nodeType === XML_ELEMENT_NODE &&
             $elt->hasAttribute('class') &&
             $elt->getAttribute('class') == 'bottom-block' ) {
            $bbnode = $elt;
        } else {
            $tmp['other'] .= $srcDom->saveHTML($elt);
        }
    }

    if ( $bbnode ) {
        // copy the bottom block into its own document so the XPath queries see nothing else
        $bbDom = new DOMDocument();
        $bbDom->appendChild($bbDom->importNode($bbnode, true));
        $xpath = new DOMXPath($bbDom);

        foreach ($fields as $field) {
            $nodes = $xpath->query($queries[$field]);
            if ( $field == 'blurb' ) {
                // the blurb keeps its inner HTML; stays empty when there is no blurb div
                $tmp[$field] = '';
                if ( $nodes->length ) {
                    foreach ($nodes->item(0)->childNodes as $child) {
                        $tmp[$field] .= $bbDom->saveHTML($child);
                    }
                }
            } else {
                $tmp[$field] = ($nodes->length) ? $nodes->item(0)->nodeValue : '';
            }
        }
        // reduce the style or src value to the bare file name
        $tmp['img'] = preg_replace($imgPattern, '$1', $tmp['img']);
    }

    $results[] = $tmp;
}
echo htmlspecialchars(print_r($results, true));
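The `$imgPattern` step is the least obvious part, so here it is in isolation. The pattern uses a branch reset group `(?|...)`, which means the file name ends up in `$1` whichever alternative matches: the first handles a `background-image:url(...)` style value, the second takes the last path segment of a plain `src`. The wrapper function `image_name` is a hypothetical name added only for this sketch; the pattern itself is the one from the answer.

```php
<?php
// Hypothetical wrapper around the answer's pattern, for illustration only.
function image_name(string $attrValue): string
{
    // The x flag lets the pattern contain whitespace and line breaks.
    $imgPattern = <<<'EOD'
~
(?|
    .*? background-image:url\( [^)]*? ([^?="\')/]+ \.(?:png|jpe?g|gif) ) .*
  |
    .*? ([^=;/]+) $
)
~ix
EOD;
    // Replace the whole attribute value with the captured file name.
    return preg_replace($imgPattern, '$1', $attrValue);
}

// The style value from the first sample and the src value from the second.
echo image_name("background-image:url('/generator?f=somepicture.jpg');"), "\n"; // somepicture.jpg
echo image_name('/assets/images/JOHN_DOE_1.jpg'), "\n";                          // JOHN_DOE_1.jpg
```

The closing `EOD;` marker is kept at the start of the line on purpose, since PHP versions before 7.3 require that.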