Question

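I need help with HTML parsing. I tried to find an answer before posting this question but could not. I have stored complete HTML blog pages in a database table. Now I want to extract text and images from that HTML, but only the text and images that belong to one specific paragraph, not the whole page.

See the sample below, which contains a lot of markup. It has three paragraphs. I need to extract only the text and images of paragraph 2, which is the one relevant to my requirement. (I have a keyword I can search for, so I can identify which paragraph to extract.)

How can I extract a specific paragraph's text and images from any blog page? The keyword I search for in the HTML is Keyword = PRODUCT ABC. I am using PHP.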

<html>
<!-- Javascript: tag come here --->
<!-- Head: tag come here --->
<!-- Meta: tag come here --->
<!-- Title: tag come here --->
<!-- Links: tag come here --->
<!-- Javascript: tag come here --->

<body>

<!-- Lot of other code come here about links, javascript, headings etc -->
<!-- DIV: tag come here --->

<p> "PARAGRAPH 1, This paragraph contain only some text." </p>
<!-- Script: tag come here --->

<p> PARAGRAPH 2, It has some information about PRODUCT ABC...</p>
<img /> <!-- some images come here related to this paragraph.-->
<img /> <!-- some images come here related to this paragraph.-->
<img /> <!-- some images come here related to this paragraph.-->
<!-- Script: tag come here --->

<p> PARAGRAPH 3, This paragraph contain only some text. </p>
<img /> <!-- some images come here related to this paragraph.-->
<!-- Links: tag come here --->
<!-- Javascript: tag come here --->

</body>
</html>

Answer 1

I agree with dreamwiever. Although, this is an HTML forum. :P

Use this code:

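// Requires the PHP Simple HTML DOM Parser library.
include 'simple_html_dom.php';

$html = file_get_html('http://www.google.com/');
// Passing index 0 makes find() return a single element (or null), not an array.
$par = $html->find('p[id=hello]', 0);
if ($par) {
    foreach ($par->find('img') as $element) {
        echo $element->src . "\n";
    }
}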
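The snippet above only sees images nested inside the paragraph, while in the question's sample the images come after the paragraph as siblings. Here is a minimal sketch of that case using PHP's built-in DOMDocument instead of the Simple HTML DOM library. The function name `extractParagraphWithImages` and the `src` values in the sample are made up for illustration, and it assumes the images are direct siblings of the `<p>` (images wrapped in a `<div>` would be missed).

```php
<?php
// Sketch: find the first <p> whose text contains $keyword, then collect the
// src of every <img> inside it or following it, up to the next <p>.
// Assumption: the images are siblings of the paragraph, as in the sample.
function extractParagraphWithImages(string $html, string $keyword): ?array
{
    $dom = new DOMDocument();
    // Blog HTML is rarely valid, so keep libxml parser warnings quiet.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('p') as $p) {
        if (stripos($p->textContent, $keyword) === false) {
            continue; // keyword not in this paragraph
        }
        $images = [];
        // Images nested inside the paragraph itself.
        foreach ($p->getElementsByTagName('img') as $img) {
            $images[] = $img->getAttribute('src');
        }
        // Images that follow the paragraph, until the next <p> starts.
        for ($node = $p->nextSibling; $node !== null; $node = $node->nextSibling) {
            if ($node->nodeType !== XML_ELEMENT_NODE) {
                continue; // skip text and comment nodes
            }
            if ($node->nodeName === 'p') {
                break;
            }
            if ($node->nodeName === 'img') {
                $images[] = $node->getAttribute('src');
            }
        }
        return ['text' => trim($p->textContent), 'images' => $images];
    }
    return null; // no paragraph contains the keyword
}

// Hypothetical sample modelled on the question's markup.
$sample = '<html><body>'
    . '<p>PARAGRAPH 1, This paragraph contain only some text.</p>'
    . '<p>PARAGRAPH 2, It has some information about PRODUCT ABC...</p>'
    . '<img src="a.jpg" /><img src="b.jpg" /><!-- Script: tag come here -->'
    . '<p>PARAGRAPH 3, This paragraph contain only some text.</p>'
    . '<img src="c.jpg" />'
    . '</body></html>';

print_r(extractParagraphWithImages($sample, 'PRODUCT ABC'));
```

The keyword match is case-insensitive (`stripos`). If the stored pages are UTF-8, you will probably need to tell `loadHTML` the encoding, since it does not assume UTF-8 by default.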

Answer 2

If you only need to extract a simple tag, you can use a regex.

Simply:

$html = "<html><head></head><body><div>sometext</div><div><p>myPtag</p></div><div> some other text</div></body></html>";

preg_match('/<p>(.*?)<\/p>/',$html,$getTheP);

//and simply call what you want from extraction 
var_dump($getTheP);

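And if you want to match something specific inside the <p> tag, you can simply write a new pattern to get what you want.

For example, say we want the <p> that contains somestring: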

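// The pattern needs its closing "/" delimiter; the capture group now also
// keeps any text that follows somestring inside the paragraph.
preg_match('/<p>(.*?somestring.*?)<\/p>/', $html, $matchesWithSomeString);

var_dump($matchesWithSomeString);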
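One caveat with a single pattern like this: the lazy `.*?` can start at an earlier `<p>` and run across paragraph boundaries until it reaches the keyword. A rough way around that, assuming PHP 7.4+ for the arrow function, is to capture every paragraph first and filter afterwards. The helper name `paragraphsContaining` is hypothetical, and this is still a regex sketch, so it will not cope with every kind of markup.

```php
<?php
// Hypothetical helper: return the inner HTML of each <p> that mentions $keyword.
function paragraphsContaining(string $html, string $keyword): array
{
    // "s" lets "." match newlines, "i" ignores tag case,
    // and [^>]* allows attributes on the <p> tag.
    preg_match_all('/<p\b[^>]*>(.*?)<\/p>/si', $html, $matches);

    // Keep only the paragraphs containing the keyword, reindexed from 0.
    return array_values(array_filter(
        $matches[1],
        fn(string $inner): bool => stripos($inner, $keyword) !== false
    ));
}

$html = "<div><p>first</p><p class=\"x\">has somestring\ninside</p><p>last</p></div>";
var_dump(paragraphsContaining($html, 'somestring'));
```

Each match here starts at a `<p>` and stops at the first `</p>` after it, so the keyword test is always applied to one paragraph at a time.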

Need help extracting specific paragraph text and images from a whole blog page's HTML