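I have HTML files for many web pages, each containing a lot of information. I am trying to extract some of the content and put it into an XML file or an Excel spreadsheet. All the pages are very similar in design, and the information sits in the same place on every page. Does anyone know a way to do this?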
Answer 0: (score: 2)
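There are many scraping libraries that can help you extract data from HTML pages.
Web scraping and crawling are not always straightforward, so the right tool depends on what you are trying to achieve. Different products, SDKs, and libraries focus on different aspects of scraping or crawling. Here are a few you can look at:
Apify - (formerly Apifier) a cloud-based web scraping tool that can extract structured data from any website using a few simple lines of JavaScript.
Diffbot - automatically extracts data from web pages and returns structured JSON.
Espion - a headless browser that lets you inject JavaScript code directly into the target page.
Also, if you know Node.js, node-osmosis is a really nice, easy-to-use library.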
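Because the pages in the question all share one layout, the extraction step itself can be quite small. The following is only a rough sketch using PHP's built-in DOM extension rather than any of the tools listed above; the sample markup, the `extractFields` helper, and the XPath queries are all assumptions made for illustration and would need to be adapted to the real pages.

```php
<?php
// Hypothetical page markup; the real pages will use different tags, ids and classes.
$html = <<<PAGE
<html><body>
  <h1 class="title">Blue bicycle</h1>
  <div id="details"><span class="price">120 USD</span></div>
</body></html>
PAGE;

/**
 * Extract fields from one page using a map of field name => XPath query.
 * Returns an associative array; a field that is not found becomes ''.
 * (Hypothetical helper, not part of any library.)
 */
function extractFields(string $html, array $queries): array
{
    $doc = new DOMDocument();
    // Real-world HTML is often malformed; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($queries as $field => $query) {
        $nodes = $xpath->query($query);
        // Take the text of the first matching node, or '' when nothing matches.
        $result[$field] = ($nodes !== false && $nodes->length > 0)
            ? trim($nodes->item(0)->textContent)
            : '';
    }
    return $result;
}

// One query per field; the same map can be reused for every page with this layout.
$queries = [
    'title' => '//h1[@class="title"]',
    'price' => '//div[@id="details"]/span[@class="price"]',
    'phone' => '//span[@class="phone"]', // not present in the sample page
];

$fields = extractFields($html, $queries);
print_r($fields);
```

Keeping the queries in a single map means that a change in page layout only requires editing that map, not the extraction logic.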
Answer 1: (score: 1)
I strongly recommend this library:
http://sourceforge.net/projects/simplehtmldom/
/**
* Website: http://sourceforge.net/projects/simplehtmldom/
* Acknowledge: Jose Solorzano (https://sourceforge.net/projects/php-html/)
* Contributions by:
* Yousuke Kumakura (Attribute filters)
* Vadim Voituk (Negative indexes supports of "find" method)
* Antcs (Constructor with automatically load contents either text or file/url)
*
* all affected sections have comments starting with "PaperG"
*
* Paperg - Added case insensitive testing of the value of the selector.
* Paperg - Added tag_start for the starting index of tags - NOTE: This works but not accurately.
* This tag_start gets counted AFTER \r\n have been crushed out, and after the remove_noice calls so it will not reflect the REAL position of the tag in the source,
* it will almost always be smaller by some amount.
* We use this to determine how far into the file the tag in question is. This "percentage will never be accurate as the $dom->size is the "real" number of bytes the dom was created from.
* but for most purposes, it's a really good estimation.
* Paperg - Added the forceTagsClosed to the dom constructor. Forcing tags closed is great for malformed html, but it CAN lead to parsing errors.
* Allow the user to tell us how much they trust the html.
* Paperg add the text and plaintext to the selectors for the find syntax. plaintext implies text in the innertext of a node. text implies that the tag is a text node.
* This allows for us to find tags based on the text they contain.
* Create find_ancestor_tag to see if a tag is - at any level - inside of another specific tag.
* Paperg: added parse_charset so that we know about the character set of the source document.
* NOTE: If the user's system has a routine called get_last_retrieve_url_contents_content_type availalbe, we will assume it's returning the content-type header from the
* last transfer or curl_exec, and we will parse that and use it in preference to any other method of charset detection.
*
* Found infinite loop in the case of broken html in restore_noise. Rewrote to protect from that.
* PaperG (John Schlick) Added get_display_size for "IMG" tags.
*
* Licensed under The MIT License
* Redistributions of files must retain the above copyright notice.
*
* @author S.C. Chen <me578022@gmail.com>
* @author John Schlick
* @author Rus Carroll
* @version 1.5 ($Rev: 196 $)
* @package PlaceLocalInclude
* @subpackage simple_html_dom
*/
/**
* All of the Defines for the classes below.
* @author S.C. Chen <me578022@gmail.com>
*/
Here is an example:
// Load the Simple HTML DOM library (adjust the path to where you saved it).
include_once 'simple_html_dom.php';

$html = file_get_html($ad_bachecubano_url);
// Proceed to capture the text
$anuncio['header'] = $html->find('.headingText', 0)->plaintext;
$anuncio['body'] = $html->find('.showAdText', 0)->plaintext;
// Look for the block labelled "Precio: " (price)
$priceBlocks = $html->find('#lineBlock');
foreach ($priceBlocks as $possibleprice) {
    $item = $possibleprice->find('.headingText2', 0)->plaintext;
    if ($item == "Precio: ") {
        $precio = $possibleprice->find('.normalText', 0)->plaintext;
        // getFinalPrice() is a method of the class this snippet was taken from
        $anuncio['price'] = $this->getFinalPrice($precio);
    }
}
// Contact details: "Nombre: " (name) and "Teléfono: " (phone)
$contactbox = $html->find('#contact');
foreach ($contactbox as $contact) {
    $boxes = $contact->find('#lineBlock');
    foreach ($boxes as $box) {
        $key = $box->find('.headingText2', 0)->plaintext;
        $value = $box->find('.normalText', 0)->plaintext;
        if ($key == "Nombre: ") {
            $anuncio['nombre'] = $value;
        }
        if ($key == "Teléfono: ") {
            $anuncio['phone'] = $value;
        }
    }
}
// scrapeemail() is a helper defined elsewhere in the same project;
// fall back to an empty string when it finds no address.
$emails = scrapeemail($anuncio['body']);
$anuncio['email'] = $emails[0][0] ?? '';
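The question also asks how to get the extracted data into an XML file or a spreadsheet. As a minimal sketch, assuming each page has already been reduced to an array shaped like `$anuncio`, the records could be written out with PHP's built-in `fputcsv` (a CSV file opens directly in Excel) and `DOMDocument`. The sample records, the `writeCsv` and `toXml` helpers, the `ads`/`ad` element names, and the output paths are all hypothetical; passing an empty escape character to `fputcsv` assumes PHP 7.4 or later.

```php
<?php
// Hypothetical records, shaped like the $anuncio array built above.
$records = [
    ['header' => 'Bicycle', 'price' => '120', 'nombre' => 'Ana',
     'phone' => '555-0100', 'email' => 'ana@example.com'],
    ['header' => 'Desk & chair', 'price' => '80', 'nombre' => 'Luis',
     'phone' => '555-0101', 'email' => ''],
];

/** Write the records to a CSV file, one row per record, with a header row. */
function writeCsv(string $path, array $records, array $columns): void
{
    $handle = fopen($path, 'w');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    fputcsv($handle, $columns, ',', '"', '');
    foreach ($records as $record) {
        $row = [];
        foreach ($columns as $column) {
            // Missing fields become empty cells so the columns stay aligned.
            $row[] = $record[$column] ?? '';
        }
        fputcsv($handle, $row, ',', '"', '');
    }
    fclose($handle);
}

/** Build an XML document with one <ad> element per record. */
function toXml(array $records, array $columns): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;
    $root = $doc->appendChild($doc->createElement('ads'));
    foreach ($records as $record) {
        $ad = $root->appendChild($doc->createElement('ad'));
        foreach ($columns as $column) {
            $element = $doc->createElement($column);
            // Text nodes are escaped on output, so "&" and "<" in values are safe.
            $element->appendChild($doc->createTextNode((string) ($record[$column] ?? '')));
            $ad->appendChild($element);
        }
    }
    return $doc->saveXML();
}

$columns = ['header', 'price', 'nombre', 'phone', 'email'];
$csvPath = sys_get_temp_dir() . '/ads.csv';
writeCsv($csvPath, $records, $columns);

$xml = toXml($records, $columns);
file_put_contents(sys_get_temp_dir() . '/ads.xml', $xml);
```

In a real run, `$records` would be filled by looping over the HTML files and appending one `$anuncio` array per page before calling the two helpers.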