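I'm using Nokogiri (v1.6.6) with Ruby (v2.2) to scrape data from an HTML file. The target data sits inside <p> elements, as shown below. I can grab all of the text content with the following: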
require 'nokogiri'
doc = Nokogiri::HTML(DATA.read)
doc.css("div.listing > p").each do |p|
  puts p.text
end
__END__
<div class="listing">
<p><span>1</span> Details1 <span>info1</span></p>
<p><span>2</span> Details2 <span>info2</span></p>
<p><span>3</span> Details3 <span>info3</span></p>
</div>
This returns:
1 Details1 info1
2 Details2 info2
3 Details3 info3
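While I can easily parse the text inside the <span> tags, I haven't figured out how to get the "Details#" text that sits between them. It would be easy with a regular expression, but I'd like to see whether there is a way to do it directly with Nokogiri. The goal is to return: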
Details1
Details2
Details3
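Is this possible using Nokogiri's built-in functionality?
Answer 0 (score: 1)
I think you would find the answer if you dug into "Getting Mugged by Nokogiri" a little, but I'll address your problem directly: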
irb(main):061:0> doc = Nokogiri::HTML("<div class='listing'> <p><span>1</span> Details1 <span>info1</span></p> <p><span>2</span> Details2 <span>info2</span></p> <p><span>3</span> Details3 <span>info3</span></p> </div>")
This gives you a Nokogiri object named doc:
=> #<Nokogiri::HTML::Document:0x2ab03653f26c name="document" children=[#<Nokogiri::XML::DTD:0x2ab03653ef4c name="html">, #<Nokogiri::XML::Element:0x2ab03653ece0 name="html" children=[#<Nokogiri::XML::Element:0x2ab03653eb00 name="body" children=[#<Nokogiri::XML::Element:0x2ab03653e920 name="div" attributes=[#<Nokogiri::XML::Attr:0x2ab03653e8bc name="class" value="listing">] children=[#<Nokogiri::XML::Text:0x2ab03653e484 " ">, #<Nokogiri::XML::Element:0x2ab03653e3d0 name="p" children=[#<Nokogiri::XML::Element:0x2ab03653e1f0 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653e010 "1">]>, #<Nokogiri::XML::Text:0x2ab03653de58 " Details1 ">, #<Nokogiri::XML::Element:0x2ab03653dda4 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653db9c "info1">]>]>, #<Nokogiri::XML::Text:0x2ab03653d8f4 " ">, #<Nokogiri::XML::Element:0x2ab03653d840 name="p" children=[#<Nokogiri::XML::Element:0x2ab03653d660 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653d480 "2">]>, #<Nokogiri::XML::Text:0x2ab03653d2dc " Details2 ">, #<Nokogiri::XML::Element:0x2ab03653d228 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653d048 "info2">]>]>, #<Nokogiri::XML::Text:0x2ab03653cdb4 " ">, #<Nokogiri::XML::Element:0x2ab03653cd00 name="p" children=[#<Nokogiri::XML::Element:0x2ab03653cb20 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653c940 "3">]>, #<Nokogiri::XML::Text:0x2ab03653c79c " Details3 ">, #<Nokogiri::XML::Element:0x2ab03653c6e8 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653c508 "info3">]>]>, #<Nokogiri::XML::Text:0x2ab03653c274 " ">]>]>]>]>
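You can then iterate over the object:
"The traverse method recursively walks all of a node's children. We check whether the node is a text node and whether its parent is a paragraph."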
irb(main):068:0> doc.at_css("body").traverse do |node|
irb(main):069:1* if node.text? && (node.parent.name == "p")
irb(main):070:2> puts node.content
irb(main):071:2> end
irb(main):072:1> end
Details1
Details2
Details3
=> nil
irb(main):073:0>
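I have to say I didn't know about traverse, so your question was helpful to me, since I use Nokogiri every day. I hope you find this answer useful.
Answer 1 (score: 0)
This is what I ended up with: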
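doc.css("div.listing > p").each do |p|
  puts p.at_xpath('./text()').text.strip
end
According to "Get text directly inside a tag in Nokogiri", the text() XPath expression will
"get all direct children with text, but not any further sub-children".
That is the behavior I'm seeing, and it produces the expected result.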