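Extracting text between nodes with Nokogiri in a Ruby script

Time: 2015-12-11 23:29:35

Tags: ruby nokogiri

I'm using Nokogiri (v1.6.6) with Ruby (v2.2) to scrape data from an HTML file. The target data sits inside <p> elements, as shown below. I can grab all of the text content with the following: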

require 'nokogiri'

doc = Nokogiri::HTML(DATA.read)

doc.css("div.listing > p").each do |p|
  puts p.text
end

__END__
<div class="listing">
  <p><span>1</span> Details1 <span>info1</span></p>
  <p><span>2</span> Details2 <span>info2</span></p>
  <p><span>3</span> Details3 <span>info3</span></p>
</div>

This returns:

1 Details1 info1
2 Details2 info2
3 Details3 info3

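While I can easily parse the text inside the <span> tags, I haven't figured out how to get the "Details#" text that sits between them. It would be easy to do with a regular expression, but I'd like to know whether there is a way to do it directly with Nokogiri. The goal is to return: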

Details1
Details2
Details3

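Is this possible using Nokogiri's built-in functionality?

2 Answers:

Answer 0 (score: 1)

I think you would find the answer if you dug a little into "Getting Mugged by Nokogiri", but I'll work through your problem: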

irb(main):061:0> doc = Nokogiri::HTML("<div class='listing'> <p><span>1</span> Details1 <span>info1</span></p> <p><span>2</span> Details2 <span>info2</span></p> <p><span>3</span> Details3 <span>info3</span></p> </div>")

This gives you a Nokogiri document object named doc:

=> #<Nokogiri::HTML::Document:0x2ab03653f26c name="document" children=[#<Nokogiri::XML::DTD:0x2ab03653ef4c name="html">, #<Nokogiri::XML::Element:0x2ab03653ece0 name="html" children=[#<Nokogiri::XML::Element:0x2ab03653eb00 name="body" children=[#<Nokogiri::XML::Element:0x2ab03653e920 name="div" attributes=[#<Nokogiri::XML::Attr:0x2ab03653e8bc name="class" value="listing">] children=[#<Nokogiri::XML::Text:0x2ab03653e484 " ">, #<Nokogiri::XML::Element:0x2ab03653e3d0 name="p" children=[#<Nokogiri::XML::Element:0x2ab03653e1f0 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653e010 "1">]>, #<Nokogiri::XML::Text:0x2ab03653de58 " Details1 ">, #<Nokogiri::XML::Element:0x2ab03653dda4 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653db9c "info1">]>]>, #<Nokogiri::XML::Text:0x2ab03653d8f4 " ">, #<Nokogiri::XML::Element:0x2ab03653d840 name="p" children=[#<Nokogiri::XML::Element:0x2ab03653d660 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653d480 "2">]>, #<Nokogiri::XML::Text:0x2ab03653d2dc " Details2 ">, #<Nokogiri::XML::Element:0x2ab03653d228 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653d048 "info2">]>]>, #<Nokogiri::XML::Text:0x2ab03653cdb4 " ">, #<Nokogiri::XML::Element:0x2ab03653cd00 name="p" children=[#<Nokogiri::XML::Element:0x2ab03653cb20 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653c940 "3">]>, #<Nokogiri::XML::Text:0x2ab03653c79c " Details3 ">, #<Nokogiri::XML::Element:0x2ab03653c6e8 name="span" children=[#<Nokogiri::XML::Text:0x2ab03653c508 "info3">]>]>, #<Nokogiri::XML::Text:0x2ab03653c274 " ">]>]>]>]>

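Then you can traverse the document:

  "The traverse method recursively walks all of a node's children. We check whether the node is a text node and whether its parent is a paragraph."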

irb(main):068:0> doc.at_css("body").traverse do |node|
irb(main):069:1*   if node.text? && (node.parent.name == "p")
irb(main):070:2>     puts node.content
irb(main):071:2>   end
irb(main):072:1> end
Details1 
Details2 
Details3 
=> nil
irb(main):073:0>

I have to admit I didn't know about traverse, so your question was helpful to me as well, since I use Nokogiri every day. I hope you find this answer useful.

Answer 1 (score: 0)

Here is what I ended up with:

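doc.css("div.listing > p").each do |p|
  puts p.at_xpath('./text()').text.strip
end

According to "Get text directly inside a tag in Nokogiri", the text() XPath node test will

  get all the direct children with text, but not any further sub-children

That is the behavior I'm seeing, and it produces the desired result. Note that at_xpath returns only the first matching text node, which is sufficient here because each <p> has a single text node between its two <span> elements.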