How do I get plain but formatted text from HTML using mechanize / nokogiri?

Asked: 2011-06-03 06:17:11

Tags: ruby nokogiri mechanize

require 'rubygems'
require 'mechanize'

rational = Mechanize.new { |agent|
  agent.user_agent_alias = 'Windows Mozilla'
}
results = rational.get(ARGV[0])
puts results.content

This gives me HTML, but I want plain text. Ideally the text would keep some of its formatting.

1 answer:

Answer 0: (score: 5)

This code gives you plain, unformatted text from the document, either the whole page or one specific element:

require 'mechanize'
require 'nokogiri'

rational = Mechanize.new { |agent|
    agent.user_agent_alias = 'Windows Mozilla'
}

document = Nokogiri::HTML(rational.get(ARGV[0]).content)

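# Taking the text of the whole document gives a very dirty result
# (script and style contents, navigation, and so on):
# results = document.inner_text

# My suggestion is to extract the text of one specific element.
# The selector below is a placeholder; replace it with one that matches your page.
results = document.css("#content .my-element-with-some-contents").inner_text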