Question

I have a working program that uses Mechanize to search Google, but when it runs the search it also picks up sites like http://webcache.googleusercontent.com/.

I want to keep that site from being stored in the file. The sites' URLs are all structured differently.

Source code:

require 'mechanize'

PATH = Dir.pwd
SEARCH = "test"

def info(input)
  puts "[INFO]#{input}"
end

def get_urls
  info("Searching for sites.")
  agent = Mechanize.new
  page = agent.get('http://www.google.com/')
  google_form = page.form('f')
  google_form.q = "#{SEARCH}"
  url = agent.submit(google_form, google_form.buttons.first)
  url.links.each do |link|
    if link.href.to_s =~ /url.q/
      str = link.href.to_s
      str_list = str.split(%r{=|&}) 
      urls_to_log = str_list[1]
      success("Site found: #{urls_to_log}")
      File.open("#{PATH}/temp/sites.txt", "a+") {|s| s.puts("#{urls_to_log}")}
    end
  end
  info("Sites dumped into #{PATH}/temp/sites.txt")
end

get_urls

Answer 1

It works now. I had a problem with the success('log') call; I don't know why, but I commented it out.

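  str_list = str.split(%r{=|&})
  # Skip results whose host is Google's cache.
  next if str_list[1].split('/')[2] == "webcache.googleusercontent.com"
  urls_to_log = str_list[1]
  # `success` is not defined in the question's script, so this call stays commented out.
  # success("Site found: #{urls_to_log}")
  File.open("#{PATH}/temp/sites.txt", "a+") {|s| s.puts("#{urls_to_log}")}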
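
To see why that comparison works, here is a small standalone sketch that runs the same two splits on a made-up href. The sample href and the `host_of` helper name are assumptions for illustration; they are not taken from a real Google result page.

```ruby
# Hypothetical href in the "/url?q=..." form that the question's regex matches.
SAMPLE_HREF = '/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:abc&sa=U'

# Mirrors the answer's logic: split on "=" or "&" and take the second piece
# (the target URL), then split that on "/" and take the third piece (the host).
def host_of(href)
  target = href.split(%r{=|&})[1]
  target.split('/')[2]
end

puts host_of(SAMPLE_HREF)
```

Positional splitting like this is brittle: it assumes the target URL is always the first query parameter, so it is only as reliable as that assumption.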

Answer 2

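Well-tested wheels already exist for splitting URLs into their component parts, so use them. Ruby comes with URI, which lets us easily extract the host, path or query:

require 'uri'

URL = 'http://foo.com/a/b/c?d=1'

URI.parse(URL).host
# => "foo.com"
URI.parse(URL).path
# => "/a/b/c"
URI.parse(URL).query
# => "d=1"

Ruby's Enumerable module includes reject and select, which make it easy to loop over an array or other enumerable object and reject or select elements from it:

(1..3).select{ |i| i.even? } # => [2]
(1..3).reject{ |i| i.even? } # => [1, 3]

Using all that, you could check the host of each URL and reject any you don't want.

Using these methods and techniques, you can reject or select from an input file, or just look at individual URLs and choose to ignore or honor them.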
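
As a rough sketch of how URI and reject might be combined to filter on the host, consider the following. The `BLOCKED_HOSTS` list, the `blocked?` helper, and the sample URLs are assumptions made for illustration; treating unparseable URLs as "not blocked" is also just one possible choice.

```ruby
require 'uri'

# Hypothetical list of hosts to skip; extend as needed.
BLOCKED_HOSTS = ['webcache.googleusercontent.com'].freeze

# True when the URL's host is one of the blocked hosts.
# URLs that fail to parse, or that have no host, are treated as not blocked.
def blocked?(url)
  host = URI.parse(url).host
  !host.nil? && BLOCKED_HOSTS.include?(host)
rescue URI::InvalidURIError
  false
end

# Made-up sample input.
urls = [
  'http://foo.com/a/b/c?d=1',
  'http://webcache.googleusercontent.com/search?q=cache:abc'
]

# Keep only the URLs whose host is not blocked.
kept = urls.reject { |url| blocked?(url) }
puts kept
```

The same `blocked?` check could be applied to each URL before it is written to the file, or to the lines of an existing file after the fact.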

Reject information from being stored in a file
