Need to scrape a web page for email IDs and phone numbers

Time: 2020-08-24 05:47:24

Tags: ruby web-scraping web-crawler nokogiri open-uri

require 'open-uri'
require 'nokogiri'

def scrap(url)
  # Kernel#open no longer opens URLs on Ruby 3.0+; open-uri provides URI.open.
  html = URI.open(url).read
  nokogiri_doc = Nokogiri::HTML(html)
  final_array = []

  nokogiri_doc.search("a").each do |element|
    element = element.text
    final_array << element
  end

  # Print each link text (the block variable is the text, not an index).
  final_array.each do |text|
    puts text
  end
end


scrap('http://www.infranetsol.com/')

With this code I only get the text of the a tags, but I need to put the email IDs and phone numbers into an Excel file.

1 answer:

Answer 0 (score: 0)

All you have is text. So what you can do is keep only the strings that look like an email address or a phone number.

For instance, if you save the result in an array:

a = scrap('http://www.infranetsol.com/')

You can get the elements that contain an email address (strings with an '@'):

a.select { |s| s.match(/.*@.*/) }
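Note that select keeps the whole link text, not just the address inside it. As a minimal sketch, assuming the link texts look like the hypothetical sample strings below, scan can pull out only the address part. EMAIL_PATTERN and extract_emails are illustrative names, and the pattern is deliberately loose rather than a full email validator.

```ruby
# Loose, illustrative email pattern (an assumption, not a full validator).
EMAIL_PATTERN = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/

# Hypothetical link texts standing in for the array returned by scrap.
sample_texts = [
  'Home',
  'Mail us: info@example.com',
  'sales@example.org',
  'Contact'
]

# scan returns every match in a string; flat_map merges the per-string
# arrays into one flat list, and uniq drops repeated addresses.
def extract_emails(texts)
  texts.flat_map { |text| text.scan(EMAIL_PATTERN) }.uniq
end

emails = extract_emails(sample_texts)
puts emails
```

Strings without an '@' simply contribute no matches, so no separate select step is needed here.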

You can get the elements that contain a phone number (strings with at least 5 consecutive digits):

a.select{ |s| s.match(/\d{5}/) }
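The \d{5} check needs five digits in a row, so a number written as 040-234-5678 would be missed. One possible alternative, sketched here under the assumption that counting digits is good enough for your page, is to count all digits in the string and ignore separators. phone_candidate? and the threshold of 7 are hypothetical choices, not part of the original answer.

```ruby
# Hypothetical link texts standing in for the array returned by scrap.
sample_texts = [
  'Call +91 98765 43210',
  'Office: 040-234-5678',
  'Products 2020',
  'About us'
]

# Treat a string as a phone candidate when it contains at least
# min_digits digits in total, ignoring spaces, dashes and brackets.
# The default of 7 is an assumption; tune it for your data.
def phone_candidate?(text, min_digits = 7)
  text.count('0-9') >= min_digits
end

candidates = sample_texts.select { |text| phone_candidate?(text) }
puts candidates
```

This is still a heuristic: any text with enough digits (an order number, for example) will pass, so the results are worth checking by eye.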

The whole code:

require 'open-uri'
require 'nokogiri'

def scrap(url)
  # Kernel#open no longer opens URLs on Ruby 3.0+; open-uri provides URI.open.
  html = URI.open(url).read
  nokogiri_doc = Nokogiri::HTML(html)

  # Collect the text of every <a> element.
  final_array = nokogiri_doc.search("a").map(&:text)

  final_array.each { |text| puts text }

  # Return the array explicitly so the caller can filter it.
  final_array
end


a = scrap('http://www.infranetsol.com/')
email = a.select { |s| s.match(/.*@.*/) }
phone = a.select{ |s| s.match(/\d{5}/) }

# In your example, email will hold the email addresses and,
# unfortunately, phone will hold one complex string.
# You can use scan to extract the phone numbers from the text and flat_map
# to get a flat array without sub-arrays.
# But keep in mind that this will only work with this particular text.

phone.flat_map{ |elt| elt.scan(/\d[\d ]*/) }
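Since the question asks for an Excel file, here is a minimal sketch of the last step, assuming a CSV file (which Excel can open) is acceptable. A native .xlsx file would need a third-party gem and is not shown. The sample values, the helper names extract_phones and contact_rows, the output path contacts.csv, and the 7-digit threshold are all hypothetical. The pattern is a slight variant of the one above that ends on a digit, so matches carry no trailing spaces.

```ruby
require 'csv'

# Hypothetical values standing in for the email and phone arrays above.
emails = ['info@example.com', 'sales@example.org']
phone_texts = ['Phone: 040 2345 6789 / 98765 43210']

# Find runs of digits and spaces that start and end with a digit, then
# drop runs with fewer than min_digits digits as noise.
def extract_phones(texts, min_digits = 7)
  texts.flat_map { |text| text.scan(/\d[\d ]*\d/) }
       .select { |number| number.count('0-9') >= min_digits }
end

# Build one row per index: emails in the first column, phones in the
# second. The two columns are independent lists; a row does not imply
# that the email and the phone belong to the same contact.
def contact_rows(emails, phones)
  row_count = [emails.size, phones.size].max
  Array.new(row_count) { |i| [emails[i], phones[i]] }
end

phones = extract_phones(phone_texts)

# contacts.csv is a hypothetical output path.
CSV.open('contacts.csv', 'w') do |csv|
  csv << %w[email phone]
  contact_rows(emails, phones).each { |row| csv << row }
end
```

When one list is shorter than the other, the missing cells are written as empty fields.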
