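I'm trying to scrape all the email addresses on a given site using a single-file Ruby script. At the bottom of the file I have a hard-coded test case, using a URL whose page lists an email address (so it should find an email address on the first iteration of the first loop).
For some reason, my regex doesn't seem to match: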
#get_emails.rb
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'mechanize'
require 'uri'
require 'anemone'
class GetEmails
  def initialize
    @urlCounter, @anemoneCounter = 0
    $allUrls, $emailUrls, $emails = []
  end
  def has_email?(listingUrl)
    hasListing = false
    Anemone.crawl(listingUrl) do |anemone|
      anemone.on_every_page do |page|
        body_text = page.body.to_s
        matchOrNil = body_text.match(/\A[^@\s]+@[^@\s]+\z/)
        if matchOrNil != nil
          $emailUrls[$anemoneCounter] = listingUrl
          $emails[$anemoneCounter] = body_text.match
          $anemoneCounter += 1
          hasListing = true
        else
        end
      end
    end
    return hasListing
  end
end
emailGrab = GetEmails.new()
emailGrab.has_email?("http://genuinestoragesheds.com/contact/")
puts $emails[0]
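Answer 0 (score: 1)
So here is a working version of the code. It uses a single regex to find the string containing the email, then three more regexes to clean it up.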
#get_emails.rb
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'mechanize'
require 'uri'
require 'anemone'
class GetEmails
  def initialize
    # Initialize each variable separately; `a, b = 0` would leave b as nil.
    @urlCounter = 0
    $anemoneCounter = 0
    $allUrls = []
    $emailUrls = []
    $emails = []
  end

  # Strip attribute names (href=), URI schemes (mailto:) and surrounding quotes.
  def email_clean(email)
    email = email.gsub(/(\w+=)/, "")
    email = email.gsub(/(\w+:)/, "")
    # Use gsub rather than gsub!: gsub! returns nil when nothing is replaced.
    email = email.gsub(/\A"|"\Z/, '')
    return email
  end

  def has_email?(listingUrl)
    hasListing = false
    Anemone.crawl(listingUrl) do |anemone|
      anemone.on_every_page do |page|
        body_text = page.body.to_s
        # The original pattern was anchored with \A and \z, so it only matched
        # when the whole body was a single address:
        #matchOrNil = body_text.match(/\A[^@\s]+@[^@\s]+\z/)
        matchOrNil = body_text.match(/[^@\s]+@[^@\s]+/)
        if matchOrNil != nil
          $emailUrls[$anemoneCounter] = listingUrl
          $emails[$anemoneCounter] = matchOrNil[0] # store the matched String
          $anemoneCounter += 1
          hasListing = true
        end
      end
    end
    return hasListing
  end
end
emailGrab = GetEmails.new()
found_email = "href=\"mailto:genuinestoragesheds@gmail.com\""
puts emailGrab.email_clean(found_email)
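Besides the regex, the working version also gives every variable its own assignment, where the question's `initialize` used parallel assignment. That change seems to matter: in Ruby, parallel assignment with a single value on the right fills only the first variable. A minimal sketch of the difference follows; the method and variable names (`parallel_init`, `separate_init`, `counter_a`, and so on) are made up for illustration and are not part of the original script.

```ruby
# Hypothetical helper mirroring the question's initialize:
# one value on the right, several variables on the left.
def parallel_init
  counter_a, counter_b = 0   # only counter_a receives 0
  list_a, list_b = []        # an empty array has nothing to hand out
  [counter_a, counter_b, list_a, list_b]
end

# Hypothetical helper mirroring the answer's initialize:
# every variable gets its own value.
def separate_init
  counter_a = 0
  counter_b = 0
  list_a = []
  list_b = []
  [counter_a, counter_b, list_a, list_b]
end

p parallel_init  # => [0, nil, nil, nil]
p separate_init  # => [0, 0, [], []]
```

With the parallel form, indexing into one of the nil "arrays" raises a NoMethodError, so the script would likely fail even once the regex matches.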
Answer 1 (score: 0)
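The problem is the \A and \z anchors. Obviously the web page contains more than just an email string, or there would be no need for a regex test at all.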
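The effect of the anchors can be checked without crawling anything. The sketch below assumes two made-up inputs, `BARE_ADDRESS` and `PAGE_BODY` (the address `someone@example.com` is a placeholder, not taken from the real page), and compares the anchored pattern from the question with the same pattern without anchors.

```ruby
# The question's pattern: \A and \z pin the match to the whole string.
ANCHORED = /\A[^@\s]+@[^@\s]+\z/
# The same pattern with the anchors removed.
UNANCHORED = /[^@\s]+@[^@\s]+/

# Placeholder inputs, assumed for illustration only.
BARE_ADDRESS = "someone@example.com"
PAGE_BODY = "<p>Write to someone@example.com today</p>"

# The anchored pattern accepts a string that is nothing but an address...
p ANCHORED.match?(BARE_ADDRESS)   # => true
# ...but rejects a page body, because the body contains other text.
p ANCHORED.match?(PAGE_BODY)      # => false
# Without anchors, the address is found inside the larger text.
p UNANCHORED.match(PAGE_BODY)[0]  # => "someone@example.com"
```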
You can simplify it to /[^@\s]+@[^@\s]+/, but you will still need to clean up the matched string to extract the email.
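One possible way to reduce the cleanup is to tighten the pattern so that it only matches address-like characters in the first place. The sketch below is an assumption on my part rather than what either answer proposes: `EMAIL_PATTERN` is a deliberately simplified pattern (not a full RFC 5322 validator), and `extract_emails` and `SAMPLE_HTML` are hypothetical names used for illustration.

```ruby
# Simplified, assumed pattern: a local part, an @, a dotted domain.
# It will miss some valid addresses and accept some invalid ones.
EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/

# Hypothetical helper: return every distinct address-like string in text.
def extract_emails(text)
  text.scan(EMAIL_PATTERN).uniq
end

# Hard-coded snippet standing in for a crawled page body.
SAMPLE_HTML = '<a href="mailto:genuinestoragesheds@gmail.com">' \
              'genuinestoragesheds@gmail.com</a>'

p extract_emails(SAMPLE_HTML)  # => ["genuinestoragesheds@gmail.com"]
```

Because the character classes exclude quotes, colons and angle brackets, the surrounding `href="mailto:` and `">` never become part of the match, and `scan` collects every occurrence on the page instead of only the first one.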