I have a list of URLs in a txt file, and also a contact_page_patterns list. I only need to check those specific pages when scraping a site for email addresses.
Please suggest some workable approaches. I am new to Python and Scrapy. Thanks in advance.
import re
import scrapy

from myproject.items import FinaltestItem  # adjust to your project's items module


class FinalspiderSpider(scrapy.Spider):
    name = "finalspider"
    # read the URL list in text mode so the start URLs are strings, not bytes
    with open("/Users/NiveRam/Documents/urllist.txt") as source_urls:
        start_urls = [url.strip() for url in source_urls.readlines()]
    contact_page_pattern = ['help', 'office', 'global', 'feedback', 'branch', 'contact', 'about']

    def parse(self, response):
        # response.text is the decoded page body (response.body is raw bytes)
        emails = re.findall(r'[\w\.-]+@[\w\.-]+', response.text)
        story = FinaltestItem()
        story["url"] = response.url
        story["title"] = response.xpath("//title/text()").extract()
        story["email"] = emails
        return story
This retrieves emails from the entire body of the page and outputs emails such as: [info@abc.com, infor@abc.com, yourname@abc.com]
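For context, here is a minimal standalone sketch (the sample string is made up) showing what that regex captures from a chunk of page text:

import re

# hypothetical sample text standing in for a page body
sample = "Contact us at info@abc.com or yourname@abc.com for details."
emails = re.findall(r'[\w\.-]+@[\w\.-]+', sample)
print(emails)  # ['info@abc.com', 'yourname@abc.com']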
Answer 0 (score: -1)
You can access the current URL through the url attribute of the response object:
# assuming the same imports (re, scrapy, FinaltestItem) as in the spider above
class MySpider(scrapy.Spider):
    url_keywords = ['stackoverflow', 'tea']

    def parse(self, response):
        story = FinaltestItem()
        # check if any of the defined keywords can be found in response.url
        get_email = any(k in response.url for k in self.url_keywords)
        if get_email:  # if yes, extract and add the emails
            emails = re.findall(r'[\w\.-]+@[\w\.-]+', response.text)
            story["email"] = emails
        story["url"] = response.url
        story["title"] = response.xpath("//title/text()").extract()
        return story
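Another option, not part of the answer above, is to only request the contact-like pages in the first place by following links whose URLs match the patterns. Below is a minimal sketch assuming Scrapy's CrawlSpider and LinkExtractor; the class name ContactSpider is made up, and the file path and item class are carried over from the question:

import re
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from myproject.items import FinaltestItem  # adjust to your project's items module


class ContactSpider(CrawlSpider):
    name = "contactspider"
    with open("/Users/NiveRam/Documents/urllist.txt") as f:
        start_urls = [url.strip() for url in f.readlines()]

    contact_page_pattern = ['help', 'office', 'global', 'feedback', 'branch', 'contact', 'about']

    # follow only links whose URL matches one of the contact keywords,
    # and parse those pages with parse_contact (CrawlSpider reserves parse())
    rules = (
        Rule(LinkExtractor(allow=contact_page_pattern), callback='parse_contact'),
    )

    def parse_contact(self, response):
        story = FinaltestItem()
        story["url"] = response.url
        story["title"] = response.xpath("//title/text()").extract()
        story["email"] = re.findall(r'[\w\.-]+@[\w\.-]+', response.text)
        return story

One caveat with this approach: only pages whose link URL contains one of the keywords get requested at all, so contact pages reachable only through non-matching URLs would be missed.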