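I'm trying to build an email scraper that takes in a CSV file of URLs and returns them together with email addresses, including any additional URLs and addresses scraped along the way. I can't get my spider to iterate over every row in the CSV file, even though the functions it calls return fine when I test them on their own.
Here's the code; I adapted it from here: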
import os, re, csv, scrapy, logging
import pandas as pd
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from googlesearch import search
from time import sleep

# Avoid getting too many logs and warnings when using Scrapy inside Jupyter Notebook.
logging.getLogger('scrapy').propagate = False

# Extract urls from file.
def get_urls():
    urls = pd.read_csv('food_urls.csv')
    url = list(urls)
    for i in url:
        return urls

# Test it.
# get_urls()

# Create mail spider.
class MailSpider(scrapy.Spider):
    name = 'email'

    def parse(self, response):
        # Search for links inside URLs.
        links = LxmlLinkExtractor(allow=()).extract_links(response)
        # Take in a list of URLs as input and read their source codes one by one.
        links = [str(link.url) for link in links]
        links.append(str(response.url))
        # Send links from one parse method to another.
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link)

    # Pass URLs to the parse_link method; this is the method we'll apply our regex findall to look for emails
    def parse_link(self, response):
        html_text = str(response.text)
        mail_list = re.findall('\w+@\w+\.{1}\w+', html_text)
        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)
        df.to_csv(self.path, mode='a', header=False)
        df.to_csv(self.path, mode='a', header=False)

# Save emails in a CSV file
def ask_user(question):
    response = input(question + ' y/n' + '\n')
    if response == 'y':
        return True
    else:
        return False

def create_file(path):
    response = False
    if os.path.exists(path):
        response = ask_user('File already exists, replace?')
        if response == False: return
    with open(path, 'wb') as file:
        file.close()

# Combine everything
def get_info(root_file, path):
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)
    print('Collecting urls...')
    urls_list = get_urls()
    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(MailSpider, start_urls=urls_list, path=path)
    process.start()
    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)
    return df
Finally, when I call df = get_info('food_urls.csv', 'food_emails.csv'), the scraper takes a very long time to run.
Once it finished, I ran df.head() and got:
                        email                                       link
0                         NaN                                        NaN
1  alyssa@therecipecritic.com  https://therecipecritic.com/food-blogger/
2    shop@therecipecritic.com         https://therecipecritic.com/terms/
So it is working, but it only crawls the first URL in the list.
Does anyone know what I'm doing wrong?
Thanks!
Answer 0 (score: 0)
I created a Python dictionary containing a nested list and imported it:
from Base_URLS import URL_List
Then I call it like this:
def get_urls():
    urls = URL_List['urls']
    return urls
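The answer does not show the imported module itself. As a rough sketch, assuming Base_URLS.py simply defines a dictionary whose 'urls' key holds a list, it could look something like the following. The URLs are placeholders, not taken from the original post, and the dictionary is defined inline here (instead of being imported) so that the snippet runs on its own.

```python
# Hypothetical stand-in for the contents of Base_URLS.py.
# The URLs below are placeholders, not the asker's real data.
URL_List = {
    'urls': [
        'https://example.com/',
        'https://example.org/',
    ]
}


def get_urls():
    # Return a plain Python list, so the spider's start_urls
    # iterates over one URL string at a time.
    urls = URL_List['urls']
    return urls
```

The relevant point is that get_urls() now hands Scrapy an ordinary list of strings rather than a pandas object.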
Works like a charm!
Thanks to @rodrigo-nader for the help.
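A likely reason the original get_urls() only produced one URL: pd.read_csv treats the first row as a header by default, and iterating over a DataFrame yields its column labels rather than its rows. If food_urls.csv has one URL per line and no header row (an assumption, since the file is not shown), the only thing the spider would iterate over is that first line. For anyone who wants to keep reading from the CSV, here is a minimal sketch using the standard csv module; the csv_path parameter is added for illustration.

```python
import csv


def get_urls(csv_path='food_urls.csv'):
    """Return the first column of every non-empty row as a plain list.

    Assumes the CSV has no header row and one URL per line.
    """
    with open(csv_path, newline='', encoding='utf-8') as handle:
        return [
            row[0].strip()
            for row in csv.reader(handle)
            if row and row[0].strip()  # skip blank lines and empty cells
        ]
```

The returned list can be passed to process.crawl(MailSpider, start_urls=urls_list, path=path) unchanged. If the file does have a header row, the first element would need to be dropped.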