I am completely new to Python. I want to use it to scrape fax numbers.
I found some code that does something similar to what I want.
import logging
import os
import pandas as pd
from pathlib import Path
import re
import scrapy
import html_text
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from googlesearch import search
The data looks like this:
address.head(3)
   vorname          nachname              strasse    plz
0   Sigrid           Seifert       Schlegelstr. 7  10115
1    Viola           Fischer       Schlegelstr. 9  10115
2    Beate  Schmidt-Breitung  Hannoversche Str. 4  10115
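The goal is to enter the values of each row into Google, scrape the first five results, and extract the fax number.
So for the first entry, I would do: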
def get_urls(tag, n, language):
    urls = [url for url in search(tag, stop=n, lang=language)][:n]
    return urls

urls = get_urls('Sigrid Seifert Schlegelstr. 7 10115', 5, 'de')
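To run the same search for every row rather than only the first, the four fields can be joined into one query string per row. A minimal sketch, assuming each row is available as a dict (for a pandas DataFrame, `address.to_dict('records')` produces that shape); `build_query` is a hypothetical helper and not part of the code above.

```python
def build_query(row, columns=("vorname", "nachname", "strasse", "plz")):
    """Join the selected fields of one address row into a search string."""
    return " ".join(str(row[col]).strip() for col in columns)


# Sample rows copied from the data shown above
rows = [
    {"vorname": "Sigrid", "nachname": "Seifert", "strasse": "Schlegelstr. 7", "plz": 10115},
    {"vorname": "Viola", "nachname": "Fischer", "strasse": "Schlegelstr. 9", "plz": 10115},
]

# One query string per row
queries = [build_query(row) for row in rows]
print(queries[0])
```

Each string in `queries` could then be passed to `get_urls(query, 5, 'de')`.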
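class FaxSpider(scrapy.Spider):
    name = 'Fax_numbers'
    # Words that mark URLs to skip; assumed empty because the snippet never defines it
    reject = []

    def parse(self, response):
        # Collect every link on the page, plus the page itself
        links = LxmlLinkExtractor(allow=()).extract_links(response)
        links = [str(link.url) for link in links]
        links.append(str(response.url))
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link)

    def parse_link(self, response):
        # Skip pages whose URL contains a rejected word
        for word in self.reject:
            if word in str(response.url):
                return
        # Apply the pattern to the page text rather than storing the compiled pattern
        fax_numbers = re.findall('Fax:P([0-9]*)', response.text)
        dic = {'Fax_numbers': fax_numbers, 'link': str(response.url)}
        df = pd.DataFrame(dic)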
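Note that the pattern `'Fax:P([0-9]*)'` only matches digits that directly follow the literal text `Fax:P`, so a listing such as `Fax: 030 7654321` would yield nothing. Below is a minimal sketch of a more tolerant extraction. It assumes the number follows a `Fax` or `Telefax` label and consists of digits separated by spaces, slashes, or hyphens. Both `FAX_PATTERN` and `extract_fax_numbers` are assumptions for illustration, not part of the code I found.

```python
import re

# Assumed format: "Fax" or "Telefax", an optional ":" or ".", then a number
# made of digits that may be separated by spaces, slashes, or hyphens.
FAX_PATTERN = re.compile(
    r"(?:Tele)?fax\s*[:.]?\s*(\+?\d[\d\s/-]{4,}\d)",
    re.IGNORECASE,
)


def extract_fax_numbers(text):
    """Return every fax number found in text, with whitespace runs collapsed."""
    return [" ".join(match.split()) for match in FAX_PATTERN.findall(text)]


sample = "Tel: 030 1234567\nFax: 030 7654321\nTelefax 030/1112222"
print(extract_fax_numbers(sample))
```

Inside `parse_link`, such a function could be called on `response.text`. Real pages vary a lot, so the pattern would likely need tuning.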
Then I start the crawler:
process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
path = "C:/Users/X1/Desktop"
process.crawl(FaxSpider, start_urls=urls, path = path)
process.start()
But I get this error. I think it is related to the path.
ReactorNotRestartable
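For context, the Twisted reactor that Scrapy runs on can only be started once per Python process. `ReactorNotRestartable` therefore typically appears when `process.start()` is called a second time in the same session, for example when the cell or script is re-run in a notebook or IDE console, and is probably not caused by the path. One possible workaround is to run each crawl in a fresh interpreter. The sketch below shows only that mechanism using the standard library; `run_in_fresh_interpreter` is a hypothetical helper, and the code string passed to it stands in for the actual crawl script.

```python
import subprocess
import sys


def run_in_fresh_interpreter(code, *args):
    """Run Python source `code` in a new interpreter and return its stdout.

    Every call gets its own process, so a reactor started inside `code`
    never has to be restarted in the calling session.
    """
    result = subprocess.run(
        [sys.executable, "-c", code, *args],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child process fails
    )
    return result.stdout


# Stand-in for the crawl: the child process just echoes the URL it receives
output = run_in_fresh_interpreter("import sys; print(sys.argv[1])", "https://example.com")
print(output.strip())
```

In practice the spider and the `process.crawl(...)` / `process.start()` calls would live in their own script, which is then launched once per crawl. Restarting the kernel or console before re-running is the simpler alternative.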