I am working on a spider project and I have just moved to a new computer. Now I am installing everything again, and I have run into a problem with Twisted. I have read about this bug, and I have installed pywin32 and then also WinPython, but it did not help. I tried to update Twisted with this command
pip install Twisted --update
as suggested on a forum, but it says pip install has no --update option. I also ran
python python27\scripts\pywin32_postinstall.py -install
but with no success. Here is my error:
G:\Job_vacancies\Python\vacancies>scrapy crawl jobs
2015-10-06 09:12:53 [scrapy] INFO: Scrapy 1.0.3 started (bot: vacancies)
2015-10-06 09:12:53 [scrapy] INFO: Optional features available: ssl, http11
2015-10-06 09:12:53 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'vacancies.spiders', 'SPIDER_MODULES': ['vacancies.spiders'], 'DEPTH_LIMIT': 3, 'BOT_NAME': 'vacancies'}
2015-10-06 09:12:53 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
Unhandled error in Deferred:
2015-10-06 09:12:53 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "c:\python27\lib\site-packages\scrapy\commands\crawl.py", line 57, in run
    self.crawler_process.crawl(spname, **opts.spargs)
  File "c:\python27\lib\site-packages\scrapy\crawler.py", line 153, in crawl
    d = crawler.crawl(*args, **kwargs)
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 1274, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
  File "c:\python27\lib\site-packages\scrapy\crawler.py", line 71, in crawl
    self.engine = self._create_engine()
  File "c:\python27\lib\site-packages\scrapy\crawler.py", line 83, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "c:\python27\lib\site-packages\scrapy\core\engine.py", line 66, in __init__
    self.downloader = downloader_cls(crawler)
  File "c:\python27\lib\site-packages\scrapy\core\downloader\__init__.py", line 65, in __init__
    self.handlers = DownloadHandlers(crawler)
  File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\__init__.py", line 23, in __init__
    cls = load_object(clspath)
  File "c:\python27\lib\site-packages\scrapy\utils\misc.py", line 44, in load_object
    mod = import_module(module)
  File "c:\python27\lib\importlib\__init__.py", line 37, in import_module
    __import__(name)
  File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\s3.py", line 6, in <module>
    from .http import HTTPDownloadHandler
  File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\http.py", line 5, in <module>
    from .http11 import HTTP11DownloadHandler as HTTPDownloadHandler
  File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 15, in <module>
    from scrapy.xlib.tx import Agent, ProxyAgent, ResponseDone, \
  File "c:\python27\lib\site-packages\scrapy\xlib\tx\__init__.py", line 3, in <module>
    from twisted.web import client
  File "c:\python27\lib\site-packages\twisted\web\client.py", line 42, in <module>
    from twisted.internet.endpoints import TCP4ClientEndpoint, SSL4ClientEndpoint
  File "c:\python27\lib\site-packages\twisted\internet\endpoints.py", line 34, in <module>
    from twisted.internet.stdio import StandardIO, PipeAddress
  File "c:\python27\lib\site-packages\twisted\internet\stdio.py", line 30, in <module>
    from twisted.internet import _win32stdio
  File "c:\python27\lib\site-packages\twisted\internet\_win32stdio.py", line 7, in <module>
    import win32api
exceptions.ImportError: DLL load failed: The specified module could not be found.
2015-10-06 09:12:53 [twisted] CRITICAL:
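The last frames of the traceback show that what actually fails is a plain import win32api inside Twisted, so the problem can be reproduced without Scrapy. A quick diagnostic (a sketch, assuming the same c:\python27 interpreter that Scrapy uses):

python -c "import win32api"

If this raises the same "DLL load failed" ImportError, the pywin32 DLLs are not on the loader path, which is what pywin32_postinstall.py -install is meant to fix by copying them into the system directory.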
Here is my code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# encoding=UTF-8

import scrapy
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from urlparse import urlparse, urljoin
from vacancies.items import JobItem

#We need that in order to force Slovenian pages instead of English pages. It happened at "http://www.g-gmi.si/gmiweb/" that only English pages were found and no Slovenian.
#from scrapy.conf import settings
#settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',}
#settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl','en':q=0.8,}

class JobSpider(scrapy.Spider):
    name = "jobs"

    #Test sample of SLO companies.
    start_urls = [
        "http://www.g-gmi.si/gmiweb/",
    ]

    #The result of the programme is this list of job vacancy webpages.
    jobs_urls = []

    def parse(self, response):
        response.selector.remove_namespaces()

        #We take all urls, marked by "href". These are either webpages on our website or links to new websites.
        urls = response.xpath('//@href').extract()
        #Base url.
        base_url = get_base_url(response)

        #Loop through all urls on the webpage.
        for url in urls:
            #If the url points to a picture, a document, an archive ... we ignore it. We might have to change that, because some companies provide job vacancy information in PDF.
            if url.endswith((
                #images
                '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
                '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',
                #documents
                '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd',
                '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD',
                #music and video
                '.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
                '.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',
                #compressions and other
                '.zip', '.rar', '.css', '.flv', '.php',
                '.ZIP', '.RAR', '.CSS', '.FLV', '.PHP',
            )):
                continue

            #If the url includes characters like ?, %, & or #, it is likely not the one we are looking for, so we ignore it.
            #However, in this case we also exclude good urls like http://www.mdm.si/company#employment
            if any(x in url for x in ['?', '%', '&', '#']):
                continue

            #Ignore ftp.
            if url.startswith("ftp"):
                continue

            #We need to save the original url for the xpath below, in case we change it later (join it with base_url).
            url_xpath = url

            #If the url doesn't start with "http", it is a relative url, and we add the base url to get an absolute url.
            # -- It is true that we may get some strange urls, but that is fine for now.
            if not url.startswith("http"):
                url = urljoin(base_url, url)

            #We don't want to go to other websites. We want to stay on our website, so we keep only urls with the domain (netloc) of the company we are investigating.
            if urlparse(url).netloc == urlparse(base_url).netloc:

                #The main part. We look for webpages whose urls include one of the employment words as strings.
                # -- Instruction.
                # -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
                if any(x in url for x in [
                    'zaposlovanje',
                    'Zaposlovanje',
                    'zaposlitev',
                    'Zaposlitev',
                    'zaposlitve',
                    'Zaposlitve',
                    'zaposlimo',
                    'Zaposlimo',
                    'kariera',
                    'Kariera',
                    'delovna-mesta',
                    'delovna_mesta',
                    'pridruzi-se',
                    'pridruzi_se',
                    'prijava-za-delo',
                    'prijava_za_delo',
                    'oglas',
                    'Oglas',
                    'iscemo',
                    'Iscemo',
                    'careers',
                    'Careers',
                    'jobs',
                    'Jobs',
                    'employment',
                    'Employment',
                ]):
                    #This is an additional filter, suggested by Dan Wu, to improve accuracy. We check the text of the link as well.
                    texts = response.xpath('//a[@href="%s"]/text()' % url_xpath).extract()

                    #1. Texts are empty.
                    if texts == []:
                        print "Ni teksta za url: " + str(url)
                        #We found a url that includes one of the magic words.
                        #We check whether we have found the url before. If it is new, we add it to the list "jobs_urls".
                        if url not in self.jobs_urls:
                            self.jobs_urls.append(url)
                            item = JobItem()
                            #item["text"] = text
                            item["url"] = url
                            #We return the item.
                            yield item

                    #2. There are texts, one or more.
                    else:
                        #For the same partial url several texts are possible.
                        for text in texts:
                            if any(x in text for x in [
                                'zaposlovanje',
                                'Zaposlovanje',
                                'zaposlitev',
                                'Zaposlitev',
                                'zaposlitve',
                                'Zaposlitve',
                                'zaposlimo',
                                'Zaposlimo',
                                'ZAPOSLIMO',
                                'kariera',
                                'Kariera',
                                'delovna-mesta',
                                'delovna_mesta',
                                'pridruzi-se',
                                'pridruzi_se',
                                'oglas',
                                'Oglas',
                                'iscemo',
                                'Iscemo',
                                'ISCEMO',
                                'careers',
                                'Careers',
                                'jobs',
                                'Jobs',
                                'employment',
                                'Employment',
                            ]):
                                #We found a url that includes one of the magic words, and its text includes a magic word as well.
                                #We check whether we have found the url before. If it is new, we add it to the list "jobs_urls".
                                if url not in self.jobs_urls:
                                    self.jobs_urls.append(url)
                                    item = JobItem()
                                    item["text"] = text
                                    item["url"] = url
                                    #We return the item.
                                    yield item

                #We don't put an "else" branch here, because we want to further explore the employment webpage to find possible new employment webpages.
                #We keep looking for employment webpages until we reach the DEPTH_LIMIT that we have set in settings.py.
                yield Request(url, callback=self.parse)

# We run the programme on the command line with this command:
# scrapy crawl jobs -o jobs.csv -t csv --logfile log.txt
# We get two output files:
# 1) jobs.csv
# 2) log.txt
# Then we manually put one of the employment urls from jobs.csv into read.py
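For reference, JobItem is imported from vacancies.items but not shown above. A minimal sketch of vacancies/items.py that would satisfy this spider (an assumption; the actual file may define more fields):

import scrapy

class JobItem(scrapy.Item):
    #The only two fields the spider assigns: the anchor text of the link and the absolute url.
    text = scrapy.Field()
    url = scrapy.Field()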
I would be glad if you could offer some advice on how to get this running. Thanks, Marko
Answer 0 (score: 6)
You should always install things into a virtualenv. Once you have a virtualenv and it is active, do:
pip install --upgrade twisted pypiwin32
and you should get the dependency that makes Twisted support stdio on the Windows platform.
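If you don't have a virtualenv yet, a minimal setup on Windows looks roughly like this (a sketch; venv is just an example name, and the activate path depends on where you create the environment):

pip install virtualenv
virtualenv venv
venv\Scripts\activate

Then run the pip install --upgrade command above inside the activated environment.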
To get all the goodies, you might try
pip install --upgrade twisted[windows_platform]
but you may run into problems with gmp.h if you try that, and you don't need most of it for what you are trying to do.
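After installing, you can verify the fix by repeating the imports that failed in the traceback (a quick sanity check, assuming the virtualenv is active):

python -c "import win32api; from twisted.internet import _win32stdio"

If that exits without an ImportError, scrapy crawl jobs should get past the DLL error.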