Extracting text from MS Word files in Python using Scrapy

Date: 2014-09-05 12:47:47

Tags: windows python-2.7 ms-word scrapy screen-scraping

Below is my sample Python code that uses Scrapy to extract Word .doc and .docx files from a website.

import StringIO
import urlparse

from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider
from scrapy.item import Item, Field

from miette import DocReader

class wordSpiderItem(Item):

    link = Field()
    title = Field()
    Description = Field()

class wordSpider(CrawlSpider):

    name = "penyrheol"

    # Stay within these domains when crawling
    allowed_domains = ["penyrheol-comp.net"]
    start_urls = ["http://penyrheol-comp.net/vacancy"]


    def parse(self, response):
        listings = response.xpath('//div[@class="entry-content"]')
        links = []

        # scrape the listings page to collect the document links
        for listing in listings:
            link = listing.xpath('//div[@class="afi-document-link"]/a/@href').extract()
            links.extend(link)

        # request each listing URL to get the content of the listing page
        for link in links:
            item = wordSpiderItem()
            item['link'] = link
            if "doc" in link:
                yield Request(urlparse.urljoin(response.url, link),
                              meta={'item': item}, callback=self.parse_data)



    def parse_data(self, response):
        job = wordSpiderItem()
        job['link'] = response.url
        stream = StringIO.StringIO(response.body)
        reader = DocReader(stream)
        for page in reader.pages:
            job['Description'] = page.extractText()
            return job

I get the following error. Please take a look and let me know how to get this code working. Thanks.

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.
C:\Documents and Settings\sureshp>cd D:\Final\penyrheolcomp
C:\Documents and Settings\sureshp>d:
D:\Final\penyrheolcomp>scrapy crawl penyrheol -o testd.json -t json
2014-09-05 17:49:55+0530 [scrapy] INFO: Scrapy 0.24.2 started (bot: penyrheolcomp)
2014-09-05 17:49:55+0530 [scrapy] INFO: Optional features available: ssl, http11
2014-09-05 17:49:55+0530 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'penyrheolcomp.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['penyrheolcomp.spiders'], 'FEED_URI': 'testd.json', 'BOT_NAME': 'penyrheolcomp'}
2014-09-05 17:49:56+0530 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-09-05 17:49:56+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-09-05 17:49:56+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-09-05 17:49:56+0530 [scrapy] INFO: Enabled item pipelines:
2014-09-05 17:49:56+0530 [penyrheol] INFO: Spider opened
2014-09-05 17:49:56+0530 [penyrheol] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-09-05 17:49:56+0530 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-09-05 17:49:56+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-09-05 17:49:57+0530 [penyrheol] DEBUG: Crawled (200) <GET http://penyrheol-comp.net/vacancy> (referer: None)
2014-09-05 17:49:59+0530 [penyrheol] DEBUG: Crawled (200) <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Application-form-Teaching.doc> (referer: http://penyrheol-comp.net/vacancy)
2014-09-05 17:49:59+0530 [penyrheol] ERROR: Spider error processing <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Application-form-Teaching.doc>
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
    self.runUntilCurrent()
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 383, in callback
    self._startRunCallbacks(result)
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 491, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 578, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "penyrheolcomp\spiders\penyrheolcompnet.py", line 58, in parse_data
    reader = DocReader(stream)
  File "build\bdist.win32\egg\miette\doc.py", line 23, in __init__
  File "build\bdist.win32\egg\cfb\__init__.py", line 23, in __init__
exceptions.TypeError: coercing to Unicode: need string or buffer, instance found
2014-09-05 17:50:00+0530 [penyrheol] DEBUG: Crawled (200) <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Information-pack-Teacher-of-English-0.6-Temporary.doc> (referer: http://penyrheol-comp.net/vacancy)
2014-09-05 17:50:00+0530 [penyrheol] ERROR: Spider error processing <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Information-pack-Teacher-of-English-0.6-Temporary.doc>
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
    self.runUntilCurrent()
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 383, in callback
    self._startRunCallbacks(result)
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 491, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 578, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "penyrheolcomp\spiders\penyrheolcompnet.py", line 58, in parse_data
    reader = DocReader(stream)
  File "build\bdist.win32\egg\miette\doc.py", line 23, in __init__
  File "build\bdist.win32\egg\cfb\__init__.py", line 23, in __init__
exceptions.TypeError: coercing to Unicode: need string or buffer, instance found
2014-09-05 17:50:00+0530 [penyrheol] DEBUG: Crawled (200) <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Advert-Teacher-of-English-Temp-0.6.doc> (referer: http://penyrheol-comp.net/vacancy)
2014-09-05 17:50:00+0530 [penyrheol] ERROR: Spider error processing <GET http://penyrheol-comp.net/wp-content/uploads/2014/05/Advert-Teacher-of-English-Temp-0.6.doc>
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
    self.runUntilCurrent()
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 383, in callback
    self._startRunCallbacks(result)
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 491, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\twisted\internet\defer.py", line 578, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "penyrheolcomp\spiders\penyrheolcompnet.py", line 58, in parse_data
    reader = DocReader(stream)
  File "build\bdist.win32\egg\miette\doc.py", line 23, in __init__
  File "build\bdist.win32\egg\cfb\__init__.py", line 23, in __init__
exceptions.TypeError: coercing to Unicode: need string or buffer, instance found
2014-09-05 17:50:01+0530 [penyrheol] INFO: Closing spider (finished)
2014-09-05 17:50:01+0530 [penyrheol] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1208,
     'downloader/request_count': 4,
     'downloader/request_method_count/GET': 4,
     'downloader/response_bytes': 677942,
     'downloader/response_count': 4,
     'downloader/response_status_count/200': 4,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 9, 5, 12, 20, 1, 140000),
     'log_count/DEBUG': 6,
     'log_count/ERROR': 3,
     'log_count/INFO': 7,
     'request_depth_max': 1,
     'response_received_count': 4,
     'scheduler/dequeued': 4,
     'scheduler/dequeued/memory': 4,
     'scheduler/enqueued': 4,
     'scheduler/enqueued/memory': 4,
     'spider_exceptions/TypeError': 3,
     'start_time': datetime.datetime(2014, 9, 5, 12, 19, 56, 234000)}
2014-09-05 17:50:01+0530 [penyrheol] INFO: Spider closed (finished)
D:\Final\penyrheolcomp>

1 Answer:

Answer 0 (score: 0)

Your error is hiding in your stack trace:

reader = DocReader(stream)
File "build\bdist.win32\egg\miette\doc.py", line 23, in __init__
File "build\bdist.win32\egg\cfb\__init__.py", line 23, in __init__
exceptions.TypeError: coercing to Unicode: need string or buffer, instance found

According to https://github.com/rembish/Miette/blob/master/miette/doc.py, DocReader's __init__ takes the filename of the document you want to read — not its body.

To fix this, you can write response.body out to a temporary file and point DocReader at that temporary file instead.
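As a rough sketch of that fix (the helper name `save_doc_to_tempfile` and the cleanup with `os.remove` are my own additions, not part of the answer), the spider could dump the response body to disk before constructing the reader:

```python
import os
import tempfile


def save_doc_to_tempfile(body):
    """Write raw .doc bytes to a named temporary file and return its path.

    DocReader expects a filename rather than the document's bytes, so
    parse_data can save response.body to disk and hand over the path.
    """
    fd, path = tempfile.mkstemp(suffix='.doc')
    with os.fdopen(fd, 'wb') as handle:
        handle.write(body)
    return path
```

In parse_data this would look roughly like `path = save_doc_to_tempfile(response.body)` followed by `reader = DocReader(path)`, with an `os.remove(path)` once the text has been extracted so the temp files don't pile up.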