I am trying to crawl pages with a Scrapy spider and then save those pages to .txt files in a readable form. The code I am using to do this is:
    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        title = hxs.select('/html/head/title/text()').extract()
        content = hxs.select('//*[@id="content"]').extract()
        texts = "%s\n\n%s" % (title, content)
        soup = BeautifulSoup(''.join(texts))
        strip = ''.join(BeautifulSoup(pretty).findAll(text=True))
        filename = ("/Users/username/path/output/Hansard-" + '%s'".txt") % (title)
        filly = open(filename, "w")
        filly.write(strip)
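I have combined BeautifulSoup here because the body text contains a lot of HTML that I don't want in the final product (primarily links), so I use BS to strip out the HTML and leave only the text that is of interest.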
This gives me output that looks like:
[u"School, Chandler's Ford (Hansard, 30 November 1961)"]
[u'
\n \n
HC Deb 30 November 1961 vol 650 cc608-9
\n
608
\n
\n
\n
\n
\xa7
\n
28.
\n
Dr. King
\n
\n asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler\'s Ford; and why he refused permission to acquire this site in 1954.\n
\n
\n
\n \n
\n
\n
\n
\xa7
\n
Sir D. Eccles
\n
\n I understand that the authority has paid \xa375,000 for this site.\n \n
Whereas I want the output to look like:
School, Chandler's Ford (Hansard, 30 November 1961)
HC Deb 30 November 1961 vol 650 cc608-9
608
28.
Dr. King asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler's Ford; and why he refused permission to acquire this site in 1954.
Sir D. Eccles I understand that the authority has paid £375,000 for this site.
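So I'm basically looking for how to remove the newline indicators (\n), tighten everything up, and convert any special characters to their normal format.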
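
A minimal sketch of one way to approach this, not a definitive fix. The literal \n, \xa7 and \xa3 sequences in the output above appear because extract() returns a list, and formatting a list with %s writes the list's repr rather than its text; joining the list first avoids that. The sketch assumes Python 3 and uses only the standard library's html.parser in place of BeautifulSoup. TextExtractor and clean_fragments are hypothetical names, and the sample fragment is made up for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def clean_fragments(fragments):
    """Turn a list of HTML fragments into tidy plain text.

    `fragments` stands in for the list that `.extract()` returns
    (an assumption made for this sketch).
    """
    parser = TextExtractor()
    # Join the list into one string before parsing; formatting the list
    # itself with %s would write its repr, escapes included.
    parser.feed("".join(fragments))
    parser.close()
    text = "".join(parser.chunks)
    # Collapse runs of whitespace inside each line, then drop empty lines.
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    # Made-up fragment, loosely shaped like the page content above.
    sample = [
        '<div id="content">\n  <p>HC Deb 30 November 1961</p>\n'
        '  <p>\n   I understand that the authority has paid'
        ' \xa375,000   for this site.\n  </p>\n</div>'
    ]
    print(clean_fragments(sample))
```

This leaves each text node on its own line, so it does not merge a speaker's name with the speech that follows or drop the § markers; that would need rules specific to the page's markup. When writing the result out, opening the file with an explicit encoding, for example io.open(filename, "w", encoding="utf-8"), should keep characters such as £ intact.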