I'm new to Python, and recently I've been trying to use Scrapy to crawl a website with multiple pages. Below is a snippet from my spider.py:

def parse(self, response):
    sel = Selector(response)
    tuples = sel.xpath('//*[td[@class = "caption"]]')
    items = []
    for tuple in tuples:
        item = DataTuple()
        keyTemp = tuple.xpath('td[1]').extract()[0]
        key = html2text.html2text(keyTemp).rstrip()
        valueTemp = tuple.xpath('td[2]').extract()[0]
        value = html2text.html2text(valueTemp).rstrip()
        item[key] = value
        items.append(item)
    return items
Running the code with the following command:
scrapy crawl dumbSpider -o items.json -t json
gives:
{"a":"a-Value"},
{"b":"b-Value"},
{"c":"c-Value"},
{"a":"another-a-Value"},
{"b":"another-b-Value"},
{"c":"another-c-Value"}
But what I really want is something like this:
{"a":"a-Value", "b":"b-Value", "c":"c-Value"},
{"a":"another-a-Value", "b":"another-b-Value", "c":"another-c-Value"}
I've tried several ways to tweak spider.py, for example using a temporary list to store all the items of a single page and then appending that temporary list to items, but somehow it doesn't work.
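For reference, the kind of tweak I attempted looks roughly like this (just a sketch, not my exact code; DataTuple is my item class and I assumed it can be filled like a dict):

def parse(self, response):
    sel = Selector(response)
    rows = sel.xpath('//*[td[@class = "caption"]]')
    items = []
    tempItem = DataTuple()          # temporary container for one page
    for row in rows:
        key = html2text.html2text(row.xpath('td[1]').extract()[0]).rstrip()
        value = html2text.html2text(row.xpath('td[2]').extract()[0]).rstrip()
        tempItem[key] = value
    items.append(tempItem)          # one merged item per page
    return items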
Update: indentation fixed.
Answer 0 (score: 0)
Below I've put together a quick mock-up of how I'd recommend doing this, provided you know the number of TDs per page. You can take some or all of it as needed. It's probably over-engineered for your problem (sorry!); you could just take the chunk_by_number bit and be done.
A few things to note:

1) Avoid using 'tuple' as a variable name, since it shadows the built-in type

2) Learn to use generators/built-ins, because they are faster and lighter when you process a lot of sites at once (see parse_to_kv and chunk_by_number below)

3) Try to isolate the parsing logic, so that if it changes you can easily swap it out in one place (see extract_td below)

4) Your function doesn't use 'self', so you should add the @staticmethod decorator and remove that parameter from the function (see the short sketch right after this list)

5) Currently the output is a dict, but if you need a JSON object you can import json and dump it (also sketched below)
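To illustrate points 4 and 5, a minimal self-contained sketch (the class and helper names here are hypothetical):

import json

import html2text
import scrapy

class DumbSpider(scrapy.Spider):
    name = "dumbSpider"

    # Point 4: this helper never touches 'self', so make it a staticmethod.
    @staticmethod
    def clean(raw_html):
        return html2text.html2text(raw_html).rstrip()

# Point 5: any dict can be serialized to a JSON string via the json module.
print(json.dumps({"a": "a-Value", "b": "b-Value", "c": "c-Value"}))

The mock-up itself: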
import html2text
from scrapy.selector import Selector

def extract_td(item, index):
    # Extraction logic for my websites: pulls either a key or a value
    # out of a table cell and returns a string representation of item[index].
    # This is very page/tool specific!
    td_as_str = "td[%i]" % index
    val = item.xpath(td_as_str).extract()[0]
    return html2text.html2text(val).rstrip()
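# For example, on the first row of the sample table above (hypothetical values):
#   extract_td(row, 1) -> "a"    and    extract_td(row, 2) -> "a-Value"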
def parse_to_kv(xpaths):
    # Yields key, value pairs from the given selectors.
    # This is also page specific. Note that XPath positions are 1-based,
    # so the key lives in td[1] and the value in td[2].
    for xpath in xpaths:
        yield extract_td(xpath, 1), extract_td(xpath, 2)
def chunk_by_number(alist, num):
    # Splits alist into chunks of size num.
    # This is a very generic, reusable operation.
    for chunk in zip(*(iter(alist),) * num):
        yield chunk
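# For example, with a plain list (hypothetical input):
#   list(chunk_by_number([1, 2, 3, 4, 5, 6], 2)) -> [(1, 2), (3, 4), (5, 6)]
# Any trailing items that do not fill a complete chunk are silently dropped.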
def parse(response, td_per_page):
    # Extracts key/value pairs from the table data in response, groups them
    # into chunks of td_per_page, and prints each chunk as a dict.
    # This is very specific to our parse patterns.
    sel = Selector(response)
    tuples = sel.xpath('//*[td[@class = "caption"]]')
    kv_generator = parse_to_kv(tuples)
    for page in chunk_by_number(kv_generator, td_per_page):
        print(dict(page))
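A hypothetical way to drive it, assuming each record spans three key/value rows (matching the sample output in the question):

# 'response' is the Scrapy Response object handed to your spider callback.
parse(response, td_per_page=3)
# prints:
#   {'a': 'a-Value', 'b': 'b-Value', 'c': 'c-Value'}
#   {'a': 'another-a-Value', 'b': 'another-b-Value', 'c': 'another-c-Value'}

In an actual spider you would more likely yield dict(page) instead of printing, and json.dumps it if you need JSON strings (point 5).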