I'm trying to break my Scrapy crawl output into a list of posts so I can debug it.
Here is my code:
import re
import bisect

post_list = []
with open('last_crawl_output.txt', 'r') as f:
    crawl_output = f.read()
# Find each 'referer', which marks the start of a scrapy crawl
# AFTER the initial crawl of the search results page.
referer_matches = re.finditer("referer", crawl_output)
referer_list = [m.start(0) for m in referer_matches]
# Find each 'scrapy', which indicates a finished crawl.
scrapy_matches = re.finditer("scrapy", crawl_output)
closing_list = [m.start(0) for m in scrapy_matches]
# Drop the first 'referer' (it belongs to the search results page).
del referer_list[0]
for pos1 in referer_list:
    # bisect gives the index of the first 'scrapy' after this 'referer';
    # the post runs up to the occurrence after that one.
    pos2_index = bisect.bisect(closing_list, pos1)
    # Slice the post out of the log, trimming 21 trailing characters.
    pos2 = closing_list[pos2_index + 1]
    post = crawl_output[pos1:pos2 - 21]
I also tried post_list.append(post) inside the loop, but to no avail.
[edit]
Here is some sample output.
What I want to add to post_list is here.
This is what I get; here is the post as it ends up in post_list: output
When I use insert, it comes out broken up by \n.
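For reference, this is ordinary Python behaviour: printing a list shows the repr of each element, so real newlines display as literal \n escapes. A minimal sketch with made-up post text (the real posts come from the crawl log):

post_list = []
# Hypothetical stand-in for a slice of crawl_output.
post_list.append("Crawled (200) <GET ...>\n(referer: ...)")
print(post_list)     # shows the repr: ['Crawled (200) <GET ...>\n(referer: ...)']
print(post_list[0])  # printing the string itself renders a real line break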
Answer 0 (score: 0)

I decided to work around the list problem like this:
# Splits post by newline, adds to list
post_lines = post.split('\n')
# Add the words "Next Post" to differentiate each post.
post_lines.append('Next Post')
# Print each line, and get clean formatting.
for line in post_lines:
    print(line)
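Dropped into the loop from the question, that might look like the sketch below (all_lines is a new name introduced here; referer_list, closing_list, and crawl_output are as defined above):

all_lines = []
for pos1 in referer_list:
    pos2_index = bisect.bisect(closing_list, pos1)
    pos2 = closing_list[pos2_index + 1]
    post = crawl_output[pos1:pos2 - 21]
    # Collect every post's lines into one flat, printable list.
    all_lines.extend(post.split('\n'))
    all_lines.append('Next Post')
for line in all_lines:
    print(line)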
Answer 1 (score: 0)

A better solution is to add the posts to a dictionary. This preserves the formatting and uses less code:
post_count = 0
post_dict = {}
for pos1 in referer_list:
    post_count += 1
    pos2_index = bisect.bisect(closing_list, pos1)
    pos2 = closing_list[pos2_index + 1]
    post = crawl_output[pos1:pos2 - 21]
    # Keyed by post number; the string keeps its embedded newlines.
    post_dict[post_count] = post
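To read the posts back out with their formatting intact, print the values directly; print renders the string itself rather than its repr, so the embedded newlines come out as real line breaks. A quick sketch (the separator line is just for illustration):

for count in sorted(post_dict):
    print('--- Post %d ---' % count)
    print(post_dict[count])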