So, I'm new to web scraping, and I'm having some difficulty scraping a simple JSON file and retrieving the links from it. I'm using the Scrapy framework to try to accomplish this.
My sample JSON file:
{
  "pages": [
    {
      "address": "http://foo.bar.com/p1",
      "links": ["http://foo.bar.com/p2", "http://foo.bar.com/p3", "http://foo.bar.com/p4"]
    },
    {
      "address": "http://foo.bar.com/p2",
      "links": ["http://foo.bar.com/p2", "http://foo.bar.com/p4"]
    },
    {
      "address": "http://foo.bar.com/p4",
      "links": ["http://foo.bar.com/p5", "http://foo.bar.com/p1", "http://foo.bar.com/p6"]
    },
    {
      "address": "http://foo.bar.com/p5",
      "links": []
    },
    {
      "address": "http://foo.bar.com/p6",
      "links": ["http://foo.bar.com/p7", "http://foo.bar.com/p4", "http://foo.bar.com/p5"]
    }
  ]
}
My items.py file:
import scrapy
from scrapy.item import Item, Field

class FoobarItem(Item):
    # define the fields for your item here like:
    title = Field()
    link = Field()
My spider file:
from scrapy.spider import Spider
from scrapy.selector import Selector
from foobar.items import FoobarItem

class MySpider(Spider):
    name = "foo"
    allowed_domains = ["localhost"]
    start_urls = ["http://localhost/testdata.json"]

    def parse(self, response):
        yield response.url
Ultimately I want to crawl the file and return the links from the objects without duplicates, but right now I'm struggling even to scrape the JSON at all. I thought the code above would crawl through the JSON object and return the links, but my output file is empty. I'm not sure what I'm doing wrong, but any help would be appreciated.
Answer 0 (score: -1)
First, you need a way to parse the JSON file; the json library should do that just fine. The next step is to run your crawler with the URLs.
import json

with open("myExample.json", 'r') as infile:
    contents = json.load(infile)

# contents is now a dictionary; the page objects live in a list
# under the "pages" key. Iterate through each page dictionary
# and fetch the pieces you need.
links_list = []
for item in contents["pages"]:
    for key, value in item.items():
        if isinstance(value, list):
            # "links" maps to a list of URLs
            for link in value:
                links_list.append(link)
        elif 'http' in value:
            # "address" maps to a single URL
            links_list.append(value)

# get rid of dupes
links_list = list(set(links_list))
# do the rest of your crawling with the list of links
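Since Scrapy already downloads the file for you, you don't actually need to open it from disk; the same idea can run inside the spider's parse method by decoding the response body. A minimal sketch, assuming the response body is the JSON shown in the question (the helper name extract_links is hypothetical, not part of Scrapy):

```python
import json

def extract_links(json_text):
    """Collect every link from the 'pages' array, de-duplicated."""
    data = json.loads(json_text)
    links = set()
    for page in data.get("pages", []):
        links.update(page.get("links", []))
    return sorted(links)

# Inside the spider, parse() could then yield one item per link,
# instead of yielding the bare URL string:
#
#     def parse(self, response):
#         for link in extract_links(response.text):
#             item = FoobarItem()
#             item["link"] = link
#             yield item
```

Yielding items (rather than plain strings) is what lets Scrapy's feed exporters write them to the output file, which is why the original parse method produced an empty file.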