I am using beautifulsoup4 to scrape data from a website page, and I save the scrape results in a list containing a dictionary, like this:
DATA = [
TITLE, {
'IMAGES': IMAGE,
'URL_VIDEOS': URL_VIDEOS,
'DESCRIPTIONS': DESCRIPTIONS,
'SYNOPSIS': SYNOPSIS
}
]
The values IMAGE, URL_VIDEOS, DESCRIPTIONS and SYNOPSIS are variables that hold the scrape results.
I then try to save the DATA variable to a .json file with the following code:
json_file = open('result.json', 'w')
json.dump(DATA, json_file)
json_file.close()
I get a result like this:
["Action Fruits", {"IMAGES": "http://animeindo.video/wp-content/uploads/2017/07/rsz_heroin.jpg", "URL_VIDEOS": "http://www.mp4upload.com/embed-q7xxgge1yu1c.html", "DESCRIPTIONS": {"Japanese": " \u30a2\u30af\u30b7\u30e7\u30f3\u30d2\u30ed\u30a4\u30f3 \u30c1\u30a2\u30d5\u30eb\u30fc\u30c4", "\nProducer": " Diomedea", "\nType": " TV Series", "\nStatus": " Ongoing", "\nGenre": " Comedy, School, Slice of Life", "\nDurasi": " 24 min", "\nEpisode": " \u2013", "\nRating": " 6.11", "\nAdded On": " July 12th, 2017"}, "SYNOPSIS": "Japanese: \u30a2\u30af\u30b7\u30e7\u30f3\u30d2\u30ed\u30a4\u30f3 \u30c1\u30a2\u30d5\u30eb\u30fc\u30c4\nProducer: Diomedea\nType: TV Series\nStatus: Ongoing\nGenre: Comedy, School, Slice of Life\nDurasi: 24 min\nEpisode: \u2013\nRating: 6.11\nAdded On: July 12th, 2017\nSinopsis:\nPerjuangan pahlawan lokal dalam menyelamatkan daerahnya.\n"}]
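But when this runs inside the scraping loop, the result in the .json file is always overwritten: new data is not added, and the file only ever contains the latest data, like this: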
["Happy", {"IMAGES": "https://1.bp.blogspot.com/-SUq5_dpoIlM/VwpKqqsEzNI/AAAAAAAAM50/H81MUyDLZA0ctj8zo8JbuUVPPz4sxQulw/s1600/77219__1460292250_36.80.228.117.jpg", "URL_VIDEOS": "http://www.mp4upload.com/embed-ptj9hmeefar8.html", "DESCRIPTIONS": {"Japanese": " \u3042\u3093\u30cf\u30d4\u266a", "\nProducer": " Silver Link", "\nType": " TV Series", "\nStatus": " Ongoing", "\nGenre": " Comedy, School, Slice of Life", "\nDurasi": " 23 min. per ep.", "\nEpisode": " 12", "\nRating": " 7.06", "\nAdded On": " April 10th, 2016"}, "SYNOPSIS": "Japanese: \u3042\u3093\u30cf\u30d4\u266a\nProducer: Silver Link\nType: TV Series\nStatus: Ongoing\nGenre: Comedy, School, Slice of Life\nDurasi: 23 min. per ep.\nEpisode: 12\nRating: 7.06\nAdded On: April 10th, 2016\nSinopsis:\nMenceritakan kelas 1-7 di Akademi Tennomifune, di mana semua murid yang suka sial berkumpul. Hibari, salah satu murid di kelas ini, bertemu dengan si sial Hanako di hari pertama sekolah, dan bersama-sama mereka berjuang mencari hidup bahagia di sekolah mereka.\n"}]
The next result overwrites it as well...
I want to add new data and keep all scrape results in a single .json file. So, how do I do that?
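Answer 0 (score: 2)
The 'w' file mode truncates the file every time it is opened, so each write replaces the previous contents. The 'a' mode will not work properly here either, because appending one JSON document after another produces an invalid JSON file.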
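What you should do is collect the results while scraping (into a list?), and then dump them to the JSON file once, after you have finished looping over the data.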
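The approach above can be sketched roughly as follows. This is only an illustration: `scrape_page`, `scrape_all` and the example URLs are hypothetical names standing in for the real beautifulsoup4 code, which the question does not show.

```python
import json
import os
import tempfile


def scrape_page(url):
    """Hypothetical stand-in for the real beautifulsoup4 code; returns [TITLE, {...}]."""
    return ["Title of " + url, {"IMAGES": url + "/cover.jpg",
                                "URL_VIDEOS": url + "/video.html"}]


def scrape_all(urls, out_path):
    """Scrape every URL, collect the results in memory, then write the file once."""
    results = []
    for url in urls:
        results.append(scrape_page(url))  # nothing is written inside the loop
    with open(out_path, 'w') as f:
        json.dump(results, f, indent=4)   # one valid JSON array with all results
    return results


# Demo in a temporary directory; the URLs are made up for illustration.
out_path = os.path.join(tempfile.mkdtemp(), 'result.json')
scrape_all(['http://example.com/a', 'http://example.com/b'], out_path)
with open(out_path) as f:
    saved = json.load(f)
```

Because the file is opened only once, after the loop, 'w' mode is no longer a problem and the output stays a single valid JSON document.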
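Answer 1 (score: 1)
How would you select IMAGES or URL_VIDEOS by TITLE? I think your JSON is structured incorrectly, because the title is a value rather than a key. Maybe it should be in a format like this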
{
"title A" : {"IMAGES" : "IMAGE A"},
"title B" : {"IMAGES" : "IMAGE B"}
}
or
[
{"Title" : "title A", "IMAGES" : "IMAGE A"},
{"Title" : "title B", "IMAGES" : "IMAGE B"}
]
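Let's try the first format. You need to read the previous JSON and update() it with the new data. First, make sure to delete the existing result.json, since a file written by the code in the question holds a list rather than an object.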
import json
import os.path

# ....
DATA = {"Action Fruits": {"IMAGES": "a.jpg", "URL_VIDEOS": "http://a.mp4"}}
OLD_DATA = {}  # stays empty if the file does not exist yet
if os.path.isfile('result.json'):
    with open('result.json', 'r') as f:
        OLD_DATA = json.load(f)
        # e.g. {"Happy": {"IMAGES": "b.jpg", "URL_VIDEOS": "http://b.mp4"}}
# merge old and new data (for a title present in both, the old entry wins)
DATA.update(OLD_DATA)
with open('result.json', 'w') as f:
    json.dump(DATA, f)
result.json
{
"Action Fruits": {"IMAGES": "a.jpg", "URL_VIDEOS": "http://a.mp4"},
"Happy": {"IMAGES": "b.jpg", "URL_VIDEOS": "http://b.mp4"}
}
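The second format (a list of records) can be handled in a similar read, modify, rewrite cycle. The following is a minimal sketch under that assumption; `append_record` is a hypothetical helper name, not part of the original answer, and the demo writes to a temporary directory rather than the real result.json.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Load the list stored at `path` (if any), append `record`, and rewrite the file."""
    records = []
    if os.path.isfile(path):
        with open(path, 'r') as f:
            records = json.load(f)
    records.append(record)
    with open(path, 'w') as f:
        json.dump(records, f)
    return records


# Demo: two successive calls leave both records in the same file.
path = os.path.join(tempfile.mkdtemp(), 'result.json')
append_record(path, {"Title": "title A", "IMAGES": "IMAGE A"})
records = append_record(path, {"Title": "title B", "IMAGES": "IMAGE B"})
with open(path) as f:
    on_disk = json.load(f)
```

Unlike the keyed format, a list keeps duplicate titles as separate entries, so which format fits better depends on whether titles are meant to be unique.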
Answer 2 (score: 0)
I solved the problem with the following code:
with open('result.json', 'a') as outfile:
    outfile.write(json.dumps(DATA, sort_keys=True, indent=4))
I got the answer from here.
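Each run of the snippet above appends another complete JSON document to the file, so a single json.load call on the result raises an error. If the file is written this way, one possible way to read it back is to decode the documents one after another with json.JSONDecoder.raw_decode. The sketch below assumes that approach; `read_concatenated_json` is a hypothetical helper name and the sample data is shortened for illustration.

```python
import json
import os
import tempfile


def read_concatenated_json(path):
    """Return every JSON document found back-to-back in the file at `path`."""
    decoder = json.JSONDecoder()
    with open(path, 'r') as f:
        text = f.read()
    documents = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it by hand
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)  # returns (object, end index)
        documents.append(obj)
    return documents


# Demo: append two documents the same way the snippet above does.
path = os.path.join(tempfile.mkdtemp(), 'result.json')
for data in (["Action Fruits", {"IMAGES": "a.jpg"}], ["Happy", {"IMAGES": "b.jpg"}]):
    with open(path, 'a') as outfile:
        outfile.write(json.dumps(data, sort_keys=True, indent=4))

# A plain json.load fails because the file holds more than one document.
try:
    with open(path) as f:
        json.load(f)
    load_failed = False
except json.JSONDecodeError:
    load_failed = True

documents = read_concatenated_json(path)
```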