Python: adding to a dictionary in a for loop to output as JSON

Asked: 2014-12-22 14:39:57

Tags: python simplejson

I'm fairly new to Python, and I'm trying to use Beautiful Soup to scrape pages and output the results as JSON using simplejson.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import json as simplejson 

webpages = (
    "page1.html",
    "page2.html",
    "page3.html"
)

my_dict = {}

for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict['title'] = title
    my_dict['body']= str(body)

print simplejson.dumps(my_dict,indent=4)

I only get the result for the last page. Can someone tell me where I'm going wrong?

4 Answers:

Answer 0 (score: 3)

The dictionary is overwritten on every pass through the loop. Indent the print statement so that it is inside the for loop:

for webpage in webpages:

    soup = BeautifulSoup(open(webpage))

    title = soup.title.string

    body = soup.find(id="bodyText")

    my_dict['title'] = title
    my_dict['body']= str(body)

    print simplejson.dumps(my_dict,indent=4)
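
To see why only the last page survives, here is a minimal sketch that swaps the parsed pages for hard-coded (title, body) pairs. The `pages` data and the `snapshots` list are made up for illustration, and the syntax is Python 3 rather than the Python 2 used in the question.

```python
import json

# Hypothetical stand-in for the scraped pages: (title, body) pairs.
pages = [
    ("Page 1", "<p>one</p>"),
    ("Page 2", "<p>two</p>"),
    ("Page 3", "<p>three</p>"),
]

my_dict = {}
snapshots = []
for title, body in pages:
    # Assigning to the same two keys replaces the previous page's values.
    my_dict["title"] = title
    my_dict["body"] = body
    # Serializing inside the loop captures each page before it is overwritten.
    snapshots.append(json.dumps(my_dict, sort_keys=True))

# After the loop, my_dict holds only the last page.
final = json.dumps(my_dict, sort_keys=True)
print(final)
```

Keep in mind that printing inside the loop emits several JSON objects one after another, which together do not form a single valid JSON document.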

Answer 1 (score: 1)

results = [] # you need a list to collect all dictionaries

for webpage in webpages:

    soup = BeautifulSoup(open(webpage))
    this_dict = {}
    this_dict['title'] = soup.title.string
    this_dict['body'] = str(soup.find(id="bodyText"))  # a Tag is not JSON serializable, so convert it to str
    results.append(this_dict)

print simplejson.dumps(results, indent=4)

However, I have a feeling that what you actually want is a dictionary whose keys are the page titles and whose values are the bodies:

results = {}

for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    results[soup.title.string] = str(soup.find(id='bodyText'))  # str() so json can serialize it

print simplejson.dumps(results, indent=4)

Or using a comprehension:

soups = (BeautifulSoup(open(webpage)) for webpage in webpages)
results = {soup.title.string: str(soup.find(id='bodyText')) for soup in soups}
print simplejson.dumps(results, indent=4)
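
One thing to watch with a title-keyed dictionary is that two pages sharing a title collide. The sketch below illustrates this with hard-coded (title, body) pairs instead of parsed pages; the `pages` data and the names `by_title` and `as_list` are assumptions for illustration, and the syntax is Python 3.

```python
import json

# Hypothetical stand-in for the scraped pages; note the repeated "Home" title.
pages = [
    ("Home", "<p>a</p>"),
    ("About", "<p>b</p>"),
    ("Home", "<p>c</p>"),
]

# Keyed by title: a later duplicate title replaces the earlier entry.
by_title = {title: body for title, body in pages}

# A list of dicts keeps every page, duplicates included.
as_list = [{"title": title, "body": body} for title, body in pages]

by_title_json = json.dumps(by_title, indent=4)
as_list_json = json.dumps(as_list, indent=4)
print(by_title_json)
print(as_list_json)
```

If titles are not guaranteed to be unique, the list form is the safer of the two.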

P.S. Please excuse any mistakes; I'm writing this on my phone...

Answer 2 (score: 0)

Since you overwrite the title and body on every iteration, there are two ways to handle this:

  1. Create a list of all the dictionaries:

    all_dict = []
    for webpage in webpages:
        soup = BeautifulSoup(open(webpage))
        title = soup.title.string
        body = soup.find(id="bodyText")
        my_dict = {}  # a new dict per page, so earlier entries are not overwritten
        my_dict['title'] = title
        my_dict['body'] = str(body)
        all_dict.append(my_dict)

    for my_dict in all_dict:
        print simplejson.dumps(my_dict, indent=4)
    
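  2. Use enumerate() to build distinct title and body key names from the iteration number, e.g. title1, body1, title2, body2, and so on. That way you can keep every title and body in the same dictionary: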

    for i, webpage in enumerate(webpages, 1):  # start counting at 1 so the keys are title1, body1, ...
        soup = BeautifulSoup(open(webpage))
        title = soup.title.string
        body = soup.find(id="bodyText")
        my_dict['title'+str(i)] = title
        my_dict['body'+str(i)]= str(body)
    
    print simplejson.dumps(my_dict,indent=4)
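
When collecting dictionaries in a list, as in option 1, each iteration needs its own dict object, because `list.append` stores a reference rather than a copy. The sketch below shows the difference using hard-coded (title, body) pairs; the `pages` data and the names `shared`, `aliased`, and `separate` are made up for illustration, and the syntax is Python 3.

```python
# Hypothetical stand-in for the scraped pages: (title, body) pairs.
pages = [("Page 1", "<p>one</p>"), ("Page 2", "<p>two</p>")]

# Pitfall: one shared dict is mutated and appended on every pass.
shared = {}
aliased = []
for title, body in pages:
    shared["title"] = title
    shared["body"] = body
    aliased.append(shared)  # appends a reference to the same object

# Fix: build a fresh dict on every pass.
separate = []
for title, body in pages:
    separate.append({"title": title, "body": body})

aliased_titles = [d["title"] for d in aliased]
separate_titles = [d["title"] for d in separate]
print(aliased_titles)
print(separate_titles)
```

In the first loop every list entry is the same object, so all entries end up showing the last page.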
    

Answer 3 (score: -2)

Indentation works wonders in Python; you just need to indent the last line so that it is inside the for loop

from bs4 import BeautifulSoup
import json as simplejson 

webpages = (

"page1.html",
"page2.html",
"page3.html"

)

my_dict = {}

for webpage in webpages:

    soup = BeautifulSoup(open(webpage))

    title = soup.title.string

    body = soup.find(id="bodyText")

    my_dict['title'] = title
    my_dict['body']= str(body)

    print simplejson.dumps(my_dict,indent=4)

Or, if you really want all the data in one file, you could try:

my_dict['title'] = my_dict.get("title","")+","+title
my_dict['body']= my_dict.get("body","")+","+str(body)  # body is a Tag, so convert it before concatenating

So the code might look like this:

from bs4 import BeautifulSoup
import json as simplejson 

webpages = (

"page1.html",
"page2.html",
"page3.html"

)

my_dict = {}

for webpage in webpages:

    soup = BeautifulSoup(open(webpage))

    title = soup.title.string

    body = soup.find(id="bodyText")

    # list.append returns None, so append in place instead of assigning its result
    my_dict.setdefault("title", []).append(title)
    my_dict.setdefault("body", []).append(str(body))

print simplejson.dumps(my_dict,indent=4)
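
When accumulating values into lists inside a single dictionary, it helps to remember that `list.append` mutates the list in place and returns `None`, so its result should not be assigned back to the key. A minimal sketch using `dict.setdefault`, with hard-coded (title, body) pairs standing in for the parsed pages (the `pages` data and the name `combined` are assumptions for illustration, Python 3 syntax):

```python
import json

# Hypothetical stand-in for the scraped pages: (title, body) pairs.
pages = [("Page 1", "<p>one</p>"), ("Page 2", "<p>two</p>")]

# list.append changes the list in place and returns None.
returned = [].append("x")

combined = {}
for title, body in pages:
    # setdefault returns the list already stored under the key,
    # or stores and returns a new empty list on first use.
    combined.setdefault("title", []).append(title)
    combined.setdefault("body", []).append(body)

combined_json = json.dumps(combined, indent=4)
print(combined_json)
```

This produces one JSON object with a list of titles and a parallel list of bodies, matched by position.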