Question

I'm fairly new to Python. I'm trying to use Beautiful Soup to scrape some pages and output the results as JSON using simplejson:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import json as simplejson 

webpages = (
    "page1.html",
    "page2.html",
    "page3.html"
)

my_dict = {}

for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict['title'] = title
    my_dict['body']= str(body)

print simplejson.dumps(my_dict,indent=4)

I only get the result for the last page. Can anyone tell me what I'm doing wrong?

Answer 1

You overwrite the dictionary on every pass through the loop. Indent the print statement so that it sits inside the for loop:

for webpage in webpages:

    soup = BeautifulSoup(open(webpage))

    title = soup.title.string

    body = soup.find(id="bodyText")

    my_dict['title'] = title
    my_dict['body']= str(body)

    print simplejson.dumps(my_dict,indent=4)
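To see the overwriting in isolation, here is a minimal sketch that leaves Beautiful Soup out entirely. The `pages` list is made-up stand-in data for the parsed titles and bodies, not something from the question:

```python
import json

# Hypothetical stand-ins for the parsed pages: (title, body) pairs.
pages = [
    ("Page 1", "first body"),
    ("Page 2", "second body"),
    ("Page 3", "third body"),
]

my_dict = {}
for title, body in pages:
    # The same two keys are assigned on every pass, so each iteration
    # replaces the values stored by the previous one.
    my_dict["title"] = title
    my_dict["body"] = body

# Only the values from the final iteration are left in the dictionary.
print(json.dumps(my_dict, indent=4, sort_keys=True))
```

Printing inside the loop works around this by emitting each page before it is replaced, but the output is then several separate JSON objects rather than one document.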

Answer 2

results = [] # you need a list to collect all dictionaries

for webpage in webpages:

    soup = BeautifulSoup(open(webpage))
    this_dict = {}
    this_dict['title'] = soup.title.string
    this_dict['body'] = str(soup.find(id="bodyText"))  # convert the tag to a string so it can be serialized
    results.append(this_dict)

print simplejson.dumps(results, indent=4)

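However, I have a feeling that what you actually want is a single dictionary whose keys are the page titles and whose values are the bodies: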

results = {}

for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    results[soup.title.string] = str(soup.find(id='bodyText'))

print simplejson.dumps(results, indent=4)

Or using a comprehension:

soups = (BeautifulSoup(open(webpage)) for webpage in webpages)
results = {soup.title.string: str(soup.find(id='bodyText')) for soup in soups}
print simplejson.dumps(results, indent=4)

PS: Please excuse any mistakes; I'm typing this on my phone...
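One detail worth checking with any of these variants: the json module only encodes basic types (dicts, lists, strings, numbers, booleans and None), so an element object such as the one returned by soup.find() has to become a string before it is dumped. The sketch below illustrates this with `FakeTag`, a hypothetical stand-in class rather than a real Beautiful Soup object, and shows two ways to do the conversion:

```python
import json


class FakeTag(object):
    """Hypothetical stand-in for a parsed element; not part of bs4."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


results = {"Page 1": FakeTag('<div id="bodyText">hello</div>')}

try:
    json.dumps(results)
    failed = False
except TypeError:
    # json does not know how to encode arbitrary objects.
    failed = True

# Option 1: convert explicitly before serializing.
as_text = {title: str(body) for title, body in results.items()}
explicit = json.dumps(as_text, indent=4)

# Option 2: let json call str() on anything it cannot encode itself.
via_default = json.dumps(results, indent=4, default=str)

print(via_default)
```

Both options produce the same JSON here. The `default` argument is the less intrusive one when the dictionary is built in several places.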

Answer 3

Because you overwrite the title and body on every iteration, there are two ways to handle this:

Create a list of all the dictionaries:

all_dict = []
for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict = {}  # a fresh dictionary for each page
    my_dict['title'] = title
    my_dict['body'] = str(body)
    all_dict.append(my_dict)

for my_dict in all_dict:
    print simplejson.dumps(my_dict, indent=4)
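One thing to watch with the list approach is that every entry has to be its own dictionary object. Appending one dictionary repeatedly stores several references to the same object, so every entry ends up showing the last page. A minimal sketch with made-up `pages` data in place of the parsed HTML:

```python
import json

pages = [("Page 1", "first body"), ("Page 2", "second body")]  # hypothetical data

# Reusing one dict: the list ends up holding the same object twice.
shared = {}
aliased = []
for title, body in pages:
    shared["title"] = title
    shared["body"] = body
    aliased.append(shared)

# Creating a fresh dict per iteration keeps the entries distinct.
all_dicts = []
for title, body in pages:
    all_dicts.append({"title": title, "body": body})

print(json.dumps(all_dicts, indent=4))
```

Dumping the whole list in one call, as in the last line, also yields a single valid JSON array instead of one object per page.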

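Use enumerate() to build distinct key names from the iteration number, for example title0, body0, title1, body1, and so on (enumerate() starts counting at 0). That way you can keep every title and body in the same dictionary: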

for i,webpage in enumerate(webpages):
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict['title'+str(i)] = title
    my_dict['body'+str(i)]= str(body)

print simplejson.dumps(my_dict,indent=4)
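If the keys should start at 1 rather than 0, enumerate() accepts a start value as its second argument. A small sketch, again with made-up `pages` data instead of parsed HTML:

```python
import json

pages = [("Page 1", "first body"), ("Page 2", "second body")]  # hypothetical data

my_dict = {}
# The second argument to enumerate() sets the first index (the default is 0).
for i, (title, body) in enumerate(pages, 1):
    my_dict["title" + str(i)] = title
    my_dict["body" + str(i)] = body

print(json.dumps(my_dict, indent=4, sort_keys=True))
```

Numbered keys keep everything in one flat dictionary, though a consumer of the JSON then has to reconstruct the pairing from the key names, which a list of dictionaries gives for free.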

Answer 4

Indentation works wonders in Python. You just need to indent the last line so that it is inside the for loop:

from bs4 import BeautifulSoup
import json as simplejson 

webpages = (

"page1.html",
"page2.html",
"page3.html"

)

my_dict = {}

for webpage in webpages:

    soup = BeautifulSoup(open(webpage))

    title = soup.title.string

    body = soup.find(id="bodyText")

    my_dict['title'] = title
    my_dict['body']= str(body)

    print simplejson.dumps(my_dict,indent=4)

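Or, if you really want all of the data in a single document, you could try:

my_dict['title'] = my_dict.get("title", "") + "," + title
my_dict['body'] = my_dict.get("body", "") + "," + str(body)  # body is a tag, so convert it before concatenating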

So the code might look like this:

from bs4 import BeautifulSoup
import json as simplejson 

webpages = (

"page1.html",
"page2.html",
"page3.html"

)

my_dict = {}

for webpage in webpages:

    soup = BeautifulSoup(open(webpage))

    title = soup.title.string

    body = soup.find(id="bodyText")

    # append() returns None, so build the lists in place with setdefault()
    my_dict.setdefault("title", []).append(title)
    my_dict.setdefault("body", []).append(str(body))

print simplejson.dumps(my_dict,indent=4)
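When collecting values into lists under fixed keys, keep in mind that list.append() changes the list in place and returns None, so its result should not be assigned back to the dictionary. A minimal sketch of the pattern using dict.setdefault(), with made-up `pages` data standing in for the parsed pages:

```python
import json

pages = [("Page 1", "first body"), ("Page 2", "second body")]  # hypothetical data

# list.append() mutates in place and returns None, so assigning its
# result would store None rather than the list.
append_result = [].append("x")

my_dict = {}
for title, body in pages:
    # setdefault() stores the empty list on first use and returns the
    # stored list, so append() always extends the list in the dict.
    my_dict.setdefault("title", []).append(title)
    my_dict.setdefault("body", []).append(body)

print(json.dumps(my_dict, indent=4, sort_keys=True))
```

collections.defaultdict(list) would be an equivalent choice if the dictionary is created in one place.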

Python: adding to a dictionary in a for loop to output as JSON
