I'm fairly new to Python. I'm trying to scrape pages with Beautiful Soup and output the results as JSON using simplejson:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import json as simplejson

webpages = (
    "page1.html",
    "page2.html",
    "page3.html"
)

my_dict = {}

for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict['title'] = title
    my_dict['body'] = str(body)

print simplejson.dumps(my_dict, indent=4)
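I only get the result for the last page. Can anyone tell me what I'm doing wrong?
Answer 0 (score: 3)
The dictionary is overwritten on every pass through the loop. Move the print statement so that it sits inside the for loop: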
for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict['title'] = title
    my_dict['body'] = str(body)
    print simplejson.dumps(my_dict, indent=4)
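To see why only the last page shows up, here is a minimal sketch that leaves Beautiful Soup out entirely. The `pages` list of (title, body) pairs is made up for illustration and stands in for whatever the parser would return; it uses `print()` so that it also runs on Python 3.

```python
import json

# Hypothetical stand-ins for the (title, body) values parsed from each page.
pages = [
    ("Page 1", "first body"),
    ("Page 2", "second body"),
    ("Page 3", "third body"),
]

my_dict = {}
printed = []  # every JSON string produced inside the loop

for title, body in pages:
    # The same two keys are reused, so each pass replaces the previous values.
    my_dict["title"] = title
    my_dict["body"] = body
    # Dumping inside the loop captures the dictionary before it is overwritten.
    printed.append(json.dumps(my_dict, indent=4))

for chunk in printed:
    print(chunk)

# After the loop, my_dict only remembers the last page.
print(my_dict["title"])
```

Keep in mind that printing inside the loop emits one JSON document per page rather than a single combined one.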
Answer 1 (score: 1)
results = []  # you need a list to collect all dictionaries
for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    this_dict = {}
    this_dict['title'] = soup.title.string
    # str() because a Tag object is not JSON serializable
    this_dict['body'] = str(soup.find(id="bodyText"))
    results.append(this_dict)
print simplejson.dumps(results, indent=4)
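However, I have a feeling that what you actually want is a single dictionary whose keys are the page titles and whose values are the bodies:
results = {}
for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    # str() because a Tag object is not JSON serializable
    results[soup.title.string] = str(soup.find(id='bodyText'))
print simplejson.dumps(results, indent=4)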
Or using comprehensions:
soups = (BeautifulSoup(open(webpage)) for webpage in webpages)
results = {soup.title.string: str(soup.find(id='bodyText')) for soup in soups}
print simplejson.dumps(results, indent=4)
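One thing to watch with the snippets in this answer: `json.dumps` only knows how to serialize basic types (dict, list, str, numbers, booleans, None), so whatever `soup.find()` returns has to be turned into a string before dumping, either with `str()` or through the `default` argument. A minimal sketch follows; `FakeTag` is a hypothetical stand-in for a parsed element so that the example runs without Beautiful Soup installed.

```python
import json


class FakeTag(object):
    """Hypothetical stand-in for a parsed HTML element (not a bs4 class)."""

    def __init__(self, html):
        self.html = html

    def __str__(self):
        return self.html


results = {"Page 1": FakeTag('<div id="bodyText">hello</div>')}

# Without help, json.dumps refuses objects it does not recognise.
try:
    json.dumps(results)
    failed = False
except TypeError:
    failed = True

# default=str is called for any value json cannot serialize on its own.
as_json = json.dumps(results, indent=4, default=str)
print(as_json)
```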
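PS. Please excuse any mistakes, I'm writing this on my phone...
Answer 2 (score: 0)
Because you overwrite the title and body on every iteration, there are two ways to handle this:
Create a list of all the dictionaries: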
all_dict = []
for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict = {}  # a fresh dictionary for each page
    my_dict['title'] = title
    my_dict['body'] = str(body)
    all_dict.append(my_dict)
for my_dict in all_dict:
    print simplejson.dumps(my_dict, indent=4)
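Use enumerate()
Use the iteration number to create distinct title and body key names, for example title1, body1, title2, body2, and so on. That way every title and body can be stored in the same dictionary: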
for i, webpage in enumerate(webpages, 1):  # start at 1 so the keys are title1, body1, ...
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict['title' + str(i)] = title
    my_dict['body' + str(i)] = str(body)
print simplejson.dumps(my_dict, indent=4)
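A caveat when collecting dictionaries in a list: appending the same dictionary object on every pass stores several references to one object, so each iteration needs its own dictionary. A small sketch, with made-up (title, body) pairs in place of parsed pages:

```python
import json

# Hypothetical (title, body) pairs standing in for the parsed pages.
pages = [("Page 1", "first body"), ("Page 2", "second body")]

# Pitfall: one shared dict appended repeatedly.
shared = {}
aliased = []
for title, body in pages:
    shared["title"] = title
    shared["body"] = body
    aliased.append(shared)  # appends a reference, not a copy

# Fix: build a new dict on every iteration.
all_dicts = []
for title, body in pages:
    all_dicts.append({"title": title, "body": body})

print(json.dumps(aliased))
print(json.dumps(all_dicts))
```

In the first list every entry is the same object and therefore shows the last page; the second list keeps one entry per page.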
Answer 3 (score: -2)
Indentation works wonders in Python. You just need to indent the last line so that it is inside the for loop:
from bs4 import BeautifulSoup
import json as simplejson

webpages = (
    "page1.html",
    "page2.html",
    "page3.html"
)

my_dict = {}

for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    my_dict['title'] = title
    my_dict['body'] = str(body)
    print simplejson.dumps(my_dict, indent=4)
Or, if you really want all the data in a single file, you could try:
my_dict['title'] = my_dict.get("title", "") + "," + title
my_dict['body'] = my_dict.get("body", "") + "," + str(body)
So the code could look like this:
from bs4 import BeautifulSoup
import json as simplejson

webpages = (
    "page1.html",
    "page2.html",
    "page3.html"
)

my_dict = {}

for webpage in webpages:
    soup = BeautifulSoup(open(webpage))
    title = soup.title.string
    body = soup.find(id="bodyText")
    # setdefault returns the stored list; append() itself returns None
    my_dict.setdefault("title", []).append(title)
    my_dict.setdefault("body", []).append(str(body))

print simplejson.dumps(my_dict, indent=4)
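When accumulating values in a list under each key, keep in mind that `list.append` returns `None`, so its result should not be assigned back into the dictionary. `dict.setdefault` returns the stored list (creating it if needed), which makes the append safe. A sketch with made-up titles and bodies in place of parsed pages:

```python
import json

# Hypothetical (title, body) pairs standing in for the parsed pages.
pages = [("Page 1", "first body"), ("Page 2", "second body")]

# list.append mutates the list in place and returns None.
append_result = [].append("x")

my_dict = {}
for title, body in pages:
    # setdefault stores an empty list the first time, then returns that same list.
    my_dict.setdefault("title", []).append(title)
    my_dict.setdefault("body", []).append(body)

print(json.dumps(my_dict, indent=4))
```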