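I'm trying to process a JSON file with BeautifulSoup, but I don't know how to do it.
Below is a copy of the JSON. I want to go through each id in the JSON and extract certain data from it. Would anyone suggest a different route?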
{
"line_type":"Test",
"title":"Test Test Test",
"timestamp":"201310200000",
"line": [
{
"id":10,
"text": "<h1 id=\"r021\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":9,
"text": "<h1 id=\"r023\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":8,
"text": "<h1 id=\"r024\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":7,
"text": "<h1 id=\"r026\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":6,
"text": "<h1 id=\"r027\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":5,
"text": "<h1 id=\"r028\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":4,
"text": "<h1 id=\"r029\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":3,
"text": "<h1 id=\"r031\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":2,
"text": "<h1 id=\"r032\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":1,
"text": "<h1 id=\"r035\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } ]
}
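Thanks in advance - Hyflex
Answer 0 (score: 3)
I'm pretty sure this does what you need: for each line, it loads the 'text' attribute into BeautifulSoup and then pulls out all the attributes you might want. You can generalize it to whatever behavior you want; it should be quite readable.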
import json
try:
from BeautifulSoup import BeautifulSoup
except ImportError:
from bs4 import BeautifulSoup
myjson = r"""{
"line_type":"Test",
"title":"Test Test Test",
"timestamp":"201310200000",
"line": [
{
"id":10,
"text": "<h1 id=\"r021\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":9,
"text": "<h1 id=\"r023\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":8,
"text": "<h1 id=\"r024\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":7,
"text": "<h1 id=\"r026\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":6,
"text": "<h1 id=\"r027\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":5,
"text": "<h1 id=\"r028\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":4,
"text": "<h1 id=\"r029\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":3,
"text": "<h1 id=\"r031\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":2,
"text": "<h1 id=\"r032\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":1,
"text": "<h1 id=\"r035\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } ]
}"""
data = json.loads(myjson)
for l in data['line']:
    soup = BeautifulSoup(l['text'])
    # print(soup.prettify())
    # Get the H1 ID
    print(soup.findAll('h1')[0]['id'])
    # Get the text
    print(soup.findAll('h1')[0].contents[0].strip())
    # Get the <a> href
    print(soup.findAll('a')[0]['href'])
    # Get the <a> class
    print(soup.findAll('a')[0]['class'])
    # Get the <a> text
    print(soup.findAll('a')[0].contents[0].strip())
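One way to generalize the loop above is to wrap the extraction in a function that returns one dictionary per entry instead of printing. This is only a minimal sketch, assuming Python 3 with bs4 installed. The function name `extract_lines` and the dictionary keys are made up for illustration, and the sample is a shortened two-entry version of the JSON from the question.

```python
import json
from bs4 import BeautifulSoup

# Shortened sample with the same shape as the JSON in the question.
SAMPLE_JSON = r"""{
  "line": [
    {"id": 10, "text": "<h1 id=\"r021\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n"},
    {"id": 9, "text": "<h1 id=\"r023\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n"}
  ]
}"""


def extract_lines(raw_json):
    """Hypothetical helper: return one dict of extracted fields per 'line' entry."""
    results = []
    for entry in json.loads(raw_json)["line"]:
        # Name the parser explicitly so bs4 does not have to guess one.
        soup = BeautifulSoup(entry["text"], "html.parser")
        h1 = soup.find("h1")
        a = soup.find("a")
        results.append({
            "id": entry["id"],
            "h1_id": h1.get("id") if h1 else None,
            # First non-blank string inside the <h1>, with whitespace stripped.
            "title": next(h1.stripped_strings, None) if h1 else None,
            "href": a.get("href") if a else None,
            "link_class": a.get("class") if a else None,
            "link_text": a.get_text(strip=True) if a else None,
        })
    return results


rows = extract_lines(SAMPLE_JSON)
for row in rows:
    print(row)
```

Returning dictionaries keeps the parsing separate from whatever is done with the results (printing, writing to a file, filtering by id). The `if h1 else None` guards are there because an entry might not contain the expected tag.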
Answer 1 (score: 2)
You can't process JSON data with BeautifulSoup. You can use the json module instead, like this:
import json
from pprint import pprint
json_data = r"""
{
"line_type":"Test",
"title":"Test Test Test",
"timestamp":"201310200000",
"line": [
{
"id":10,
"text": "<h1 id=\"r021\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":9,
"text": "<h1 id=\"r023\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":8,
"text": "<h1 id=\"r024\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":7,
"text": "<h1 id=\"r026\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":6,
"text": "<h1 id=\"r027\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":5,
"text": "<h1 id=\"r028\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":4,
"text": "<h1 id=\"r029\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":3,
"text": "<h1 id=\"r031\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , {
"id":2,
"text": "<h1 id=\"r032\">\n Titles here <\/h3>\n\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } , **{
"id":1,
"text": "<h1 id=\"r035\">\n Titles here <\/h3>\n\n <a hre**f=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n \n" } ]
}
"""
s = json.loads(json_data)
# Print the HTML stored in each entry's 'text' field
for entry in s['line']:
    pprint(entry['text'])
Working link here. You were probably getting a ValueError because you forgot to put r in front of the string declaration.
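The effect of the r prefix can be seen in a small experiment. The sketch below (Python 3, standard library only; the variable names and the one-entry JSON are illustrative, not taken from the original data) parses the same text twice: once as a raw string, and once as a normal string where Python itself consumes the backslashes before json.loads ever sees them.

```python
import json

# Raw string: the backslashes reach json.loads untouched, so \" stays a JSON escape.
raw_text = r'{"text": "<h1 id=\"r021\">Titles here</h1>"}'

# Normal string: Python turns \" into a bare " first, which ends the JSON string early.
plain_text = '{"text": "<h1 id=\"r021\">Titles here</h1>"}'

parsed = json.loads(raw_text)
print(parsed["text"])

try:
    json.loads(plain_text)
    plain_ok = True
except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
    plain_ok = False
    print("ValueError:", exc)
```

This only matters when the JSON is pasted into the source as a string literal. JSON read from a file or an HTTP response is not affected, because Python does not apply string-literal escaping to it.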
# Imports
import json
from pprint import pprint
from bs4 import BeautifulSoup
# json_data is the raw JSON string described above
s = json.loads(json_data)
list_of_html_in_json = [entry['text'] for entry in s['line']]
soup = BeautifulSoup(" ".join(list_of_html_in_json), "html.parser")
print(soup.find_all("h1", {"id": "r035"}))  # Example
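I'm afraid I can't show you an online version of this code because it uses an external library (bs4). I assure you, though, that I have tried and tested it.
Answer 2 (score: 1)
Just my attempt: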
import requests
import json
from bs4 import BeautifulSoup
# Use requests library to get the JSON data
JSONDATA = requests.request("GET", "http://www.websitehere.com/") #Make sure you include the http part
# Load it with JSON
JSONDATA = JSONDATA.json()
# Cycle through each `line` in the JSON
for line in JSONDATA['line']:
    # Load stripped html in BeautifulSoup
    soup = BeautifulSoup(line['text'], "html.parser")
    # Prints tidy html
    print(soup.prettify())
Hope that helps :)
Answer 3 (score: 0)