Load a JSON file, fix the HTML, and load it into BeautifulSoup

Time: 2013-10-20 16:21:13

Tags: python json python-2.7 beautifulsoup

I'm trying to process a JSON file with BeautifulSoup, but I don't know how to go about it...

Below is a copy of the JSON. I'm trying to go through each id in the JSON and extract certain data from it... Would anyone suggest a different approach?

{
    "line_type":"Test",
    "title":"Test Test Test",
    "timestamp":"201310200000",
    "line": [
                                        { 
            "id":10,
            "text": "<h1 id=\"r021\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":9,
            "text": "<h1 id=\"r023\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":8,
            "text": "<h1 id=\"r024\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":7,
            "text": "<h1 id=\"r026\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":6,
            "text": "<h1 id=\"r027\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":5,
            "text": "<h1 id=\"r028\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":4,
            "text": "<h1 id=\"r029\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":3,
            "text": "<h1 id=\"r031\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":2,
            "text": "<h1 id=\"r032\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":1,
            "text": "<h1 id=\"r035\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                }                     ]
}

Thanks in advance - Hyflex

4 answers:

Answer 0: (score: 3)

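I'm fairly sure this does what you need: for each line, it loads the 'text' attribute into BeautifulSoup and then pulls out all of the attributes you might want. You can generalize it to whatever behavior you want; it should be quite readable.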

import json
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
myjson = r"""{
    "line_type":"Test",
    "title":"Test Test Test",
    "timestamp":"201310200000",
    "line": [
                                        { 
            "id":10,
            "text": "<h1 id=\"r021\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":9,
            "text": "<h1 id=\"r023\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":8,
            "text": "<h1 id=\"r024\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":7,
            "text": "<h1 id=\"r026\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":6,
            "text": "<h1 id=\"r027\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":5,
            "text": "<h1 id=\"r028\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":4,
            "text": "<h1 id=\"r029\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":3,
            "text": "<h1 id=\"r031\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":2,
            "text": "<h1 id=\"r032\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             { 
            "id":1,
            "text": "<h1 id=\"r035\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                }                     ]
}"""

data = json.loads(myjson)

for l in data['line']:
    soup = BeautifulSoup(l['text'])
    #print soup.prettify()
    # Get the H1 ID
    print soup.findAll('h1')[0]['id']
    # Get the text
    print soup.findAll('h1')[0].contents[0].strip()
    # Get the <a> href
    print soup.findAll('a')[0]['href']
    # Get the <a> class
    print soup.findAll('a')[0]['class']
    # Get the <a> text
    print soup.findAll('a')[0].contents[0].strip()

Answer 1: (score: 2)

You can't process JSON data with BeautifulSoup itself. You can use the json module instead, like this:

import json
from pprint import pprint

json_data = r"""
{
    "line_type":"Test",
    "title":"Test Test Test",
    "timestamp":"201310200000",
    "line": [
                                        {
            "id":10,
            "text": "<h1 id=\"r021\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":9,
            "text": "<h1 id=\"r023\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":8,
            "text": "<h1 id=\"r024\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":7,
            "text": "<h1 id=\"r026\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":6,
            "text": "<h1 id=\"r027\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":5,
            "text": "<h1 id=\"r028\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":4,
            "text": "<h1 id=\"r029\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":3,
            "text": "<h1 id=\"r031\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":2,
            "text": "<h1 id=\"r032\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                } ,                                             {
            "id":1,
            "text": "<h1 id=\"r035\">\n        Titles here    <\/h3>\n\n            <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n        \n"                }                     ]
}
"""

s = json.loads(json_data)

# Print the HTML text of each entry
for entry in s['line']:
    pprint(entry['text'])

Working link here. You may be getting a ValueError because you forgot to put the r prefix in front of the string declaration.
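
To make the point about the r prefix concrete, here is a minimal sketch using a shortened, made-up one-key JSON document (not the asker's full data). Without the prefix, Python itself interprets \" and \n before json.loads ever sees them, so the parser receives a bare quote and a literal newline inside a JSON string. It uses Python 3 print syntax; the variable names are only illustrative.

```python
import json

# Raw string: the backslashes reach the JSON parser untouched,
# so \" and \n are decoded by json.loads as JSON escapes.
raw = r'{"text": "<a href=\"x\">Link\n"}'
parsed = json.loads(raw)
print(repr(parsed["text"]))

# Same literal without the r prefix: Python turns \" into a bare quote
# and \n into a real newline, so the text is no longer valid JSON.
cooked = '{"text": "<a href=\"x\">Link\n"}'
try:
    json.loads(cooked)
    cooked_ok = True
except ValueError as exc:
    cooked_ok = False
    print("ValueError:", exc)
```

If the JSON is read from a file or a network response rather than pasted into the source, this problem does not arise, because no Python string literal is involved.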

# Imports
import json
from pprint import pprint
from bs4 import BeautifulSoup

# json_data is the raw JSON string defined in the snippet above
s = json.loads(json_data)
list_of_html_in_json = [entry['text'] for entry in s['line']]
soup = BeautifulSoup(" ".join(list_of_html_in_json))
print soup.find_all("h1", {"id": "r035"})  # Example

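I'm afraid that because this uses an external library (bs4), I can't show you an online version of the code. However, I assure you that I have tried and tested it.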
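
If installing bs4 is not an option, roughly the same extraction can be sketched with only the standard library. This is a swapped-in alternative, not the approach shown above: it assumes Python 3 (html.parser), uses a shortened two-entry sample in place of the full JSON, and LineParser is a hypothetical helper class. The event-based parser never checks that closing tags match, so the stray </h3> is simply ignored.

```python
import json
from html.parser import HTMLParser  # Python 3 standard library


class LineParser(HTMLParser):
    """Collect the <h1> id, the <a> href and the visible text of one fragment."""

    def __init__(self):
        super().__init__()
        self.h1_id = None
        self.href = None
        self.texts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self.h1_id = attrs.get("id")
        elif tag == "a":
            self.href = attrs.get("href")

    def handle_data(self, data):
        # Keep only non-blank text, with surrounding whitespace removed.
        if data.strip():
            self.texts.append(data.strip())


# Shortened two-entry sample in the same shape as the question's JSON.
json_data = r"""{"line": [
  {"id": 2, "text": "<h1 id=\"r032\">\n Titles here <\/h3>\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n"},
  {"id": 1, "text": "<h1 id=\"r035\">\n Titles here <\/h3>\n <a href=\"\/restofthewebsite\/here\" class=\"but-cl1\">Link<\/a>\n"}
]}"""

results = []
for entry in json.loads(json_data)["line"]:
    parser = LineParser()
    parser.feed(entry["text"])
    parser.close()  # flush any text still buffered after the last tag
    results.append((entry["id"], parser.h1_id, parser.href, parser.texts))

for row in results:
    print(row)
```

BeautifulSoup remains the more convenient choice when you need to navigate or repair the tree; this sketch only pulls out a few known attributes.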

Answer 2: (score: 1)

Just my attempt:

import requests
import json
from bs4 import BeautifulSoup

# Use requests library to get the JSON data
JSONDATA = requests.request("GET", "http://www.websitehere.com/") #Make sure you include the http part
# Load it with JSON 
JSONDATA = JSONDATA.json()

# Cycle through each `line` in the JSON
for line in JSONDATA['line']:
    # Load stripped html in BeautifulSoup
    soup = BeautifulSoup(line['text'])
    # Prints tidy html
    print soup.prettify()

Hope that helps :)
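
If the JSON sits in a local file rather than behind a URL, json.load on an open file object should give the same dictionary. A minimal sketch, assuming Python 3 and a hypothetical file name lines.json, written to a temporary directory here only so the example is self-contained:

```python
import json
import os
import tempfile

# Hypothetical sample file standing in for the asker's JSON file.
sample = {"line": [{"id": 1, "text": "<h1 id=\"r035\">Titles here</h3>"}]}
path = os.path.join(tempfile.mkdtemp(), "lines.json")
with open(path, "w") as handle:
    json.dump(sample, handle)

# json.load reads from an open file object; json.loads takes a string.
with open(path) as handle:
    data = json.load(handle)

# Collect the HTML fragment stored under each entry's 'text' key.
fragments = [entry["text"] for entry in data["line"]]
print(fragments)
```

Each string in fragments can then be passed to BeautifulSoup in the same way as line['text'] in the loop above.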

Answer 3: (score: 0)

For the latest beautifulsoup package, the import is now

from bs4 import BeautifulSoup

This will save you some trouble when you try to run the script above by Christian Ternus.