BeautifulSoup: recursively parsing data and maintaining structure on output

Asked: 2017-03-09 17:14:22

Tags: python json parsing beautifulsoup html-parsing

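I am trying to create a JSON file that breaks the list of all category items down into a tree structure and preserves the order in which the categories are nested (from this site: http://www.isoldwhat.com/getcats/fullcategorytree.php). Currently I have the following code to parse all the categories: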

#!/usr/bin/env python

import sys
import json
import urllib2
from pprint import pprint
from bs4 import BeautifulSoup

def dataList(element):
    categoryList = []
    try:
        # Walk every <ul> under the element passed in (not the global soup)
        for ul in element('ul', recursive=True):
            for li in ul('li', recursive=True):
                categoryList.append(li.a.contents)
            categoryList.append("new category")

        return categoryList
    except Exception:
        return ['broken!']

categories = ['20081', '550', '2984', '267', '12576', '625', '15032', '11450', '11116', '1', '58058', '293', '14339', '237', '11232', '45100', '99', '172008', '26395', '11700', '281', '11233', '619', '1281', '870', '10542', '316', '888', '64482', '260', '1305', '220', '3252', '1249']

print "\nSetting user agent...",
user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
print "DONE"

print "Setting headers...",
headers = { 'User-Agent' : user_agent }
print "DONE"

data = {}

print "Iterating through dictionary of categories\n"
for rootID in categories:
    print "Requesting source code...",
    url = 'http://www.isoldwhat.com/getcats/fullcategorytree.php?RootID=%s' % rootID
    req = urllib2.Request(url, None, headers)
    response = urllib2.urlopen(req)
    print "DONE"

    print "Turning HTML into soup..."
    text = response.read()
    soup = BeautifulSoup(text, 'html.parser')
    categorySoup = soup.find('div', id='catnumbers')
    print "DONE"

    print "Parsing data...",
    pprint(dataList(categorySoup))
    print "DONE\n"

    response.close()  # it's always safe to close an open connection
    sys.exit()

print "Turning data into JSON...",
#data = find_li(soup)
data = json.dumps(data, ensure_ascii=False)
print "DONE\n"

print "Finished doing. Enjoy!"

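The problem with this code is that it does not maintain the nested tree structure I need. How can I parse the categories while keeping them nested?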

1 Answer:

Answer 0 (score: 0)

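For example, you can create a new root with soup = BeautifulSoup("<b></b>"), then recursively append only the categories and tags to it while keeping the same structure. Something like: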

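from bs4 import BeautifulSoup, Tag

new_soup = BeautifulSoup("<b></b>", "html.parser")  # only used to create new tags

def getCategory(root):
    # Create an empty element with the same name as root.
    # You can save other info here if you want, like the category.
    copy = new_soup.new_tag(root.name)
    # Recurse into child tags only; text nodes are skipped.
    for e in root.contents:
        if isinstance(e, Tag):
            copy.append(getCategory(e))
    return copy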
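
Since the end goal is a JSON file, another option is to build nested dicts directly instead of a second soup. The following is only a minimal sketch, written for Python 3. It assumes the category tree is marked up as nested `<ul>`/`<li>` elements, where each `<li>` directly contains an `<a>` with the category name and, optionally, a child `<ul>`. The sample HTML and the names `parse_list`, `name`, and `children` are made up for illustration, and the real page's markup may differ.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page may be structured differently.
SAMPLE_HTML = """
<div id="catnumbers">
  <ul>
    <li><a href="#">Antiques</a>
      <ul>
        <li><a href="#">Asian Antiques</a>
          <ul>
            <li><a href="#">China</a></li>
          </ul>
        </li>
        <li><a href="#">Books</a></li>
      </ul>
    </li>
  </ul>
</div>
"""

def parse_list(ul):
    """Convert one <ul> into a list of {"name", "children"} dicts."""
    categories = []
    # recursive=False limits the search to direct children, so each
    # level of the tree is handled by exactly one call.
    for li in ul.find_all("li", recursive=False):
        link = li.find("a", recursive=False)
        child_ul = li.find("ul", recursive=False)
        categories.append({
            "name": link.get_text(strip=True) if link is not None else "",
            "children": parse_list(child_ul) if child_ul is not None else [],
        })
    return categories

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
root_ul = soup.find("div", id="catnumbers").find("ul", recursive=False)
tree = parse_list(root_ul)
print(json.dumps(tree, indent=2, ensure_ascii=False))
```

The difference from the code in the question is `recursive=False`: with `recursive=True`, every `<li>` at any depth is returned by the outer loop, which is what flattens the hierarchy.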

Hope this solves it :)