Speeding up a parser: importing HTML into a database

Time: 2018-07-18 19:02:33

Tags: python html beautifulsoup

I need to insert all HTML tags and their attributes into a database.

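from bs4 import BeautifulSoup

# Load the page in the driver and take the rendered HTML
el.driver.get(url_page)
txthtml = el.driver.page_source
soup = BeautifulSoup(txthtml, "html.parser")
body = soup.find('html')  # root <html> element
# Start the recursion at level 0, child number 0, parent_id 0
html_parse(body, el, url_page_id, 0, 0, 0, url_page)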

def html_parse(html, el, url_page_id, level, i, parent_id, url_page):
    # Text of the whole subtree, with line breaks and tabs removed
    txt = ""
    if len(html.text) > 0:
        txt = html.text.replace("\n", "").replace("\t", "").replace("\r", "")
    # One row per tag
    ta = tag_list()
    ta.p_id = el.id
    ta.page_id = url_page_id
    ta.level = level
    ta.number = i
    ta.txt = txt
    ta.name = html.name
    ta.parent_id = parent_id
    ta.html = str(html)  # serialized markup of the whole subtree
    ta.save()
    # Store this tag's attributes, linked to the row just saved
    insert_attr(html, el.id, url_page_id, ta.id, url_page)
    children = list(html.children)
    j = 0
    for child in children:
        # Skip text nodes; only real tags have a name
        if child.name is None:
            continue
        j = j + 1
        html_parse(child, el, url_page_id, level + 1, j, ta.id, url_page)

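Here html_parse is a recursive function, where:

  • html - the current HTML object
  • el - the driver class
  • url_page_id - the page ID
  • level - the level in the DOM
  • i - the child number
  • parent_id - the ID of the parent
  • url_page - the current URL
  • tag_list - inserts the current tag
  • insert_attr - inserts the tag's attributes into the database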

Each individual html_parse call runs quickly, but parsing a complete large HTML page takes about 4-5 minutes.
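
To see where those minutes go, the whole recursive run can be profiled once. The following is only a minimal sketch using the standard cProfile and pstats modules; profile_call is a hypothetical helper name, not part of the code above.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile; return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, out.getvalue()


# Hypothetical usage with the names from the question:
# _, report = profile_call(html_parse, body, el, url_page_id, 0, 0, 0, url_page)
# print(report)
```

If the report is dominated by the save() calls, the database round trips are the bottleneck; if it is dominated by text extraction or serialization, the repeated html.text and str(html) calls on every subtree are the more likely cause.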

How can I speed up this code?
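
One possible direction, sketched under several assumptions: html.text and str(html) both walk the entire subtree of every node, and every ta.save() is a separate database round trip, so the total work grows much faster than the number of tags. The sketch below walks the tree once, collects plain tuples, and writes them in a single transaction. It uses sqlite3 as a stand-in because the ORM behind tag_list is not shown; collect_rows, create_tables, bulk_insert and the tag_attr table are hypothetical names. It also changes what is stored: only the text directly inside each tag, no per-tag html column, and no p_id. Row ids are assigned in Python, which assumes no other writer inserts into the table at the same time.

```python
import sqlite3

# Characters the original code strips from the text.
STRIP_WS = str.maketrans("", "", "\n\t\r")


def collect_rows(root, page_id, start_id=1):
    """Walk the tree once and return (tag_rows, attr_rows); no DB access here."""
    tag_rows, attr_rows = [], []
    next_id = start_id
    # Each stack entry: (node, level, position among sibling tags, parent row id)
    stack = [(root, 0, 0, 0)]
    while stack:
        node, level, number, parent_id = stack.pop()
        tag_id, next_id = next_id, next_id + 1
        children = list(node.children)
        # Direct string children only (in BeautifulSoup their name is None;
        # this also includes comment nodes).
        own_text = "".join(
            str(c) for c in children if getattr(c, "name", None) is None
        ).translate(STRIP_WS)
        tag_rows.append(
            (tag_id, page_id, level, number, own_text, node.name, parent_id)
        )
        for key, value in node.attrs.items():
            # BeautifulSoup returns a list for multi-valued attributes like class.
            if isinstance(value, list):
                value = " ".join(value)
            attr_rows.append((tag_id, key, str(value)))
        tags = [c for c in children if getattr(c, "name", None) is not None]
        # Push in reverse so tags are popped, and given ids, in document order.
        for pos in range(len(tags), 0, -1):
            stack.append((tags[pos - 1], level + 1, pos, tag_id))
    return tag_rows, attr_rows


def create_tables(conn):
    """Hypothetical schema; the real tables behind tag_list are not shown."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tag_list (id INTEGER PRIMARY KEY, "
        "page_id INTEGER, level INTEGER, number INTEGER, txt TEXT, "
        "name TEXT, parent_id INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tag_attr (tag_id INTEGER, name TEXT, value TEXT)"
    )


def bulk_insert(conn, tag_rows, attr_rows):
    """Write everything in one transaction instead of one save() per tag."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT INTO tag_list VALUES (?, ?, ?, ?, ?, ?, ?)", tag_rows
        )
        conn.executemany("INSERT INTO tag_attr VALUES (?, ?, ?)", attr_rows)


# Hypothetical usage with the names from the question:
# tag_rows, attr_rows = collect_rows(soup.find("html"), url_page_id)
# bulk_insert(conn, tag_rows, attr_rows)
```

If the full subtree text or markup really is needed for every tag, the batching part still applies on its own; most ORMs offer a bulk insert or an explicit transaction that serves the same purpose as executemany here.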

0 answers:

No answers