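Extracting text from a bs4 element

Time: 2018-06-21 13:00:49

Tags: python beautifulsoup

I am trying to extract some data from the following bs4 element (example below). Specifically, I want to build a loop that pulls out all the company names (and possibly the locations as well):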

The names are the "Hak Industrial ..." strings.

Output: two lists, for example

    [<div class="views-field views-field-field-overigeonderdelen"> <span class="views-label views-label-field-overigeonderdelen">Nevenvestiging: </span> <div class="field-content"><div class="wrapper hidden">
 <p>Hak Industrial Services B.V., Hoogeveen<br/>Nederland<br/> blabla useless data<br/></p><hr/>
 Hak Industrial Services B.V., Nieuw Heeten<br/>Nederland<br/>blabla useless data<br/><hr/>
 Hak Industrial Services Middle East LLC, Abu Dhabi<br/>Verenigde Arabische Emiraten<br/>blabla useless data<br/><hr/>
 Hak Industrial Services SEA Sdn. Bhd., Petaling Jaya, Selangor<br/>Maleisië<br/>blabla useless data<br/><hr/>
 Hak Industrial Services USLLC, Houston<br/>Verenigde Staten van Amerika<br/>blabla useless data<br/><hr/>
 </div>
 <a class="toggle" href="#">Toon nevenvestigingen</a></div> </div>]

[Hak Industrial Services B.V., Hak Industrial Services B.V., Hak Industrial Services Middle East LLC, Hak Industrial Services SEA Sdn. Bhd., Hak Industrial Services USLLC]

Does anyone know how to do this in bs4?

Thanks in advance

2 answers:

Answer 0: (score: 0)

I recently had to accomplish a very similar goal. I built a function to parse the HTML from emails. It looks like this:

from bs4 import BeautifulSoup as bs

def parser(data):
    # Parse the HTML and collect every text fragment, stripped of
    # surrounding whitespace, into a list.
    parsed = bs(data, "lxml")
    strings = list(parsed.stripped_strings)
    # print() with a single argument works on both Python 2 and Python 3.
    print(strings)
    return strings

Passing in the HTML gives you output like this (printed by Python 2, hence the u prefixes):

[u'[', u'Nevenvestiging:', u'Hak Industrial Services B.V., Hoogeveen', u'Nederland', u'blabla useless data', u'Hak Industrial Services B.V., Nieuw Heeten', u'Nederland', u'blabla useless data', u'Hak Industrial Services Middle East LLC, Abu Dhabi', u'Verenigde Arabische Emiraten', u'blabla useless data', u'Hak Industrial Services SEA Sdn. Bhd., Petaling Jaya, Selangor', u'Maleisi\xeb', u'blabla useless data', u'Hak Industrial Services USLLC, Houston', u'Verenigde Staten van Amerika', u'blabla useless data', u'Toon nevenvestigingen', u']']

You can probably refactor this to fit your needs more closely, but I hope it points you in the right direction.
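One possible refactoring toward the two lists asked for in the question is sketched below. It is only a sketch under assumptions that are not stated in the thread: every company string starts with the prefix "Hak Industrial", the name itself contains no ", ", and everything after the first ", " is the location. The function name `split_names_and_locations` and the shortened `HTML` sample are made up for illustration, and the built-in "html.parser" is used instead of "lxml" so that nothing extra has to be installed.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (two records only).
HTML = """<div class="wrapper hidden">
<p>Hak Industrial Services B.V., Hoogeveen<br/>Nederland<br/> blabla useless data<br/></p><hr/>
Hak Industrial Services SEA Sdn. Bhd., Petaling Jaya, Selangor<br/>Maleisië<br/>blabla useless data<br/><hr/>
</div>"""


def split_names_and_locations(html, prefix="Hak Industrial"):
    """Return (names, locations) for every text fragment starting with prefix.

    Assumption: the company name contains no ", ", so the first ", "
    separates the name from the location.
    """
    soup = BeautifulSoup(html, "html.parser")
    names, locations = [], []
    for text in soup.stripped_strings:
        if not text.startswith(prefix):
            continue  # country lines, filler text, labels
        name, _, location = text.partition(", ")
        names.append(name)
        locations.append(location)
    return names, locations


names, locations = split_names_and_locations(HTML)
print(names)
print(locations)
```

The prefix filter is the fragile part: if the real pages list companies with other names, a structural split on the `<hr/>` separators is likely the safer choice.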

Answer 1: (score: 0)

What format does the data need to end up in? I had a go at parsing it.

# coding: utf-8
from __future__ import unicode_literals
from bs4 import BeautifulSoup
from bs4 import NavigableString, Tag

html = """<div class="views-field views-field-field-overigeonderdelen"> <span class="views-label views-label-field-overigeonderdelen">Nevenvestiging: </span> <div class="field-content"><div class="wrapper hidden">
 <p>Hak Industrial Services B.V., Hoogeveen<br/>Nederland<br/> blabla useless data<br/></p><hr/>
  Hak Industrial Services B.V., Nieuw Heeten<br/>Nederland<br/>blabla useless data<br/><hr/>
   Hak Industrial Services Middle East LLC, Abu Dhabi<br/>Verenigde Arabische Emiraten<br/>blabla useless data<br/><hr/>
    Hak Industrial Services SEA Sdn. Bhd., Petaling Jaya, Selangor<br/>Maleisië<br/>blabla useless data<br/><hr/>
     Hak Industrial Services USLLC, Houston<br/>Verenigde Staten van Amerika<br/>blabla useless data<br/><hr/>
      </div>
       <a class="toggle" href="#">Toon nevenvestigingen</a></div> </div>"""

if __name__ == "__main__":
    soup = BeautifulSoup(html, "lxml")
    companies = []
    for child in soup.find("div", class_="wrapper hidden").contents:
        # Each <hr> closes one company record, so walk backwards from it.
        if isinstance(child, Tag) and child.name == "hr":
            siblings = []
            previous = child.previous_sibling
            # Collect nodes until the previous <hr> or the start of the div.
            # (The original appended the first sibling twice.)
            while previous is not None and not (
                    isinstance(previous, Tag) and previous.name == "hr"):
                siblings.append(previous)
                previous = previous.previous_sibling
            # Reverse to restore document order and keep the record.
            companies.append(siblings[::-1])
            print(siblings[::-1])
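The same idea, splitting the children of the wrapper div at each `<hr/>`, can also be written as a forward pass that keeps only the text of each record. This is a hedged sketch rather than the answerer's code: the helper name `records_between_hr` and the shortened `HTML` sample are illustrative, "html.parser" stands in for "lxml", and it assumes the first text line of each record has the form "name, location" with no ", " inside the name.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened copy of the markup from the question (two records only).
HTML = """<div class="wrapper hidden">
<p>Hak Industrial Services B.V., Hoogeveen<br/>Nederland<br/> blabla useless data<br/></p><hr/>
Hak Industrial Services SEA Sdn. Bhd., Petaling Jaya, Selangor<br/>Maleisië<br/>blabla useless data<br/><hr/>
</div>"""


def records_between_hr(html):
    """Return one list of text lines per <hr/>-terminated record."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.find("div", class_="wrapper")
    records, current = [], []
    for node in wrapper.children:
        if isinstance(node, Tag) and node.name == "hr":
            # An <hr/> ends the current record.
            if current:
                records.append(current)
            current = []
        elif isinstance(node, Tag):
            # For example the <p> around the first record; <br/> yields nothing.
            current.extend(node.stripped_strings)
        else:
            text = node.strip()
            if text:
                current.append(text)
    return records


records = records_between_hr(HTML)
# Assumption: the first line of a record is "name, location".
names = [record[0].partition(", ")[0] for record in records]
locations = [record[0].partition(", ")[2] for record in records]
print(names)
print(locations)
```

Because each record keeps all of its lines, the country (the second line in the sample data) is also available as `record[1]` if it is needed later.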