Beautiful Soup: excluding inner <li> and <ul> tags from .getText() on a parent <li> tag

Time: 2017-04-18 23:47:07

Tags: python web-scraping beautifulsoup anaconda

OK, the data I'm trying to get looks like this:

  <li class="expandable"> Criminal
    <ul class="subPracticeAreas" style="display:none">
        <li> Appellate</li>
        <li>Crimes against the person</li>
        <li> Drugs</li>
        <li>Environmental and planning offences</li>
        <li> Extradition</li>
        <li>Fraud</li>
        <li> Juvenile justice</li>
        <li>Mental illness</li>
        <li> Proceeds of crime / money laundering</li>
        <li>Property offences</li>
        <li> Sexual assault</li>
        <li>Traffic</li>
        <li> White collar and corporate crime</li>
        <li>Work health and safety</li>
    </ul>
  </li>
  <li class="expandable"> Appellate
    <ul class="subPracticeAreas" style="display:none">
        <li> Civil appeals</li>
        <li>Criminal appeals</li>
    </ul>
  </li>
  <li class="expandable"> Inquests / inquiries
    <ul class="subPracticeAreas" style="display:none">
        <li> Commissions and other Inquiries</li>
        <li>Coronial inquests</li>
    </ul>
  </li>

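So I want to be able to accomplish these goals:

  1. Grab the text of the parent li tag and store it as a variable (to use as a dictionary key); e.g., in the first list I only want to grab "Criminal".
  2. Grab the text of each child li tag (individually) and store it as a value under the key "Criminal" (as above).
  3. Rinse and repeat the process for each li class="expandable" section.

What I have done so far (which, as you can imagine, doesn't work):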

    aop_list_headers = page_soup.findAll("li",{"class":"expandable"})
    
    for aop_list in aop_list_headers:
        aop_key_name = aop_list.getText().strip()
    

This returns all of the text inside the respective parent li (e.g., for the first iteration of the loop above, I get the following):

    CriminalAppellateCrimes against the personDrugsEnvironmental and planning offencesExtraditionFraudJuvenile justiceMental illnessProceeds of crime/money launderingProperty offencesSexual assaultTrafficWhite collar and corporate crimeWork health and safety
    

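How do I stop it from running through every item? (As I see it, this happens because the parent li wraps the whole list.)

I haven't included how I would accomplish the second goal (above), because I'm stuck on the first one.

Any help is much appreciated. Thanks in advance.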

2 Answers:

Answer 0: (score: 1)

You can use the recursive flag with find_all to access all the child elements for your expected dict keys:

# recursive=False limits the search to direct children of `soup`
children = soup.find_all("li", { "class" : "expandable" }, recursive=False)
for child in children:
    print(child.getText())
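
To make the `recursive=False` idea concrete for the markup in the question, here is a minimal sketch rather than a definitive solution. It assumes a trimmed copy of the question's HTML wrapped in an outer `<ul>` (the wrapper, the `areas` dict, and the variable names are illustrative, not from the original post). It reads only the parent's own text with `find(string=True, recursive=False)` and then calls `find_all("li", recursive=False)` on the nested `<ul>`, so that only its direct children are returned.

```python
from bs4 import BeautifulSoup

# Trimmed sample of the question's markup; the outer <ul> is an assumption.
html = """
<ul>
  <li class="expandable"> Criminal
    <ul class="subPracticeAreas" style="display:none">
      <li>Appellate</li>
      <li>Fraud</li>
    </ul>
  </li>
  <li class="expandable"> Appellate
    <ul class="subPracticeAreas" style="display:none">
      <li>Civil appeals</li>
      <li>Criminal appeals</li>
    </ul>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
areas = {}
for parent in soup.find_all("li", class_="expandable"):
    # First text node that is a direct child of the parent <li>,
    # i.e. the heading without any text from the nested list.
    key = parent.find(string=True, recursive=False).strip()
    sub_list = parent.find("ul", class_="subPracticeAreas")
    # recursive=False restricts the search to direct children of the <ul>.
    areas[key] = [li.get_text(strip=True)
                  for li in sub_list.find_all("li", recursive=False)]

print(areas)
```

The tree is left untouched here, which may matter if the same soup is reused later.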

Alternatively, you can get the text of every li element whose parent's (ul) parent has the class "expandable":

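def get_children(tag):
    # Match <li> tags whose grandparent is an <li class="expandable">.
    # Top-level tags have no grandparent, so guard against None.
    grandparent = tag.parent.parent if tag.parent is not None else None
    return (tag.name == 'li' and
        grandparent is not None and
        grandparent.name == 'li' and
        'expandable' in grandparent.get('class', []))

for child in soup.find_all(get_children):
    print(child.getText())  # li text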
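
The grandparent check above finds the child items but does not yet tie them to their heading. As a rough sketch of one way to do that (the sample HTML, the `is_sub_area` name, and the `grouped` dict are assumptions for illustration), each matched child can be filed under the first direct text node of its grandparent `<li>`:

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Trimmed sample of the question's markup, with no wrapper element,
# so the top-level <li> tags have no grandparent.
html = """
<li class="expandable"> Criminal
  <ul class="subPracticeAreas" style="display:none">
    <li>Drugs</li>
    <li>Traffic</li>
  </ul>
</li>
<li class="expandable"> Inquests / inquiries
  <ul class="subPracticeAreas" style="display:none">
    <li>Coronial inquests</li>
  </ul>
</li>
"""


def is_sub_area(tag):
    """True for an <li> whose grandparent is <li class="expandable">."""
    if tag.name != "li" or tag.parent is None:
        return False
    grandparent = tag.parent.parent
    return (grandparent is not None
            and grandparent.name == "li"
            and "expandable" in grandparent.get("class", []))


soup = BeautifulSoup(html, "html.parser")
grouped = defaultdict(list)
for child in soup.find_all(is_sub_area):
    # Heading = first text node directly inside the grandparent <li>.
    heading = child.parent.parent.find(string=True, recursive=False).strip()
    grouped[heading].append(child.get_text(strip=True))

print(dict(grouped))
```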

Answer 1: (score: 1)

I ended up using the extract() function in BeautifulSoup, like this:

for html in html_list:
    # Find the unwanted child <ul> element
    unwanted = html.find("ul", {"class": "subPracticeAreas"})
    # Remove the child <ul> (and everything inside it) from the tree
    unwanted.extract()

Thus turning this:

<li class="expandable"> Criminal
  <ul class="subPracticeAreas" style="display:none">
    <li> Appellate</li>
    <li>Crimes against the person</li>
    <li> Drugs</li>
    <li>Environmental and planning offences</li>
    <li> Extradition</li>
    <li>Fraud</li>
    <li> Juvenile justice</li>
    <li>Mental illness</li>
    <li> Proceeds of crime / money laundering</li>
    <li>Property offences</li>
    <li> Sexual assault</li>
    <li>Traffic</li>
    <li> White collar and corporate crime</li>
    <li>Work health and safety</li>
  </ul>
</li>

into this:

<li class="expandable"> Criminal </li>

which leaves only the parent <li> element that I needed to collect.

To complete the two tasks mentioned in the original question, I used the following code (the snippet itself is not preserved in this copy of the answer).
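
For readers who want a starting point, here is a minimal sketch of how extract() could cover both goals from the question. It is an illustration under assumptions, not the answerer's original code: the sample HTML is a trimmed copy of the question's markup, and `practice_areas` is a hypothetical name. It relies on extract() returning the element it removes, so the child items can still be read after the parent has been stripped down to its heading.

```python
from bs4 import BeautifulSoup

# Trimmed sample of the question's markup.
html = """
<li class="expandable"> Criminal
  <ul class="subPracticeAreas" style="display:none">
    <li>Fraud</li>
    <li>Traffic</li>
  </ul>
</li>
<li class="expandable"> Appellate
  <ul class="subPracticeAreas" style="display:none">
    <li>Civil appeals</li>
    <li>Criminal appeals</li>
  </ul>
</li>
"""

soup = BeautifulSoup(html, "html.parser")
practice_areas = {}
for parent in soup.find_all("li", class_="expandable"):
    # extract() removes the nested <ul> from the tree and returns it.
    sub_list = parent.find("ul", class_="subPracticeAreas").extract()
    # With the <ul> gone, the parent's text is only its own heading.
    key = parent.get_text(strip=True)
    practice_areas[key] = [li.get_text(strip=True)
                           for li in sub_list.find_all("li")]

print(practice_areas)
```

Note that extract() modifies the parsed tree in place, so this approach suits one-off scraping better than code that needs the original structure afterwards.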
    

Thanks for the input, everyone!

Cheers