Question

OK, so the data I'm trying to get looks like this:

  <li class="expandable">Criminal
    <ul class="subPracticeAreas" style="display:none">
        <li>Appellate</li>
        <li>Crimes against the person</li>
        <li>Drugs</li>
        <li>Environmental and planning offences</li>
        <li>Extradition</li>
        <li>Fraud</li>
        <li>Juvenile justice</li>
        <li>Mental illness</li>
        <li>Proceeds of crime / money laundering</li>
        <li>Property offences</li>
        <li>Sexual assault</li>
        <li>Traffic</li>
        <li>White collar and corporate crime</li>
        <li>Work health and safety</li>
    </ul>
  </li>
  <li class="expandable">Appellate
    <ul class="subPracticeAreas" style="display:none">
        <li>Civil appeals</li>
        <li>Criminal appeals</li>
    </ul>
  </li>
  <li class="expandable">Inquests / inquiries
    <ul class="subPracticeAreas" style="display:none">
        <li>Commissions and other Inquiries</li>
        <li>Coronial inquests</li>
    </ul>
  </li>

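So I'd like to be able to achieve the following:

1. Grab the text of the parent li tag and store it as a variable (to use as a dictionary key). For example, in the first list I want to grab only "Criminal".
2. Grab the text of each child li tag (each one separately) and store it as a value under the key "Criminal" (as above).

Rinse and repeat for each li class="expandable" section.

What I've done so far (which, as you can imagine, doesn't work):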

aop_list_headers = page_soup.findAll("li",{"class":"expandable"})

for aop_list in aop_list_headers:
    aop_key_name = aop_list.getText().strip()

This returns all of the text inside the corresponding parent li. For example, on the first iteration of the loop above I get the following:

CriminalAppellateCrimes against the personDrugsEnvironmental and planning offencesExtraditionFraudJuvenile justiceMental illnessProceeds of crime/money launderingProperty offencesSexual assaultTrafficWhite collar and corporate crimeWork health and safety

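How do I stop this from running through every item? (I can see it happens because the parent li wraps around the entire list...)

I haven't included how I would achieve the second goal (as above) because I'm stuck on the first one...

All help is greatly appreciated. Thanks in advance.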

Answer 1

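You can use find_all with the recursive flag to get at the elements for your expected dict keys (recursive=False limits the search to direct children of the element it is called on):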

# recursive=False only searches the direct children of `soup`
children = soup.find_all("li", {"class": "expandable"}, recursive=False)
for child in children:
    print(child.getText())
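To illustrate what recursive=False does, here is a small sketch. It assumes the list items sit directly inside a wrapper <ul> (the id `areas` and the trimmed HTML are made up for the example). Passing the same flag together with string=True returns only the parent's own text nodes, which is one way to get the key without the nested text:

```python
from bs4 import BeautifulSoup

# Hypothetical wrapper <ul id="areas"> around a trimmed copy of the data
html = """
<ul id="areas">
  <li class="expandable">Criminal
    <ul class="subPracticeAreas"><li>Fraud</li><li>Traffic</li></ul>
  </li>
  <li class="expandable">Appellate
    <ul class="subPracticeAreas"><li>Civil appeals</li></ul>
  </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
wrapper = soup.find("ul", id="areas")

keys = []
# Only the <li> tags that are direct children of the wrapper
for parent in wrapper.find_all("li", class_="expandable", recursive=False):
    # string=True matches text nodes; recursive=False skips text in nested tags
    own_text = "".join(parent.find_all(string=True, recursive=False)).strip()
    keys.append(own_text)

print(keys)
```

Note that getText() on the parent would still include the nested items, so the text-node lookup is what keeps the key clean here.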

Alternatively, you can get the text of every li element whose parent's (the ul's) parent has the "expandable" class:

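def get_children(tag):
    # Match <li> tags whose grandparent is an <li class="expandable">
    grandparent = tag.parent.parent if tag.parent else None
    return (tag.name == 'li' and
        grandparent is not None and
        grandparent.name == 'li' and
        'expandable' in grandparent.get('class', []))

for child in soup.find_all(get_children):
    print(child.getText())  # li text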
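Building on this second approach, a sketch of how the matched children might be grouped under their parent's text to get the dictionary asked for in the question. The sample HTML is trimmed and made up, and the names `is_sub_area` and `areas` are assumptions for illustration:

```python
from bs4 import BeautifulSoup

# Trimmed, made-up sample in the same shape as the question's HTML
html = """
<li class="expandable">Criminal
  <ul class="subPracticeAreas"><li>Fraud</li><li>Traffic</li></ul>
</li>
<li class="expandable">Appellate
  <ul class="subPracticeAreas"><li>Civil appeals</li></ul>
</li>
"""
soup = BeautifulSoup(html, "html.parser")

def is_sub_area(tag):
    # True for an <li> whose grandparent is an <li class="expandable">
    grandparent = tag.parent.parent if tag.parent else None
    return (tag.name == "li"
            and grandparent is not None
            and grandparent.name == "li"
            and "expandable" in grandparent.get("class", []))

areas = {}  # hypothetical name: parent text -> list of child texts
for child in soup.find_all(is_sub_area):
    parent = child.parent.parent
    # First direct text node of the parent, i.e. its own label
    key = parent.find(string=True, recursive=False).strip()
    areas.setdefault(key, []).append(child.get_text(strip=True))

print(areas)
```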

Answer 2

I ended up using the extract() function in BeautifulSoup, like so:

for html in html_list:
    # Find the unwanted child <ul> element
    unwanted = html.find("ul", {"class": "subPracticeAreas"})
    # Remove the child <ul> (and everything inside it) from the tree
    unwanted.extract()

This turns this:

<li class="expandable"> Criminal
  <ul class="subPracticeAreas" style="display:none">
    <li>Appellate</li>
    <li>Crimes against the person</li>
    <li>Drugs</li>
    <li>Environmental and planning offences</li>
    <li>Extradition</li>
    <li>Fraud</li>
    <li>Juvenile justice</li>
    <li>Mental illness</li>
    <li>Proceeds of crime / money laundering</li>
    <li>Property offences</li>
    <li>Sexual assault</li>
    <li>Traffic</li>
    <li>White collar and corporate crime</li>
    <li>Work health and safety</li>
  </ul>
</li>

into this:

  <li class="expandable"> Criminal </li>

which leaves me with just the parent <li> element I needed to collect.

With that in place, I was able to complete both tasks mentioned in the original question.
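A minimal sketch of how both tasks might be done with extract(), not necessarily the exact code used. The trimmed HTML and the names `practice_areas` and `sub_list` are assumptions for illustration. It relies on extract() returning the element it removes, so the child items can still be read afterwards:

```python
from bs4 import BeautifulSoup

# Trimmed, made-up sample in the same shape as the question's HTML
html = """
<li class="expandable">Criminal
  <ul class="subPracticeAreas" style="display:none">
    <li>Appellate</li>
    <li>Fraud</li>
  </ul>
</li>
<li class="expandable">Inquests / inquiries
  <ul class="subPracticeAreas" style="display:none">
    <li>Coronial inquests</li>
  </ul>
</li>
"""
page_soup = BeautifulSoup(html, "html.parser")

practice_areas = {}  # parent text -> list of child texts
for parent in page_soup.find_all("li", {"class": "expandable"}):
    sub_list = parent.find("ul", {"class": "subPracticeAreas"})
    if sub_list is None:
        continue  # no nested list under this parent
    # extract() detaches the <ul> from the tree and returns it
    sub_list = sub_list.extract()
    # With the <ul> gone, only the parent's own text is left
    key = parent.get_text().strip()
    practice_areas[key] = [li.get_text().strip() for li in sub_list.find_all("li")]

print(practice_areas)
```

Keep in mind that extract() modifies the parsed tree in place, so re-parse the page if the original structure is needed again later.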

Thanks everyone for your input!

Cheers

Beautiful Soup: excluding inner <li> and <ul> tags from .getText() on the parent <li> tag