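Scraping text from improperly formatted tags with Beautiful Soup

Time: 2018-03-10 22:08:09

Tags: python html beautifulsoup python-requests

I am using Beautiful Soup and requests to try to scrape text from an HTML page, a sample of which is shown at the bottom of this post. I have tried using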

judge_record = judge_soup.find("div", {"class": "field__item even"})

followed by

result = judge_record.findAll("br")

to extract the text between the br tags and the bold tags.

Unfortunately, when I do this, all I get back is:

[<br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br> Private practice, Washington, D.C., 2003-2006, 2007-2010<br> Private practice, Atlanta, Georgia, 2006-2007<br> Assistant U.S. attorney, Northern District of Georgia, 2010-2014<br/></br></br></br>, <br> Private practice, Atlanta, Georgia, 2006-2007<br> Assistant U.S. attorney, Northern District of Georgia, 2010-2014<br/></br></br>, <br> Assistant U.S. attorney, Northern District of Georgia, 2010-2014<br/></br>, <br/>] [Finished in 1.0s]

Is this because the <br> tags have no matching closing tags?

Any suggestions would be greatly appreciated.

<div class="field field--name-judge-record-display field--type-ds field--label-hidden">
    <div class="field__items">
        <div class="field__item even">Born 1974  in Madison, WI

            <br><br>
            <b>Federal Judicial Service:</b>

            <br> Judge, U.S. District Court for the Middle District of Georgia</br>
            <br>Nominated by Barack Obama on March 11, 2014, to a seat vacated by W. Louis Sands. Confirmed by the Senate on November 18, 2014, and received commission on November 20, 2014. 
            <br><br>
            <b>Education:</b>

            <br> Brown University, B.A., 1997
            <br>Yale Law School, J.D., 2002

            <br><br>
            <b>Professional Career:</b>

            <br>
            <p>Law clerk, Hon. Marvin J. Garbis, U.S. District Court, District of Maryland, 2002-2003
            <br/>


            Private practice, Washington, D.C., 2003-2006, 2007-2010<br />
            Private practice, Atlanta, Georgia, 2006-2007<br />

            Assistant U.S. attorney, Northern District of Georgia, 2010-2014<br />
            </p>


</div>

1 Answer:

Answer 0 (score: 1)

To get the text inside the div tag, you can use the get_text() method.

judge_record = soup.find('div', class_='field__item even')
print(judge_record.get_text(' ', strip=True))

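Output:

Born 1974  in Madison, WI Federal Judicial Service: Judge, U.S. District Court for the Middle District of Georgia Nominated by Barack Obama on March 11, 2014, to a seat vacated by W. Louis Sands. Confirmed by the Senate on November 18, 2014, and received commission on November 20, 2014. Education: Brown University, B.A., 1997 Yale Law School, J.D., 2002 Professional Career: Law clerk, Hon. Marvin J. Garbis, U.S. District Court, District of Maryland, 2002-2003 Private practice, Washington, D.C., 2003-2006, 2007-2010 Private practice, Atlanta, Georgia, 2006-2007 Assistant U.S. attorney, Northern District of Georgia, 2010-2014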
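As a possible variation on the get_text() call, passing a newline as the separator keeps each text fragment on its own line, so the result can be split into a list. The sketch below is illustrative only: it assumes the built-in html.parser and uses a trimmed copy of the markup from the question rather than the live page.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
# Assumption: the built-in "html.parser" is used; other parsers may
# build a different tree from malformed tags.
html = """<div class="field__item even">Born 1974 in Madison, WI
<br><br>
<b>Education:</b>
<br> Brown University, B.A., 1997
<br>Yale Law School, J.D., 2002
</div>"""

soup = BeautifulSoup(html, "html.parser")
judge_record = soup.find("div", class_="field__item even")

# Strip every text fragment, join the non-empty ones with newlines,
# then split the joined string back into a list of lines.
lines = judge_record.get_text("\n", strip=True).split("\n")
print(lines)
```

Note that a fragment containing an internal line break would be split into two entries by this approach.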

If you want each line as a separate item in a list, you can use:

judge_record = soup.find('div', class_='field__item even')
result_text = [x.strip() for x in judge_record.contents if isinstance(x, NavigableString)]
print(result_text)

This requires the import from bs4 import BeautifulSoup, NavigableString.

Output:

['Born 1974  in Madison, WI', '', '', 'Judge, U.S. District Court for the Middle District of Georgia', 'Nominated by Barack Obama on March 11, 2014, to a seat vacated by W. Louis Sands. Confirmed by the Senate on November 18, 2014, and received commission on November 20, 2014.', '', '', 'Brown University, B.A., 1997', 'Yale Law School, J.D., 2002', '', '', '', '']

If you do not want the empty strings (''), you can use this instead.

result_text = [x.strip() for x in judge_record.contents if isinstance(x, NavigableString) and x.strip()]
print(result_text)

Output:

['Born 1974  in Madison, WI', 'Judge, U.S. District Court for the Middle District of Georgia', 'Nominated by Barack Obama on March 11, 2014, to a seat vacated by W. Louis Sands. Confirmed by the Senate on November 18, 2014, and received commission on November 20, 2014.', 'Brown University, B.A., 1997', 'Yale Law School, J.D., 2002']
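The lists above leave out the Professional Career entries, because those sit inside a nested <p> tag and .contents only yields the direct children of the div. One way to pick them up, and to group each line under its bold heading, is to walk .descendants instead. This is a rough sketch, not a definitive solution: it assumes the built-in html.parser, uses a trimmed copy of the question's markup, and the "Summary" key for text before the first heading is a made-up name.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the markup from the question (assumption: html.parser).
html = """<div class="field__item even">Born 1974 in Madison, WI
<br><br>
<b>Education:</b>
<br> Brown University, B.A., 1997
<br>Yale Law School, J.D., 2002
<br><br>
<b>Professional Career:</b>
<br>
<p>Law clerk, Hon. Marvin J. Garbis, U.S. District Court, District of Maryland, 2002-2003
<br/>
Private practice, Atlanta, Georgia, 2006-2007<br />
</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
judge_record = soup.find("div", class_="field__item even")

sections = {}
current = "Summary"  # hypothetical key for text before the first <b> heading

# .descendants visits nested nodes too, so text inside <p> is included.
for node in judge_record.descendants:
    if not isinstance(node, NavigableString):
        continue
    text = node.strip()
    if not text:
        continue  # skip whitespace-only fragments between tags
    if node.parent.name == "b":
        # Text inside a bold tag starts a new section.
        current = text.rstrip(":")
        sections.setdefault(current, [])
    else:
        sections.setdefault(current, []).append(text)

for heading, entries in sections.items():
    print(heading, entries)
```

The grouping relies on every heading being wrapped in a <b> tag, as in the sample; pages that mark headings differently would need another test in place of node.parent.name == "b".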