Question

I am using Beautiful Soup and requests to try to scrape text information from an HTML page, shown at the bottom of this post. I tried using

judge_record = judge_soup.find("div", {"class": "field__item even"})

and then

result = judge_record.findAll("br")

to extract the text between the br tags and the bold tags.

Unfortunately, when I do this, all I get back is:

[ , , , , , , , , , , , Private practice, Washington, D.C., 2003-2006, 2007-2010 Private practice, Atlanta, Georgia, 2006-2007 Assistant U.S. attorney, Northern District of Georgia, 2010-2014 , Private practice, Atlanta, Georgia, 2006-2007 Assistant U.S. attorney, Northern District of Georgia, 2010-2014 , Assistant U.S. attorney, Northern District of Georgia, 2010-2014 , ] [Finished in 1.0s]

Is this because the <br> tags do not have matching closing tags?

Any suggestions would be greatly appreciated.

<div class="field field--name-judge-record-display field--type-ds field--label-hidden">
    <div class="field__items">
        <div class="field__item even">Born 1974  in Madison, WI

            <br><br>
            <b>Federal Judicial Service:</b>

            <br> Judge, U.S. District Court for the Middle District of Georgia</br>
            <br>Nominated by Barack Obama on March 11, 2014, to a seat vacated by W. Louis Sands. Confirmed by the Senate on November 18, 2014, and received commission on November 20, 2014. 
            <br><br>
            <b>Education:</b>

            <br> Brown University, B.A., 1997
            <br>Yale Law School, J.D., 2002

            <br><br>
            <b>Professional Career:</b>

            <br>
            <p>Law clerk, Hon. Marvin J. Garbis, U.S. District Court, District of Maryland, 2002-2003
            <br/>


            Private practice, Washington, D.C., 2003-2006, 2007-2010<br />
            Private practice, Atlanta, Georgia, 2006-2007<br />

            Assistant U.S. attorney, Northern District of Georgia, 2010-2014<br />
            </p>


</div>

Answer 1

To get the text inside the div tag, you can use the get_text() function.

judge_record = soup.find('div', class_='field__item even')
print(judge_record.get_text(' ', strip=True))

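Output:

Born 1974  in Madison, WI Federal Judicial Service: Judge, U.S. District Court for the Middle District of Georgia Nominated by Barack Obama on March 11, 2014, to a seat vacated by W. Louis Sands. Confirmed by the Senate on November 18, 2014, and received commission on November 20, 2014. Education: Brown University, B.A., 1997 Yale Law School, J.D., 2002 Professional Career: Law clerk, Hon. Marvin J. Garbis, U.S. District Court, District of Maryland, 2002-2003 Private practice, Washington, D.C., 2003-2006, 2007-2010 Private practice, Atlanta, Georgia, 2006-2007 Assistant U.S. attorney, Northern District of Georgia, 2010-2014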
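If the line boundaries matter, get_text() also accepts a different separator. The following is a minimal sketch, assuming a trimmed-down stand-in for the page in the question (the html string below is a shortened copy made up for illustration) and the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question (assumption:
# shortened for illustration, not the full original markup).
html = """
<div class="field__item even">Born 1974 in Madison, WI
<br><br>
<b>Education:</b>
<br> Brown University, B.A., 1997
<br>Yale Law School, J.D., 2002
</div>
"""

soup = BeautifulSoup(html, "html.parser")
judge_record = soup.find("div", class_="field__item even")

# Join the stripped text fragments with newlines instead of spaces,
# so each fragment ends up on its own line.
text = judge_record.get_text("\n", strip=True)
lines = text.splitlines()
print(lines)
```

Splitting on newlines afterwards only works cleanly when the individual text fragments do not contain line breaks of their own.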

If you want all the separate lines in a list, you can use:

judge_record = soup.find('div', class_='field__item even')
result_text = [x.strip() for x in judge_record.contents if isinstance(x, NavigableString)]
print(result_text)

This requires the import: from bs4 import BeautifulSoup, NavigableString

Output:

['Born 1974  in Madison, WI', '', '', 'Judge, U.S. District Court for the Middle District of Georgia', 'Nominated by Barack Obama on March 11, 2014, to a seat vacated by W. Louis Sands. Confirmed by the Senate on November 18, 2014, and received commission on November 20, 2014.', '', '', 'Brown University, B.A., 1997', 'Yale Law School, J.D., 2002', '', '', '', '']

If you don't want the empty strings (''), you can use this instead:

result_text = [x.strip() for x in judge_record.contents if isinstance(x, NavigableString) and x.strip()]
print(result_text)

Output:

['Born 1974  in Madison, WI', 'Judge, U.S. District Court for the Middle District of Georgia', 'Nominated by Barack Obama on March 11, 2014, to a seat vacated by W. Louis Sands. Confirmed by the Senate on November 18, 2014, and received commission on November 20, 2014.', 'Brown University, B.A., 1997', 'Yale Law School, J.D., 2002']
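Note that the list above only covers the direct children of the div, so the lines under Professional Career, which sit inside the nested <p>, are missing. One possible way to pick those up and group each line under its bold heading is sketched below. It assumes a trimmed-down stand-in for the page and the built-in "html.parser" backend; the function name group_by_heading and the "Header" key are hypothetical names chosen for this example, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question (assumption:
# shortened for illustration, not the full original markup).
html = """
<div class="field__item even">Born 1974 in Madison, WI
<br><br>
<b>Education:</b>
<br> Brown University, B.A., 1997
<br>Yale Law School, J.D., 2002
<br><br>
<b>Professional Career:</b>
<br>
<p>Private practice, Atlanta, Georgia, 2006-2007<br />
Assistant U.S. attorney, Northern District of Georgia, 2010-2014<br />
</p>
</div>
"""


def group_by_heading(container):
    """Collect non-empty text lines under the most recent <b> heading."""
    sections = {}
    current = "Header"  # hypothetical key for text before the first <b>
    # find_all(string=True) walks every text node, including nested ones.
    for node in container.find_all(string=True):
        text = node.strip()
        if not text:
            continue  # skip whitespace-only fragments between tags
        if node.parent.name == "b":
            # A bold label starts a new section.
            current = text.rstrip(":")
            sections[current] = []
        else:
            sections.setdefault(current, []).append(text)
    return sections


soup = BeautifulSoup(html, "html.parser")
judge_record = soup.find("div", class_="field__item even")
sections = group_by_heading(judge_record)
for heading, entries in sections.items():
    print(heading, entries)
```

Because the text nodes are visited recursively, this does not depend on whether a line is a direct child of the div or wrapped in a <p>. How stray tags such as </br> are repaired can differ between parsers, so results on the full page may vary with the parser in use.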

Scraping text from improperly formatted tags with Beautiful Soup
