Pulling text between two BeautifulSoup elements

Date: 2016-04-01 13:57:50

Tags: python beautifulsoup html-parsing

This whole question has already been asked and answered in a few places: http://www.resolvinghere.com/sof/18408799.shtml

How to get all text between just two specified tags using BeautifulSoup?

But when I try to implement it, I end up with unwieldy strings.

My setup: I'm trying to pull transcript text from the presidential debates, and I figured I'd start here: http://www.presidency.ucsb.edu/ws/index.php?pid=111500

I can isolate the transcript with:

transcript = soup.find_all("span", class_="displaytext")[0]

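The transcript isn't ideally formatted. Every few lines of text sit in their own <p>, and a change of speaker is marked with a nested <b>. For example: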

<p><b>TRUMP:</b> First of all, I have to say, as a businessman, I get along with everybody. I have business all over the world. [<i>booing</i>]</p>,
<p>I know so many of the people in the audience. And by the way, I'm a self-funder. I don't have — I have my wife and I have my son. That's all I have. I don't have this. [<i>applause</i>]</p>,
<p>So let me just tell you, I get along with everybody, which is my obligation to my company, to myself, et cetera.</p>,
<p>Obviously, the war in Iraq was a big, fat mistake. All right? Now, you can take it any way you want, and it took — it took Jeb Bush, if you remember at the beginning of his announcement, when he announced for president, it took him five days.</p>,
<p>He went back, it was a mistake, it wasn't a mistake. It took him five days before his people told him what to say, and he ultimately said, "It was a mistake." The war in Iraq, we spent $2 trillion, thousands of lives, we don't even have it. Iran has taken over Iraq, with the second-largest oil reserves in the world.</p>,
<p>Obviously, it was a mistake.</p>,
<p><b>DICKERSON:</b> So...</p>

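But like I said, this is not a new question. Define a start tag and an end tag, walk through the elements, and, as long as the current element is not the end tag (current != next), add the text.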
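The walk described above could be sketched roughly as follows. This is only an illustration on a small inline fragment modelled on the excerpt, not the code from the linked answers; the `text_between` helper and the sample `html` string are hypothetical, and the built-in "html.parser" is assumed so that no extra parser is needed.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical fragment modelled on the transcript excerpt above.
html = (
    "<span class='displaytext'>"
    "<p><b>TRUMP:</b> First of all, I get along with everybody.</p>"
    "<p>Obviously, it was a mistake.</p>"
    "<p><b>DICKERSON:</b> So...</p>"
    "</span>"
)

soup = BeautifulSoup(html, "html.parser")
start_tag, end_tag = soup.find_all("b")[:2]


def text_between(start, end):
    """Collect the text nodes that come after `start` and before `end`."""
    pieces = []
    for element in start.next_elements:
        if element is end:
            break  # reached the next speaker label
        if not isinstance(element, NavigableString):
            continue  # tags carry no text of their own; their strings follow
        if element.parent is start:
            continue  # skip the speaker label inside the start <b>
        text = element.strip()
        if text:
            pieces.append(text)
    return pieces


print(text_between(start_tag, end_tag))
```

Note that each text node is appended as one list item, so the pieces stay whole strings.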

So I'm testing on a single element first to get the details right.

startTag = transcript.find_all('b')[165]
endTag = transcript.find_all('b')[166]
content = []
content += startTag.string
content

The result I get is [u'R', u'U', u'B', u'I', u'O', u':'] rather than [u'RUBIO:'].

What am I missing?

1 answer:

Answer 0: (score: 2)

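The idea is to find all the b elements in the transcript, then take each b element's parent and go through the paragraphs that follow it until reaching one that contains a b element. Implementation: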

from bs4 import BeautifulSoup, Tag
import requests

url = "http://www.presidency.ucsb.edu/ws/index.php?pid=111500"
response = requests.get(url)

soup = BeautifulSoup(response.content, "html5lib")
transcript = soup.find("span", class_="displaytext")
for item in transcript.find_all("b")[3:]:  # skipping first irrelevant parts
    part = [" ".join(sibling.get_text(strip=True) if isinstance(sibling, Tag) else sibling.strip()
                     for sibling in item.next_siblings)]
    for paragraph in item.parent.find_next_siblings("p"):
        if paragraph.b:
            break

        part.append(paragraph.get_text(strip=True))

    print(item.get_text(strip=True))
    print("\n".join(part))
    print("-----")

Prints:

DICKERSON:
Good evening. I'm John Dickerson. This holiday weekend, as America honors our first president, we're about to hear from six men who hope to be the 45th. The candidates for the Republican nomination are here in South Carolina for their ninth debate, one week before this state holds the first-in-the-South primary.
George Washington ...
-----
DICKERSON:
Before we get started, candidates, here are the rules. When we ask you a question, you will have one minute to answer, and 30 seconds more if we ask a follow-up. If you're attacked by another candidate, you get 30 seconds to respond.
...
-----
TRUMP:
Well, I can say this. If the president, and if I were president now, ...
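
As for the character-by-character list in the question, the likely cause is that `+=` on a list behaves like `list.extend()`: it iterates over the right-hand side, and iterating over a string yields single characters. A minimal sketch, using a plain string as a stand-in for `startTag.string`:

```python
speaker = u"RUBIO:"  # stand-in for startTag.string

# `+=` on a list extends it with each item of the iterable on the right;
# for a string, those items are its individual characters.
extended = []
extended += speaker

# append() adds the whole string as a single list element.
appended = []
appended.append(speaker)

print(extended)
print(appended)
```

Here `extended` ends up with six one-character items, while `appended` holds the single string 'RUBIO:'. Using `content.append(startTag.string)` (or `content += [startTag.string]`) should therefore give the list the question expected.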