Trying to find all the text between multiple span tags using BeautifulSoup

Time: 2016-05-19 05:37:01

Tags: python beautifulsoup

I'm trying to extract a passage of text from an article (http://www.reuters.com/article/us-myanmar-usa-sanctions-idUSKCN0Y92RK). Below is the specific part of the markup I want to get:

<span id="midArticle_start"></span>

<span id="midArticle_0"></span>
<span class="focusParagraph"><p><span class="articleLocation">YANGON</span>
  Standing among the party seeing off Myanmar's new president as he left for Russia on Wednesday was leading businessman Htun Myint Naing, better known as Steven Law.</p></span>
<span id="midArticle_1"></span><p>Only the day before, the United States had added six of his companies to the Treasury's blacklist, a move that is unlikely to hamper the tycoon's business empire significantly.</p>
<span id="midArticle_2"></span><p>President Barack Obama's sanctions policy on Myanmar, updated on Tuesday, aims to strike a balance between targeting individuals without undermining development or deterring U.S. businesses eying the country as it opens up to global trade.</p>
<span id="midArticle_3"></span><p>Underlining how tricky that balance is, Law may actually gain commercially from the latest changes, even if they do make it harder for him to portray himself as an internationally accepted businessman close to the new democratic government.</p>
<span id="midArticle_4"></span><p>"Though (sanctions) are not meant to have a blanket effect on the country, their intended targets often play outsize roles ... controlling critical infrastructure impacting trade and business for ordinary citizens," said Nyantha Maw Lin, managing director at consultancy Vriens & Partners in Yangon.</p>
<span id="midArticle_5"></span><p>On Tuesday, Washington eased some restrictions on Myanmar but also strengthened measures against Law by adding six firms connected to him and his conglomerate, Asia World, to the Treasury blacklist.</p>
<span id="midArticle_6"></span><p>Yet the blacklisting, which attracted considerable attention in Myanmar, looks like a formality given that the companies were already covered by sanctions, because they were owned 50 percent or more by Law or Asia World. Law was sanctioned in 2008 for alleged ties to Myanmar's military.</p>
<span id="midArticle_7"></span><p>More important for Law was the U.S. decision to further ease restrictions on trading through his shipping port and airports, extending a temporary six month allowance set in December to an indefinite one.</p>
<span id="midArticle_8"></span><p></p>
<span id="midArticle_9"></span><p>PORTS BACK IN FAVOR</p>
<span id="midArticle_10"></span><p>Law is one of the most powerful and well-connected businessmen in Myanmar with close ties to China.</p>
<span id="midArticle_11"></span><p>He is not, however, universally popular at home or abroad because of alleged ties to the military, which ruled Myanmar with an iron fist until 2011.</p>
<span id="midArticle_12"></span>

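The end goal is to have each sentence as a separate object that I can use later, for example:

print(sentence1)
Standing among the party seeing off Myanmar's new president as he left for Russia on Wednesday was leading businessman Htun Myint Naing, better known as Steven Law.

print(sentence2)
Only the day before, the United States had added six of his companies to the Treasury's blacklist, a move that is unlikely to hamper the tycoon's business empire significantly.

My code only retrieves the first sentence and nothing after it, as shown below: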

import requests
from bs4 import BeautifulSoup
z = requests.get("http://www.reuters.com/article/us-myanmar-usa-sanctions-idUSKCN0Y92RK/")
url2 = 'http://www.reuters.com/article/us-myanmar-usa-sanctions-idUSKCN0Y92RK'
response2 = requests.get(url2)

soup2 = BeautifulSoup(response2.content, "html.parser")
first_sentence = soup2.p.get_text()
print(first_sentence)
second_sentence = soup2.p.find_all_next()
print(second_sentence)

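If anyone could help me figure out how to get all of the sentences individually, I would greatly appreciate it. I have already tried the approaches discussed in other Stack Overflow questions: "Finding next occuring tag and its enclosed text with Beautiful Soup" and "Using beautifulsoup to extract text between line breaks (e.g. <br /> tags)".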

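3 answers:

Answer 0: (score: 0)

Your problem is probably that find_all_next() returns every match that appears after the starting element (the previously matched <p>), and since you did not specify a tag to match, it matches everything.

If you change it to soup2.p.find_all_next("p"), you get all the remaining <p> tags on the page, and you can then loop over them with something like: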

soup2 = BeautifulSoup(response2.content, "html.parser")
first_sentence = soup2.p.get_text()
print(first_sentence)
for sentence in soup2.p.find_all_next("p"):
    print(sentence.get_text())

It is even simpler if you drop the other variables and just use find_all() instead:

soup2 = BeautifulSoup(response2.content, "html.parser")
for sentence in soup2.find_all("p"):
    print(sentence.get_text())
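
The difference between the two calls can be seen on a small offline example. This is only a sketch: the markup below imitates the structure from the question with shortened placeholder text (it is not fetched from the live page), and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Placeholder markup shaped like the question's snippet; the paragraph text
# is shortened for illustration and is not the real article text.
html = """
<span id="midArticle_0"></span>
<span class="focusParagraph"><p><span class="articleLocation">YANGON</span>
  First paragraph.</p></span>
<span id="midArticle_1"></span><p>Second paragraph.</p>
<span id="midArticle_2"></span><p>Third paragraph.</p>
<span id="midArticle_3"></span>
"""

soup = BeautifulSoup(html, "html.parser")
first_p = soup.p

# No filter: every tag parsed after the first <p>, including the marker <span>s.
everything_after = first_p.find_all_next()

# Filter on "p": only the paragraphs that follow the first one.
paragraphs_after = first_p.find_all_next("p")

following_tag_names = [tag.name for tag in everything_after]
following_paragraphs = [p.get_text() for p in paragraphs_after]

print(following_tag_names)
print(following_paragraphs)
```

Keeping the results in a list means each paragraph can be used later by index, for example following_paragraphs[0].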

Answer 1: (score: 0)

You can return all the <p> elements inside the <span> element whose id is 'articleText' by using the CSS selector #articleText p:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> url2 = 'http://www.reuters.com/article/us-myanmar-usa-sanctions-idUSKCN0Y92RK'
>>> response2 = requests.get(url2)
>>> soup2 = BeautifulSoup(response2.content, "html.parser")
>>> for sentence in soup2.select("#articleText p"):
...     print(sentence.get_text())
...     print()
... 
YANGON Standing among the party seeing off Myanmar's new president as he left for Russia on Wednesday was leading businessman Htun Myint Naing, better known as Steven Law.

Only the day before, the United States had added six of his companies to the Treasury's blacklist, a move that is unlikely to hamper the tycoon's business empire significantly.

President Barack Obama's sanctions policy on Myanmar, updated on Tuesday, aims to strike a balance between targeting individuals without undermining development or deterring U.S. businesses eying the country as it opens up to global trade.

Underlining how tricky that balance is, Law may actually gain commercially from the latest changes, even if they do make it harder for him to portray himself as an internationally accepted businessman close to the new democratic government.

......
......
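
To end up with separate objects rather than printed lines, the selected paragraphs can be collected into a list. The following is a sketch that runs offline: it uses two paragraphs copied from the question's markup, and the wrapper with id="articleText" is an assumption based on this answer, since the question's snippet does not show it.

```python
from bs4 import BeautifulSoup

# Two paragraphs copied from the question's markup, plus its empty <p></p>.
# The id="articleText" wrapper is assumed from the answer, not shown in the question.
html = """
<span id="articleText">
<span id="midArticle_0"></span>
<span class="focusParagraph"><p><span class="articleLocation">YANGON</span>
  Standing among the party seeing off Myanmar's new president as he left for Russia on Wednesday was leading businessman Htun Myint Naing, better known as Steven Law.</p></span>
<span id="midArticle_1"></span><p>Only the day before, the United States had added six of his companies to the Treasury's blacklist, a move that is unlikely to hamper the tycoon's business empire significantly.</p>
<span id="midArticle_8"></span><p></p>
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text(" ", strip=True) strips each text fragment and joins them with a
# single space; the trailing "if text" drops empty paragraphs.
sentences = [
    text
    for text in (p.get_text(" ", strip=True) for p in soup.select("#articleText p"))
    if text
]

sentence1, sentence2 = sentences[0], sentences[1]
print(sentence1)
print(sentence2)
```

The first entry still begins with the dateline "YANGON" because the articleLocation span sits inside the same <p>.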

Answer 2: (score: 0)

You can try: soup2.p.find_all_next(text=True)

Like this:

second_sentence = soup2.p.find_all_next(text=True)

for item in second_sentence:
    print(item.split('\n'))
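
Note that matching text nodes this way also returns the whitespace between tags, so some filtering is usually needed. Below is a sketch on placeholder markup (not the real article). It uses string=True, which to my knowledge is the newer spelling of text=True in Beautiful Soup 4.4.0 and later; on older versions, text=True should behave the same.

```python
from bs4 import BeautifulSoup

# Placeholder markup for illustration only; not the real article text.
html = """<p>First paragraph.</p>
<span id="midArticle_1"></span><p>Second paragraph.</p>
<span id="midArticle_2"></span><p></p>
<span id="midArticle_3"></span><p>Third paragraph.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Every text node parsed after the opening of the first <p>; this includes
# that paragraph's own text and the newline strings between tags.
text_nodes = soup.p.find_all_next(string=True)

# Drop whitespace-only nodes and trim the rest.
cleaned = [node.strip() for node in text_nodes if node.strip()]

print(cleaned)
```

On a full page this approach would also pick up text from outside the article body, so restricting the search to the article container first is probably safer.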