Question

I want to get only the text of a web page's content, for any website. I am using BeautifulSoup to do this.

I wrote the following function:

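def textClean(text):
    """Take the input HTML text and strip the HTML tags from it."""
    from bs4 import BeautifulSoup

    # Name the parser explicitly so the result does not depend on what is installed.
    souptext = BeautifulSoup(text, "html.parser")
    print(text)                 # first print: the raw HTML source
    print(souptext.get_text())  # second print: the extracted text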

This prints the raw HTML source and then the text.

However, this is the sample output I get:

HTML output (first print statement):

<p><img style="float:right;" src="http://static4.businessinsider.com/image/56eb68e791058427008b72e5-907-680/5550538407_c22babffba_b.jpg" alt="radar" data-mce-source="US Navy" data-mce-caption="Mineman Seaman Charles Bryan watches for contacts on the SPA 256 radar while on watch in the Combat Directive Center aboard the mine countermeasures ship USS Ardent (MCM 12)." data-link="https://www.flickr.com/photos/usnavy/5550538407/in/photolist-9stXG4-e6i1uU-e6i1tE-dLSiBQ-c9jmg7-f5LbtS-r9jw69-efvjaN-duNiV6-efpeEP-eW8Dg9-q1nZiQ-en2osX-duNiTa-njkj3s-eep3Mb-kUdU5g-9d7u4E-eeoYiC-fr2CuX-axHdte-fsVD3D-drHPeJ-9rAVac-cnMSiW-9vVcbN-enB31b-f23pKF-aBjveY-9rEhwY-9u6GZy-9rDT9L-bojAAh-9uiNiU-9AJSrB-9rFxwQ-bjkanD-aefpN9-ea2WB2-ea2WyR-a1tUoa-9rAUXZ-ea8Bf9-9Wm3Z8-9rNE7o-enB1YY-9rAUHX-ea2WpF-aNR7eD-9NX2pq" /><span class="source">US Navy</span></p><p>The United States has seen Chinese activity around a reef that China seized from the Philippines nearly four years ago that could be a precursor to more land reclamation in the disputed South China Sea, the U.S. Navy chief said on Thursday.</p>

Second output (second print statement):

US NavyThe United States has seen Chinese activity around a reef that China seized from the Philippines nearly four years ago that could be a precursor to more land reclamation in the disputed South China Sea, the U.S. Navy chief said on Thursday.

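As you can see, the text between the tags

<span class="source">US Navy</span></p>

is also extracted, which I don't want: in the original article (link below), that text is not part of the article body.

I know get_text() fetches all the text, so I am looking for a simple solution that extracts the text between paragraph tags but excludes span tags, since I don't consider the text inside span tags to be part of the original article.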

Here is the link to the article I used.

enter link description here

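Edit 1:

I get the output shown below; every column is converted to unicode.

This is the mapping function I wrote to process each record of a Spark DataFrame and strip the HTML tags from the DataFrame's 'desc' column.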

def htmlParsing(x):
    """Take a Spark Row and strip the HTML tags from its 'desc' column."""
    from bs4 import BeautifulSoup

    row = x.asDict()
    souptext = BeautifulSoup(row['desc'], "html.parser")
    p_tags = souptext.find_all('p')
    for p in p_tags:
        if p.string:
            # Returns at the first text-only <p>; a row without one yields None.
            ret_list = (int(row['id']), row['title'], p.string)
            return ret_list


sdf_cleaned=sdf_rss.map(htmlParsing)        

sdf_cleaned.take(4)

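[(-33753621, u'Royal Bank of Scotland is testing a robot that could solve your banking problems (RBS)', u'If you hate dealing with bank tellers or customer service representatives, then the Royal Bank of Scotland might have a solution for you.'), (-761323061, u'Teen sexting is prompting an overhaul in child pornography laws', u'Rampant teen sexting has left politicians and law enforcement authorities around the country struggling to find some kind of legal middle ground between prosecuting students for child porn and letting them off the hook.'), (1405376555, u'On further scrutiny, China has begun a new construction project in the South China Sea', u'The United States has seen Chinese activity around a reef that China seized from the Philippines nearly four years ago that could be a precursor to more land reclamation in the disputed South China Sea, the U.S. Navy chief said on Thursday.'), (-1882022821, u'Ignition lock laws are reducing drunk driving deaths', u'States that require convicted drunk drivers to install ignition interlock devices in their cars saw a 15% drop in alcohol-related crash deaths compared with states without those requirements, a study showed.')]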

Answer 1

import requests, bs4
r = requests.get('http://www.businessinsider.com/r-exclusive-us-sees-new-chinese-activity-around-south-china-sea-shoal-2016-3')
soup = bs4.BeautifulSoup(r.text, 'lxml')

p_tags = soup.find_all('p')
for p in p_tags:
    if p.string:
        print(p.string)

.string

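If a tag has only one child, and that child is a NavigableString, the child is made available as .string.
If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None.

So .string returns text only for p tags that contain nothing but text.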
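
The behaviour described above can be checked on a small fragment. This is a minimal sketch: the shortened HTML and the variable names are made up for illustration, and it uses the built-in "html.parser" rather than lxml so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the markup in the question (hypothetical sample).
html = (
    '<p><img src="radar.jpg" alt="radar" /><span class="source">US Navy</span></p>'
    '<p>The United States has seen Chinese activity around a reef.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first_p, second_p = soup.find_all("p")

# The first <p> has two children (<img> and <span>), so .string is None.
caption_string = first_p.string
# The second <p> has a single NavigableString child, so .string is that text.
body_string = second_p.string

print(caption_string)
print(body_string)
```

One caveat: if a tag's only child is another tag that itself has a .string, the parent reports the same .string. A caption paragraph holding nothing but the span (no img) would therefore still pass the `if p.string` check.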

Out:

  The United States has seen Chinese activity around a reef that
  China seized from the Philippines nearly four years ago that
  could be a precursor to more land reclamation in the disputed
  South China Sea, the U.S. Navy chief said on Thursday.


  The head of U.S. naval operations, Admiral John Richardson,
  expressed concern that an international court ruling expected in
  coming weeks on a case brought by the Philippines against China
  over its South China Sea claims could be a trigger for Beijing to
  declare an exclusion zone in the busy trade route.


  Richardson told Reuters the United States was weighing responses
  to such a move.


  He said the U.S. military had seen Chinese activity around
  Scarborough Shoal in the northern part of the Spratly
  archipelago, about 125 miles (200 km) west of the Philippine base
  of Subic Bay.


  "I think we see some surface ship activity and those sorts of
  things, survey type of activity, going on. That’s an area of
  concern ... a next possible area of reclamation," he said.


  Richardson said it was unclear if the activity near the reef,
  which China seized in 2012, was related to the pending
  arbitration decision.


  He said China's pursuit of South China Sea territory, which has
  included massive land reclamation to create artificial islands
  elsewhere in the Spratlys, threatened to reverse decades of open
  access and introduce new "rules" that required countries to
  obtain permission before transiting those waters.


  He said that was a worry given that 30 percent of the world's
  trade passes through the region.


  Asked whether China could respond to the ruling by the court of
  arbitration in The Hague by declaring an air defense
  identification zone, or ADIZ, as it did farther north in the East
  China Sea in 2013, Richardson said: "It’s definitely a concern."


  "We will just have to see what happens," he said. "We think about
  contingencies and … responses."


  Richardson said the United States planned to continue carrying
  out freedom-of-navigation exercises within 12 nautical miles of
  disputed South China Sea geographical features to underscore its
  concerns about keeping sea lanes in the region open.


  The United States responded to the East China Sea ADIZ by flying
  B-52 bombers through the zone in a show of force in November
  2013.


  Richardson said he was struck by how China's increasing
  militarization of the South China Sea had increased the
  willingness of other countries in the region to work together,
  not just bilaterally, but also multilaterally.


  India and Japan joined the U.S. Navy in the Malabar naval
  exercise since 2014, and were slated to take part again this year
  in an even more complex exercise that will take place in an area
  close to the East and South China Seas.


  South Korea, Japan and the United States were also working
  together more closely than ever before, he said.


  Richardson said the United States would welcome the participation
  of other countries in joint patrols with the United States in the
  South China Sea, but those decisions needed to be made by the
  countries in question.


  He said the U.S. military saw good opportunities to build and
  rebuild relationships with countries such as Vietnam, the
  Philippines and India, which have all realized the importance of
  safeguarding the freedom of the seas.


  He cited India's recent hosting of an international fleet review
  that included 75 ships from 50 navies, and said the United States
  was exploring opportunities to increase its use of ports in the
  Philippines and Vietnam, among others - including the former U.S.
  naval base at Vietnam's Cam Ranh Bay.


  But he said Washington needed to proceed judiciously rather than
  charging in "very fast and very heavy," given the enormous
  influence and importance of the Chinese economy in the region.


  "We have to be sophisticated in how we approach this so that we
  don’t force any of our partners into an uncomfortable position
  where they have to make tradeoffs that are not in their best
  interest," he said.


  "We would hope to have an approach that would ... include us a
  primary partner but not necessarily to the exclusion of other
  partners in the region," he said.

The United States has seen Chinese activity...
5 innovations in radiology that could impact everything from the Zika virus to dermatology
Keep tabs on the latest from Business Insider in our new Chrome Extension
Available on iOS or Android
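
The same .string filter could be applied to one record of the kind shown in the question's Edit 1. The sketch below is only an illustration under assumptions: the function name clean_record and the sample record are hypothetical, the record is a plain dict shaped like the result of row.asDict(), and all text-only paragraphs are joined instead of returning only the first one.

```python
from bs4 import BeautifulSoup

def clean_record(record):
    """Return (id, title, text) for a dict with 'id', 'title' and 'desc' keys.

    The dict shape mirrors row.asDict() from the question (an assumption).
    """
    soup = BeautifulSoup(record["desc"], "html.parser")
    # Keep only <p> tags whose .string is set, i.e. paragraphs that are plain text.
    text_only = [str(p.string).strip() for p in soup.find_all("p") if p.string]
    return (int(record["id"]), record["title"], " ".join(text_only))

# Hypothetical record for demonstration.
sample_record = {
    "id": "1405376555",
    "title": "Example title",
    "desc": (
        '<p><img src="radar.jpg" /><span class="source">US Navy</span></p>'
        "<p>First paragraph.</p><p>Second paragraph.</p>"
    ),
}
cleaned = clean_record(sample_record)
print(cleaned)
```

In PySpark this would presumably be called per row, for example through sdf_rss.rdd.map(lambda row: clean_record(row.asDict())), where sdf_rss is the DataFrame from the question.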

Answer 2

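As you noticed, get_text() consumes all the tags and retrieves the text underneath them.

You need to target the element you actually want, with something like this.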

from bs4 import BeautifulSoup

html = '''
<p>
  <img style="float:right;" src="http://static4.businessinsider.com/image/56eb68e791058427008b72e5-907-680/5550538407_c22babffba_b.jpg" alt="radar" data-mce-source="US Navy" data-mce-caption="Mineman Seaman Charles Bryan watches for contacts on the SPA 256 radar while on watch in the Combat Directive Center aboard the mine countermeasures ship USS Ardent (MCM 12)." data-link="https://www.flickr.com/photos/usnavy/5550538407/in/photolist-9stXG4-e6i1uU-e6i1tE-dLSiBQ-c9jmg7-f5LbtS-r9jw69-efvjaN-duNiV6-efpeEP-eW8Dg9-q1nZiQ-en2osX-duNiTa-njkj3s-eep3Mb-kUdU5g-9d7u4E-eeoYiC-fr2CuX-axHdte-fsVD3D-drHPeJ-9rAVac-cnMSiW-9vVcbN-enB31b-f23pKF-aBjveY-9rEhwY-9u6GZy-9rDT9L-bojAAh-9uiNiU-9AJSrB-9rFxwQ-bjkanD-aefpN9-ea2WB2-ea2WyR-a1tUoa-9rAUXZ-ea8Bf9-9Wm3Z8-9rNE7o-enB1YY-9rAUHX-ea2WpF-aNR7eD-9NX2pq" />
  <span class="source">US Navy</span>
</p>
<p>
  The United States has seen Chinese activity around a reef that China seized from the Philippines nearly four years ago that could be a precursor to more land reclamation in the disputed South China Sea, the U.S. Navy chief said on Thursday.
</p>'''

soup = BeautifulSoup(html, "html.parser")

# The second <p> (index 1) holds the article text; the first holds only the image caption.
print(soup.find_all('p')[1].get_text())
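
Picking the paragraph by index works for this fragment but depends on where the caption sits. A possible alternative, sketched here under the assumption that captions are always wrapped in span elements with class "source" as in the question's HTML, is to remove those spans before extracting text. The helper name paragraphs_without_captions and the shortened sample are hypothetical.

```python
from bs4 import BeautifulSoup

def paragraphs_without_captions(html):
    """Return the text of each non-empty <p>, ignoring <span class="source"> captions."""
    soup = BeautifulSoup(html, "html.parser")
    # decompose() removes the tag and its contents from the tree.
    for span in soup.find_all("span", class_="source"):
        span.decompose()
    paragraphs = []
    for p in soup.find_all("p"):
        text = p.get_text().strip()
        if text:  # skip paragraphs left empty once the caption is gone
            paragraphs.append(text)
    return paragraphs

# Shortened stand-in for the markup in the question (hypothetical sample).
sample = (
    '<p><img src="radar.jpg" alt="radar" /><span class="source">US Navy</span></p>'
    "<p>The United States has seen Chinese activity.</p>"
)
result = paragraphs_without_captions(sample)
print(result)
```

Unlike the .string check, this keeps paragraphs that contain inline markup such as links or emphasis, at the cost of hard-coding which spans count as captions.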
