Question

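I need to count the characters in news articles. Some pages contain a lot of material I don't need (navigation, footers, and so on). I have managed to get rid of all of that, but a few things remain that I am struggling to remove: image copyright notices, image and video captions, and adverts. Can anyone suggest how to improve the code below so that I get only the useful text of the article?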

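import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.bbc.co.uk/news/world-europe-39612562")
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(r.content, "html.parser")
for s in soup.find_all("div", {"class": "story-body__inner"}):
    # Join every text node found inside the article container
    article = ''.join(s.find_all(string=True))
print(article)
print(len(article))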

For this particular URL, the code produces the output below (only the top part is shown, to illustrate the problem):

Image copyright
AFP


Image caption

                    Erdogan supporters began celebrating early outside party headquarters in Ankara


Turks have backed President Recep Tayyip Erdogan's call for sweeping new presidential powers, partial official results of a referendum indicate.With about 98% of ballots counted, "Yes" was on about 51.3% and "No" on about 48.7%.Erdogan supporters say replacing the parliamentary system with an executive presidency would modernise the country. Opponents have attacked a decision to accept unstamped ballot papers as valid unless proven otherwise.The main opposition Republican People's Party (CHP) is already demanding a recount of 60% of the votes.


            /**/
            (function() {
                if (window.bbcdotcom && bbcdotcom.adverts && bbcdotcom.adverts.slotAsync) {
                    bbcdotcom.adverts.slotAsync('mpu', [1,2,3]);
                }
            })();
            /**/

Answer 1

It looks like you don't need the script and figure tags, so:

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.bbc.co.uk/news/world-europe-39612562")
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(r.content, "html.parser")

# delete unwanted tags (and everything inside them) from the tree:
for e in soup(['figure', 'script']):
    e.decompose()

# collect the text of each article container
article_soup = [e.get_text() for e in soup.find_all(
                'div', {'class': 'story-body__inner'})]

article = ''.join(article_soup)
print(article)
print(len(article))
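
The same idea can be tried without a network request. The following is only a minimal sketch: the SAMPLE_HTML markup and the extract_article_text helper are made up for illustration and are not taken from the BBC page. It also assumes that a single space between text fragments is acceptable, which changes the character count compared with joining the fragments directly.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption, not BBC's HTML)
SAMPLE_HTML = """
<div class="story-body__inner">
  <figure>
    <span>Image copyright</span><span>AFP</span>
    <figcaption>Image caption Supporters celebrating in Ankara</figcaption>
  </figure>
  <p>First paragraph of the article.</p>
  <script>(function() { window.ads = true; })();</script>
  <p>Second paragraph of the article.</p>
</div>
"""


def extract_article_text(html, container_class="story-body__inner"):
    """Return the container text with figure and script content removed."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove the unwanted tags, including everything nested inside them
    for tag in soup(["figure", "script"]):
        tag.decompose()

    # strip=True drops whitespace-only fragments; " " separates the rest
    parts = [
        container.get_text(" ", strip=True)
        for container in soup.find_all("div", class_=container_class)
    ]
    return " ".join(parts)


article = extract_article_text(SAMPLE_HTML)
print(article)
print(len(article))
```

Passing a separator to get_text also avoids sentences running together, as in "indicate.With about" in the output shown in the question. If the count must not include those separators, pass "" instead of " ".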

BeautifulSoup: further cleaning up article text
