Question

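The end goal is to provide clean plain text for speech processing. That means I need to remove subheadings, links, bullet points and so on. The code below shows the steps I take to clean up an example URL bit by bit. I'm now stuck on two things that are very common and always have the same structure:

'By Reporter Name, City'
'Read more: link'

I'm not good with regular expressions, but I think they might help to remove these two parts. Or maybe someone can suggest another way to handle these patterns. Thanks!

My code: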

import requests
from bs4 import BeautifulSoup
import translitcodec
import codecs

def get_text(url):
    page_class = 'story-body__inner'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")

    # remove unwanted parts by class
    try:
        soup.find('div', class_='social-embed-post social-embed-twitter').decompose()
        soup.find('div', class_='social-embed').decompose()   
        soup.find('a', class_='off-screen jump-link').decompose()
        soup.find('p', class_='off-screen').decompose()
        soup.find('a', class_='embed-report-link').decompose()
        soup.find('a', class_='story-body__link').decompose()
    except AttributeError:
        # find() returned None; note that the first miss skips the remaining removals
        pass

    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style', 'table', 'ul', 'h2', 'blockquote']):
        s.decompose()

    # use separator to separate paragraphs and subtitles!
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all( 'div', {'class': page_class})]

    text = '\n'.join(article_soup)
    text = codecs.encode(text, 'translit/one').encode('ascii', 'replace')  # transliterate, then replace any remaining non-ASCII characters
    text = u"{}".format(text)  # convert back to unicode

    print text
    return text

url = 'http://www.bbc.co.uk/news/world-us-canada-41724827'
get_text(url)

Answer 1

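You don't need regex.

Since you only want the main content of the news article (not even the headings, as you remove the h2 tags in your code), it's easier to find all the p elements first and then filter out the ones you don't need.

The three things you want to remove are:

Details of the newsreader: these are contained inside a strong tag within the paragraph. As far as I've seen, no other paragraphs contain a strong element.
References to other articles: these start with "Read more: " followed by a link. Luckily, there is a fixed string before the a element in such paragraphs, so you don't need regex. You can simply use p.find(text='Read more: ').
Text from Twitter posts: these don't appear in the web browser. After each Twitter image embedded in the page there is a p element that contains the text "End of Twitter post by @some_twitter_id". Obviously you don't want this.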
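
To make the three checks concrete, here is a minimal sketch that runs them against a small made-up fragment. The HTML, the reporter name and the helper name is_body_paragraph are assumptions for illustration only, not the real BBC markup; it also assumes the "End of Twitter post" paragraph carries the off-screen class that the question's code removes. It uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure described above (not real BBC markup).
SAMPLE_HTML = """
<div class="story-body__inner">
  <p><strong>By Jane Doe, London</strong></p>
  <p>First paragraph of the story.</p>
  <p>Read more: <a href="/news/example">Another article</a></p>
  <p>A paragraph with an <a href="/news/other">inline link</a> that should stay.</p>
  <p class="off-screen">End of Twitter post by @some_twitter_id</p>
</div>
"""


def is_body_paragraph(p):
    """Return True if the <p> tag looks like ordinary article text."""
    # 1. Reporter details sit inside a <strong> tag.
    has_byline = p.find('strong') is not None
    # 2. "Read more: " is a fixed string directly followed by a link.
    is_read_more = (p.find(string='Read more: ') is not None
                    and p.find('a') is not None)
    # 3. Hidden "End of Twitter post" paragraphs (assumed to use the off-screen class).
    is_hidden = 'off-screen' in p.get('class', [])
    return not (has_byline or is_read_more or is_hidden)


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
body = soup.find('div', class_='story-body__inner')
kept = [p.get_text().strip() for p in body.find_all('p') if is_body_paragraph(p)]
print('\n'.join(kept))
```

Only the two ordinary paragraphs should survive, including the one with an inline link, because a link alone is not enough to mark a paragraph as a "Read more" reference.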

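Edit

The main news content is found in a single div with the class story-body__inner.

I've updated the code to fix the problem of paragraphs that contain links not being printed. The second condition has to join its two checks with or, so that a paragraph is dropped only when it contains both the 'Read more: ' text and a link. I also added another condition, and not (p.has_attr('dir')), because the paragraphs that contain Twitter posts have a dir attribute.

paragraphs = soup.find('div', {'class': 'story-body__inner'}).findAll('p')
for p in paragraphs:
    if p.find('strong') is None \
       and (p.find(text='Read more: ') is None or p.find('a') is None) \
       and not (p.has_attr('class') and 'off-screen' in p['class']) \
       and not (p.has_attr('dir')):
        print(p.text.strip())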
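
If the text has already been flattened to a plain string and the tags are gone, a regex pass is still possible as a fallback. The following is only a sketch under assumptions: the sample text, the names BYLINE, READ_MORE and strip_patterns are made up, and it assumes each pattern occupies a single line of its own. With get_text(separator="\n") the link text may end up on a separate line from "Read more:", in which case this would not catch it, which is one more reason to filter on the p elements instead.

```python
import re

# Hypothetical extracted text; the name and URL are invented for illustration.
raw = """By Jane Doe, London
First paragraph of the story.
Read more: http://www.example.com/another-article
Second paragraph of the story."""

# A whole line of the form "By <Name>, <City>". This can also match a very
# short real sentence with the same shape, so treat it as a heuristic.
BYLINE = re.compile(r'^By [A-Z][^,\n]*, [A-Z][^,\n]*$')
# A whole line that starts with "Read more:".
READ_MORE = re.compile(r'^Read more:.*$')


def strip_patterns(text):
    """Drop lines that match the byline or 'Read more:' patterns."""
    lines = text.split('\n')
    return '\n'.join(line for line in lines
                     if not BYLINE.match(line) and not READ_MORE.match(line))


print(strip_patterns(raw))
```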


Beautiful Soup: cleaning up text after removing specific patterns
