Question

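I realise this may be a very specific problem, but I'm struggling to get rid of some of the text I get with the code below. I need just the plain article text, which I locate by finding the 'p' tags with 'class': 'mol-para-with-font'. Somehow I also get a lot of other stuff, such as the author's byline, the date stamp and, most importantly, text from the ads on the page. Inspecting the HTML, I can't see that those elements carry the same 'class': 'mol-para-with-font', so I'm confused (or maybe I've just been staring at it for too long...). I know there are plenty of HTML gurus here, so I'd appreciate your help.

My code: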

import requests
import translitcodec
import codecs
from bs4 import BeautifulSoup

def get_text(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")

    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style', 'table']):
        s.decompose()

    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all( ['p', {'class':'mol-para-with-font'}])]    
    article = '\n'.join(article_soup)

    text = codecs.encode(article, 'translit/one').encode('ascii', 'replace') #replace traslit with ascii
    text = u"{}".format(text) #encode to unicode
    print text

url = 'http://www.dailymail.co.uk/femail/article-4703718/How-Alexander-McQueen-Kate-s-royal-tours.html'
get_text(url)

Answer 1

Do you want only the p tags with class="mol-para-with-font"? Passing the tag name and the class to find_all as separate arguments, rather than together in one list, will give you just those.
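The original snippet for this answer did not survive, so what follows is only a minimal sketch of that kind of selection, not the answerer's exact code. It assumes Python 3 and the bs4 package. SAMPLE_HTML and extract_article are hypothetical names made up for illustration, and the sample markup is a stand-in rather than the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article page; the real markup may differ.
SAMPLE_HTML = """
<div>
  <p class="byline">By A. Writer</p>
  <p class="mol-para-with-font">First paragraph.</p>
  <p>Advertisement text</p>
  <p class="mol-para-with-font">Second paragraph.</p>
</div>
"""

def extract_article(html):
    """Return the text of the <p class="mol-para-with-font"> tags only."""
    soup = BeautifulSoup(html, "html.parser")
    # The tag name and the class filter are two separate arguments here,
    # so a <p> without the class (byline, ad) is not matched.
    paragraphs = soup.find_all("p", class_="mol-para-with-font")
    return "\n".join(p.get_text(separator="\n", strip=True) for p in paragraphs)

article = extract_article(SAMPLE_HTML)
print(article)
```

In the question's code, 'p' and the class dictionary sit inside a single list, and find_all treats a list in that position as a set of alternative tag names, which would explain why every p on the page came back.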

BeautifulSoup: excluding unwanted parts
