Question

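I thought using find_all would give me all the paragraphs on the page, but the code below only selects the first one. I'm pretty sure I'm missing something very obvious... I'd appreciate your help!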

My code:

from bs4 import BeautifulSoup
import requests

def get_text(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")

    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style', 'table']):
        s.decompose()

    # use separator to separate paragraphs and subtitles!
    article_soup = [s.get_text(separator=" ", strip=True) for s in soup.find_all( 'p', {'class': 'speakable'})]

    text = ' '.join(article_soup)
    print(text)

url = 'http://money.cnn.com/2017/06/22/news/paris-air-show-boeing-airbus/index.html'
get_text(url)

Answer 1

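The article_soup list has 2 items, because that is the number of <p class="speakable"> tags on the page, so text contains only the first two paragraphs. If you want the full article, you have to get all the <p> elements inside the <div id="storytext"> tag.
You can fix this by slightly modifying the code in the article_soup comprehension: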

article_soup = [ 
    s.get_text(separator=" ", strip=True) 
    for s in soup.find('div', {'id':'storytext'}).find_all('p')
]
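
To see the difference without fetching the live page, here is a minimal sketch on a made-up HTML snippet. The SAMPLE_HTML string and the extract_paragraphs helper are assumptions for illustration, not part of the original page or code. It uses the built-in "html.parser" so lxml is not required, and it guards against soup.find returning None when the container is missing (which would otherwise raise an AttributeError on .find_all).

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article page:
# only two of the paragraphs carry class="speakable".
SAMPLE_HTML = """
<div id="storytext">
  <p class="speakable">First paragraph.</p>
  <p class="speakable">Second paragraph.</p>
  <p>Third paragraph.</p>
  <p>Fourth paragraph.</p>
</div>
<p>Footer text outside the article.</p>
"""


def extract_paragraphs(html, container_id="storytext"):
    """Return the text of every <p> inside the <div> with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"id": container_id})
    if container is None:
        # find() returns None when nothing matches; return an empty list
        # instead of failing on None.find_all(...).
        return []
    return [p.get_text(separator=" ", strip=True) for p in container.find_all("p")]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Filtering on the class matches only the tagged paragraphs.
speakable_only = [
    p.get_text(strip=True) for p in soup.find_all("p", {"class": "speakable"})
]

# Scoping to the container picks up every paragraph of the article body.
all_in_container = extract_paragraphs(SAMPLE_HTML)

print(len(speakable_only))    # 2
print(len(all_in_container))  # 4
```

So find_all does return every match; the class filter was simply narrower than the article body.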

BeautifulSoup find_all only selects the first paragraph
