I want to scrape article titles from a website, but no results are shown

Time: 2019-08-14 08:12:54

Tags: python web-scraping

I want to scrape the headlines of news stories from the New York Times website and add them to a list, but the result is an empty list.

When I pass only 'a' in the soup.findAll line, it works fine (it prints all the links), but when I change it to filter by class, it stops working.

import requests
from bs4 import BeautifulSoup

def get_titles():

    tlist = []
    url = 'https://www.nytimes.com/'
    get_link = requests.get(url)
    get_link_text = get_link.text
    soup = BeautifulSoup(get_link_text,'html.parser')
    for row in soup.findAll('h2', {'class': 'balancedHeadline'}):
        tlist.append(row)

    print(tlist)

get_titles()

1 answer:

Answer 0 (score: 1)

The page is rendered dynamically by JavaScript, so you have to use selenium to scrape it.

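Also, the h2 heading has no class named balancedHeadline; that class is on a span nested inside the h2, so you have to select the span inside the h2.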
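Both points can be checked on an HTML string before changing the scraper. The following is a minimal sketch using only the standard library; `ClassFinder`, `tags_with_class`, and `SAMPLE_HTML` are hypothetical names, and the sample fragment only imitates the structure described above rather than the real page.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Collect the names of tags whose class attribute contains a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.tags = []  # tag names carrying the class, in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted_class in classes:
            self.tags.append(tag)


def tags_with_class(html, wanted_class):
    """Return the tag names that carry wanted_class in the given HTML string."""
    finder = ClassFinder(wanted_class)
    finder.feed(html)
    finder.close()
    return finder.tags


# Hypothetical fragment imitating the structure described in this answer.
SAMPLE_HTML = (
    '<h2 class="esl82me0">'
    '<span class="balancedHeadline">Example headline</span>'
    '</h2>'
)

print(tags_with_class(SAMPLE_HTML, "balancedHeadline"))  # ['span']
print(tags_with_class(SAMPLE_HTML, "missingClass"))      # []
```

Passing the text returned by requests.get(url) to `tags_with_class` shows which tags, if any, carry the class in the raw response. An empty list there, combined with a non-empty list for the browser's page_source, would be consistent with the content being added by JavaScript.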

Try this:

from bs4 import BeautifulSoup
from selenium import webdriver

def get_titles():

    tlist = []
    url = 'https://www.nytimes.com/'
    browser = webdriver.Firefox()
    try:
        # Let the browser run the page's JavaScript before reading the HTML
        browser.get(url)
        soup = BeautifulSoup(browser.page_source, 'html.parser')
    finally:
        browser.quit()  # always close the browser
    # The headline text sits in a span inside each h2
    for row in soup.find_all('h2', {'class': 'esl82me0'}):
        spantext = row.find('span', {'class': 'balancedHeadline'})
        if spantext:
            tlist.append(spantext.text)

    print(tlist)

get_titles()

Result:

[
'U.S. Delays Some China Tariffs Until Stores Stock Up for Holidays',
'After a Chaotic Night of Protests, Calm at Hong Kong Airport, for Now',
'Guards at Jail Where Epstein Died Were Sleeping, Officials Say',
'How a Trump Ally Tested the Boundaries of Washington’s Influence Game',
'‘Juul-alikes’ Are Filling Shelves With Sweet, Teen-Friendly Nicotine Flavors',
'A Boom Time for the Bunker Business and Doomsday Capitalists',
'Introducing The 1619 Project'
]
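The nested lookup can also be written as a single CSS selector, which avoids depending on the h2 class name (esl82me0 looks auto-generated and may change, though that is an assumption). This is a minimal sketch on a hypothetical static fragment, not the real page; `span_headlines` and `SAMPLE_HTML` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one headline wrapped in a span, one without.
SAMPLE_HTML = """
<h2 class="esl82me0"><span class="balancedHeadline">First headline</span></h2>
<h2 class="esl82me0">Second headline</h2>
"""


def span_headlines(html):
    """Return the text of every span.balancedHeadline nested inside an h2."""
    soup = BeautifulSoup(html, "html.parser")
    # "h2 span.balancedHeadline" matches spans with that class inside any h2
    return [span.get_text(strip=True)
            for span in soup.select("h2 span.balancedHeadline")]


print(span_headlines(SAMPLE_HTML))  # ['First headline']
```

Like the loop above, this only returns headlines that are wrapped in a span; an h2 whose text is not inside a span is skipped.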

Edit:

I had not noticed the headlines that have no span, so I tested the following version, which finds all the headlines:

Code:

from bs4 import BeautifulSoup
from selenium import webdriver

def get_titles():

    tlist = []
    url = 'https://www.nytimes.com/'
    browser = webdriver.Firefox()
    try:
        # Let the browser run the page's JavaScript before reading the HTML
        browser.get(url)
        soup = BeautifulSoup(browser.page_source, 'html.parser')
    finally:
        browser.quit()  # always close the browser
    for row in soup.find_all('h2', {'class': 'esl82me0'}):
        span = row.find('span', {'class': 'balancedHeadline'})
        if span:
            tlist.append(span.text)
        else:
            # No span inside this h2: fall back to the h2's own text
            tlist.append(row.text)

    print(tlist)

get_titles()

Result:

['Your Wednesday Briefing',
 'Listen to ‘The Daily’',
 'The Book Review Podcast',
 'U.S. Delays Some China Tariffs Until Stores Stock Up for Holidays',
 'While visiting a chemical plant, Mr. Trump railed against China, former '
 'President Barack Obama and the news media.',
 'Two counties in California filed a lawsuit to block the administration’s new '
 'green card “wealth” test.',
 'After a Chaotic Night of Protests, Calm at Hong Kong Airport, for Now',
 'Protesters apologized after scenes of violence and disorder at the airport.',
 'Guards at Jail Where Epstein Died Were Sleeping, Officials Say',
 'How a Trump Ally Tested the Boundaries of Washington’s Influence Game',
 'Here are four takeaways from our report on Mr. Broidy.',
 '‘Juul-alikes’ Are Filling Shelves With Sweet, Teen-Friendly Nicotine Flavors',
 'A Boom Time for the Bunker Business and Doomsday Capitalists',
 'The Cold Truth About the Jeffrey Epstein Case',
 '‘My Name Is Darlin. I Just Came Out of Detention.’',
 'Trump and Xi Sittin’ in a Tree',
 'This Drug Will Save Children’s Lives. It Costs $2 Million.',
 'The Battle for Hong Kong Is Being Fought in Sydney and Vancouver',
 'No Need to Deport Me. This Dreamer’s Dream Is Dead.',
 'Threats to Animals: Pesticides. Pollution. President Trump.',
 'Jeffrey Epstein and When to Take Conspiracies Seriously',
 'Why Trump Fears Women of Color',
 'The Religious Hunger of the Radical Right',
 'No, I Won’t Sign Your Petition',
 'Introducing The 1619 Project',
 'A Surfing Adventure in … Ireland?',
 'When the Creepy Carnival Comes to Town']