Why am I not getting any output or a web scraping error?

Asked: 2018-11-22 11:30:11

Tags: python web-scraping beautifulsoup python-requests google-colaboratory

I am web scraping with BeautifulSoup and requests on Google Colab. Here I am just scraping the headlines from Google News. Below is the code:

import requests
from bs4 import BeautifulSoup

def beautiful_soup(url):
    '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT 
       INTO SOMETHING THAT IS EASY TO READ'''

    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    print(soup.prettify())

beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')

for headlines in soup.find_all('a', {'class': 'VDXfz'}):
    print(headlines.text)

The problem is that when I run the cell, it shows neither the output (the list of headlines) nor an error. Please help, it has been bothering me for 2 days.

2 answers:

Answer 0 (score: 2):

You probably need to display the text from the next span element. That can be done like this:

import requests
from bs4 import BeautifulSoup

def beautiful_soup(url):
    '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT 
       INTO SOMETHING THAT IS EASY TO READ'''

    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    #print(soup.prettify())
    return soup

soup = beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')

for headlines in soup.find_all('a', {'class': 'VDXfz'}):
    print(headlines.find_next('span').text)

This will start giving you output like the following:

I Take Back My Comment, Says Ram Madhav After Omar Abdullah’s Dare to Prove Pakistan Charge
Ram Madhav Backpedals On "Instruction From Pak" After Omar Abdullah Dare
National Conference backed PDP to save J&K from uncertainty: Omar Abdullah
On Ram Madhav ‘instruction from Pak’ barb, Omar Abdullah’s stinging reply
Make public reports of horse-trading in govt formation in J-K: Omar Abdullah to Guv

You can write the headlines to a CSV-format file using the following approach:

import requests
from bs4 import BeautifulSoup
import csv

def beautiful_soup(url):
    '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT 
       INTO SOMETHING THAT IS EASY TO READ'''

    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    return soup

soup = beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')

with open('output.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['Headline'])

    for headlines in soup.find_all('a', {'class': 'VDXfz'}):
        headline = headlines.find_next('span').text
        print(headline)
        csv_output.writerow([headline])

Currently this produces just a single column, called Headline.
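As a quick sanity check that needs no network access, the same csv logic can be exercised with a couple of hypothetical headlines standing in for the scraped ones (a minimal sketch; the file name output.csv matches the snippet above):

```python
import csv

# Hypothetical headlines standing in for the scraped ones.
headlines = ['Headline one', 'Headline two']

with open('output.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['Headline'])    # header row
    for headline in headlines:
        csv_output.writerow([headline])  # one row per headline

# Read the file back to confirm its structure.
with open('output.csv', newline='', encoding='utf-8') as f_input:
    rows = list(csv.reader(f_input))

print(rows)  # [['Headline'], ['Headline one'], ['Headline two']]
```

The `newline=''` argument matters: without it, the csv module can emit blank lines between rows on Windows.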

Answer 1 (score: 0):

Run the following script and you should get the required results. The script would be more concise if you used selectors.

However, using .find_all():

import requests
from bs4 import BeautifulSoup

def get_headlines(url):
    request = requests.get(url)
    soup = BeautifulSoup(request.text,"lxml")
    headlines = [item.find_next("span").text for item in soup.find_all("h3")]
    return headlines

if __name__ == '__main__':
    link = 'https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en'
    for titles in get_headlines(link):
        print(titles)

To do the same using .select(), make the following changes in the script:

headlines = [item.text for item in soup.select("h3 > a > span")]
return headlines
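To compare the .find_all() and .select() routes without hitting Google News, both can be run against a small static fragment that mimics the markup the answers target (a hypothetical snippet; the class name VDXfz and the h3 > a > span structure are taken from the answers above):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the Google News markup used above.
html = '''
<h3><a class="VDXfz" href="#"><span>First headline</span></a></h3>
<h3><a class="VDXfz" href="#"><span>Second headline</span></a></h3>
'''

soup = BeautifulSoup(html, "html.parser")  # stdlib parser, no lxml needed

# Route 1: .find_all() plus .find_next(), as in the answers above.
via_find_all = [a.find_next("span").text
                for a in soup.find_all("a", {"class": "VDXfz"})]

# Route 2: the equivalent CSS selector.
via_select = [span.text for span in soup.select("h3 > a > span")]

print(via_find_all)                # ['First headline', 'Second headline']
print(via_select == via_find_all)  # True
```

Either way, note that class names like VDXfz are auto-generated and change whenever Google redeploys the page, so a selector based on the surrounding structure tends to be less brittle.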