I am doing web scraping on Google Colab with BeautifulSoup and requests. Here I am only scraping the headlines from Google News. This is the code:
import requests
from bs4 import BeautifulSoup

def beautiful_soup(url):
    '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT
    INTO SOMETHING THAT IS EASY TO READ'''
    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    print(soup.prettify())

beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')
for headlines in soup.find_all('a', {'class': 'VDXfz'}):
    print(headlines.text)
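The problem is that when I run the cell, it shows neither the output (the list of headlines) nor an error. Please help; this has been bothering me for two days.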
Answer 0 (score: 2)
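You probably need to display the text of the next span element rather than the text of the a element itself. Note also that the function has to return soup so that it can be used outside the function. This can be done as follows: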
import requests
from bs4 import BeautifulSoup

def beautiful_soup(url):
    '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT
    INTO SOMETHING THAT IS EASY TO READ'''
    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    #print(soup.prettify())
    return soup

soup = beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')
for headlines in soup.find_all('a', {'class': 'VDXfz'}):
    print(headlines.find_next('span').text)
This should get you started, with output such as the following:
I Take Back My Comment, Says Ram Madhav After Omar Abdullah’s Dare to Prove Pakistan Charge
Ram Madhav Backpedals On "Instruction From Pak" After Omar Abdullah Dare
National Conference backed PDP to save J&K from uncertainty: Omar Abdullah
On Ram Madhav ‘instruction from Pak’ barb, Omar Abdullah’s stinging reply
Make public reports of horse-trading in govt formation in J-K: Omar Abdullah to Guv
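The behaviour of find_next('span') can be tried offline on a small hand-written fragment instead of the live page. The sketch below is only an illustration: the HTML in SAMPLE_HTML and the helper name make_soup are assumptions, and only the class name VDXfz is taken from the code above. It uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; only the class name 'VDXfz' comes from the
# code above. The real Google News page may be structured differently.
SAMPLE_HTML = """
<article>
  <a class="VDXfz" href="./articles/1"></a>
  <h3><a href="./articles/1"><span>First headline</span></a></h3>
</article>
<article>
  <a class="VDXfz" href="./articles/2"></a>
  <h3><a href="./articles/2"><span>Second headline</span></a></h3>
</article>
"""

def make_soup(html):
    """Parse an HTML string and return the soup so callers can query it."""
    # "html.parser" ships with Python, so this sketch does not need lxml.
    return BeautifulSoup(html, "html.parser")

soup = make_soup(SAMPLE_HTML)

# In this fragment the matched <a> tags contain no text of their own.
empty_texts = [a.text for a in soup.find_all("a", {"class": "VDXfz"})]

# find_next("span") moves forward in the document to the first <span>
# that appears after each matched <a>.
headlines = [a.find_next("span").text
             for a in soup.find_all("a", {"class": "VDXfz"})]

print(empty_texts)
print(headlines)
```

In this fragment the a tags are empty, so printing their .text yields blank lines, while the following span holds the headline text.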
You can write the headlines to a CSV file as follows:
import requests
from bs4 import BeautifulSoup
import csv

def beautiful_soup(url):
    '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT
    INTO SOMETHING THAT IS EASY TO READ'''
    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    return soup

soup = beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')

with open('output.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['Headline'])
    for headlines in soup.find_all('a', {'class': 'VDXfz'}):
        headline = headlines.find_next('span').text
        print(headline)
        csv_output.writerow([headline])
Currently, this produces only a single column called Headline.
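If more columns are wanted later, csv.writer accepts a longer list per row. The following is a minimal sketch under assumptions: the rows are hypothetical placeholder values rather than scraped data, Link is an assumed column name, and an in-memory buffer stands in for output.csv so the example has no side effects.

```python
import csv
import io

# Hypothetical (headline, link) pairs standing in for scraped values.
rows = [
    ("First headline", "./articles/1"),
    ("Second headline", "./articles/2"),
]

# An in-memory buffer keeps the sketch free of side effects; with a real file,
# open it with newline='' and encoding='utf-8' as in the script above.
buffer = io.StringIO(newline="")
csv_output = csv.writer(buffer)
csv_output.writerow(["Headline", "Link"])  # 'Link' is an assumed column name
csv_output.writerows(rows)                 # one CSV line per tuple

# Read the text back to confirm the header and the two-column layout.
parsed = list(csv.reader(io.StringIO(buffer.getvalue(), newline="")))
print(parsed)
```

Each tuple becomes one CSV line, so adding a column only means adding one more value per row and one more name in the header.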
Answer 1 (score: 0)
Run the following script and you should get the desired result. The script becomes more concise if you use selectors.
First, however, using .find_all():
import requests
from bs4 import BeautifulSoup

def get_headlines(url):
    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    headlines = [item.find_next("span").text for item in soup.find_all("h3")]
    return headlines

if __name__ == '__main__':
    link = 'https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en'
    for titles in get_headlines(link):
        print(titles)
To do the same with .select(), make the following change inside get_headlines():
    headlines = [item.text for item in soup.select("h3 > a > span")]
    return headlines
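The two approaches can be compared without a network request. This is a rough sketch on hand-written markup: the content of SAMPLE_HTML is an assumption that mirrors the h3 > a > span nesting the selector expects, and the live page may not match it.

```python
from bs4 import BeautifulSoup

# Hand-written markup for illustration; the live page may differ.
SAMPLE_HTML = """
<h3><a href="./articles/1"><span>Alpha story</span></a></h3>
<h3><a href="./articles/2"><span>Beta story</span></a></h3>
"""

# "html.parser" is the parser bundled with Python, so lxml is not needed here.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Approach 1: find every <h3>, then step forward to the next <span>.
via_find_all = [item.find_next("span").text for item in soup.find_all("h3")]

# Approach 2: a CSS selector that requires the exact parent/child chain.
via_select = [item.text for item in soup.select("h3 > a > span")]

print(via_find_all)
print(via_select)
```

One difference worth keeping in mind: the selector matches only a span that is a direct child of an a that is a direct child of an h3, whereas find_next("span") returns whichever span comes next in the document, even one outside the h3.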