Getting a list of Google News links

Date: 2017-06-02 17:10:55

Tags: python

I'm using Python's BeautifulSoup to collect Google News links into a list. This is what I have so far:

import requests
from bs4 import BeautifulSoup
import re

# url is just some Google link, not too worried about being able to search from Python code
url = "https://www.google.com.mx/search?biw=1526&bih=778&tbm=nws&q=amazon&oq=amazon&gs_l=serp.3..0l10.1377.2289.0.2359.7.7.0.0.0.0.116.508.5j1.6.0....0...1.1.64.serp..1.5.430.0.19SoRsczxCA"
# this part of the code avoids error 403; we need to identify ourselves
browser = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': browser}
# getting our html
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, "lxml")
# looking for links and adding them up as a list
links = soup.findAll("a")
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    list = re.split(r":(?=http)", link["href"].replace("/url?q=", ""))
    print(list)

My question is: why do some of the links not work? For example:

Forbes El Financiero El Mundo Cnet

2 Answers:

Answer 0 (score: 0)

This code should work:

import requests
from bs4 import BeautifulSoup
import re

url = "https://www.google.com.mx/search?biw=1526&bih=778&tbm=nws&q=amazon&oq=amazon&gs_l=serp.3..0l10.1377.2289.0.2359.7.7.0.0.0.0.116.508.5j1.6.0....0...1.1.64.serp..1.5.430.0.19SoRsczxCA"
browser = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
headers = {'User-Agent': browser}
# pass the headers so the request actually identifies itself
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, "lxml")

l = []

# collect every result link, stripping Google's /url?q= redirect prefix
for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
    l.append(re.split(r":(?=http)", link["href"].replace("/url?q=", ""))[0])

print(l)

A few notes:

  • Never use list as a variable name! It shadows the built-in list type.
  • If you want the links, you should append them to your list instead of overwriting the variable on every iteration. Use the list.append method for that.
  • re.split returns a list, so you should take its first element (that is why I use [0]; see the short demonstration after this list).
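
To make the last two notes concrete, here is a minimal, runnable sketch. The href value is a made-up example in the shape Google wraps result links with, and the urllib.parse variant at the end is a suggested alternative, not part of the original answer:

import re
from urllib.parse import urlparse, parse_qs

# hypothetical href in the shape Google wraps result links with
href = "/url?q=https://www.example.com/story&sa=U&ved=0ahUKEwi"

# the answer's approach: strip the /url?q= prefix, then split on ":" followed by "http"
parts = re.split(r":(?=http)", href.replace("/url?q=", ""))
print(parts)     # ['https://www.example.com/story&sa=U&ved=0ahUKEwi']
print(parts[0])  # the first element is the target URL (still carrying Google's extra params)

# alternative: parse the q parameter directly, which also drops the trailing params
query = parse_qs(urlparse(href).query)
print(query["q"][0])  # https://www.example.com/story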

Answer 1 (score: 0)

All the links you mention show a "404 Page Not Found" error when opened in a browser, so those links are broken or dead. You can refer to this wiki.

Before parsing the page content with BeautifulSoup, you need to check the URL's response status code:

...
page = requests.get(url)
# only parse the page when the request succeeded (HTTP 200)
if page.status_code == requests.codes.ok:
    soup = BeautifulSoup(page.content, "lxml")
    ....
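
Building on that point, here is a small sketch that filters the extracted links themselves, keeping only the ones that still respond. The candidate URLs are placeholders, and the HEAD-then-check approach is a suggestion rather than something from the original answer:

import requests

def alive(url, timeout=5):
    # a HEAD request is cheap; some servers reject HEAD, so treat any error as "dead"
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        return resp.status_code == requests.codes.ok
    except requests.RequestException:
        return False

# placeholder list standing in for the links scraped above
candidates = ["https://www.example.com/story", "https://www.example.com/dead-page"]
working = [u for u in candidates if alive(u)]
print(working)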