I'm trying to extract stories from nbcnews.com. I currently have the following code:
import urllib2
from bs4 import BeautifulSoup
# The page that I'm getting stories from
url = 'http://www.nbcnews.com/'
data = urllib2.urlopen(url)
soup = BeautifulSoup(data, 'html.parser')
#This is the tag and class that chrome told me "top stories" are stored in
this = soup.find_all('div', attrs={"class": "col-sm-6 col-md-8 col-lg-9"})
#Get the a tags in the previous tags (this is the part that returns FAR too many links)
link = [a for i in this for a in i.find_all('a')]
#Get the titles (This works)
title = [a.get_text() for i in link for a in i.find_all('h3')]
#The below strips all newlines and tabs from the title name
newtitle = []
for i in title:
s = ' '.join(i.split())
if s in newtitle:
pass
else:
newtitle.append(s)
print len(link)
print len(title)
When I run the script, the "title" list is (mostly) correct; a few of the title names differ slightly from what's shown on the site, but that's not a problem as long as they're close.
My problem is that the "link" list seems to contain links from all over the page. Can someone help me with this?
Alternatively, is there an API available for this kind of thing? I'd really rather not reinvent news aggregation if I can avoid it.
Edit: fixed a typo in a variable name
Answer 0 (score: 1)
Looking at the page in question, it seems that all of the news stories live in h3 tags with the class item-heading. You can use BeautifulSoup to select all of the story headings, then use BeautifulSoup's .parent method to step up the HTML tree and access the a href they are wrapped in:
In [54]: [i.parent.attrs["href"] for i in soup.select('a > h3.item-heading')]
Out[54]:
['/news/us-news/civil-rights-groups-fight-trump-s-refugee-ban-uncertainty-continues-n713811',
 '/news/us-news/protests-erupt-nationwide-second-day-over-trump-s-travel-ban-n713771',
 '/politics/politics-news/some-republicans-criticize-trump-s-immigration-order-n713826',
 ... # trimmed for readability
]
I've used a list comprehension here, but you can break it down into separate steps:
# select all `h3` tags with the matching class that are contained within an `a` link.
# This excludes any random links elsewhere on the page.
story_headers = soup.select('a > h3.item-heading')
# Iterate through all the matching `h3` items and access their parent `a` tag.
# Then, within the parent you have access to the `href` attribute.
list_of_links = [i.parent.attrs for i in story_headers]
# Finally, extract the links into a tidy list
links = [i["href"] for i in list_of_links]
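To see the whole pipeline end to end, here is a minimal, self-contained sketch using a toy HTML snippet instead of the live page (the markup and URLs below are invented for illustration; the real page's structure may differ):

```python
from bs4 import BeautifulSoup

# Toy HTML mimicking the structure described above: story headings are
# h3.item-heading tags nested directly inside their a links.
html = """
<div>
  <a href="/news/story-one"><h3 class="item-heading">Story One</h3></a>
  <a href="/news/story-two"><h3 class="item-heading">Story Two</h3></a>
  <a href="http://example.com/ad"><h3 class="other">Not a story</h3></a>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Select only h3 tags with class item-heading that are direct children of a tags.
story_headers = soup.select('a > h3.item-heading')

# Step up to each heading's parent a tag and pull out its href.
links = [h.parent.attrs["href"] for h in story_headers]
print(links)  # ['/news/story-one', '/news/story-two']
```

Note that the third link is excluded because its h3 lacks the item-heading class, which is exactly how the selector filters out unrelated links elsewhere on the page.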
Once you have your list of links, you can iterate over it and check whether the first character is / to match only local links rather than external ones.
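That filtering step could look something like the sketch below (shown in Python 3 syntax for clarity, even though the question's code is Python 2; the sample hrefs are invented for illustration). urljoin from the standard library resolves the site-relative paths into full URLs:

```python
from urllib.parse import urljoin

# Hypothetical sample of extracted hrefs: a mix of local and external links.
links = [
    '/news/us-news/some-story',
    'http://ads.example.com/banner',
    '/politics/another-story',
]

# Keep only local links (those starting with '/') and resolve them
# against the site's base URL.
base = 'http://www.nbcnews.com/'
local = [l for l in links if l.startswith('/')]
full_urls = [urljoin(base, l) for l in local]
print(full_urls)
# ['http://www.nbcnews.com/news/us-news/some-story',
#  'http://www.nbcnews.com/politics/another-story']
```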