Question

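This may look like a duplicate of other questions about scraping content from news.google.com, but it isn't: those questions only ask for the whole HTML code, whereas I want the URLs of the articles.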

I'm trying to create two functions that scrape news from news.google.com, or fetch news based on what the user enters. For example:

>>> news top
> <5 url of top stories in news.google.com>

or

>>> news london
> <5 london related news url from news.google.com>

Here is the code I'm working on (I'm not very familiar with scraping/requests, so I don't know how to proceed):

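import requests
from lxml import html

def get_news(user_define_input):
    try:
        # Send the search term as q (it was hard-coded to "test" before)
        response = requests.get(
            "https://www.google.com/search",
            params={"hl": "en", "gl": "us", "tbm": "nws", "authuser": "0", "q": user_define_input[1]},
        )
    except requests.RequestException:
        print("Error while retrieving data!")
        return
    tree = html.fromstring(response.text)
    # /text() selects text nodes only, so no URLs come back here
    news = tree.xpath("//div[@class='l _HId']/text()")
    print(news)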

I do know that /text() does not get the URL, but I don't know how to get it, hence the question.

If you like, you can add this to make the output look nicer:

news = "<anything>".join(news)

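To clarify the question: user_define_input[0] will be "news" from the user's input, and user_define_input[1] will be the search term, e.g. "london", so all results should be related to London. If you're willing to take the time to write my other function, which gets all the top stories from news.google.com, thank you very much! :) (It should be similar code, so I won't post anything related to it here.)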
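To make the input handling concrete, here is a minimal sketch of turning input such as "news london" into a search URL. The helper name build_news_url is hypothetical, the query parameters are copied from the URL used in the question, and treating everything after the command word as the search term is an assumption.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.google.com/search"  # endpoint taken from the question


def build_news_url(user_define_input):
    """Build a news-search URL from input such as ["news", "london"]."""
    # Assumption: everything after the command word is the search term.
    query = " ".join(user_define_input[1:])
    params = {"hl": "en", "gl": "us", "tbm": "nws", "q": query}
    # urlencode escapes spaces and other special characters in the term.
    return SEARCH_URL + "?" + urlencode(params)


print(build_news_url("news london".split()))
print(build_news_url("news new york".split()))
```

Building the query string with urlencode rather than string concatenation keeps multi-word searches valid.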

Code after the help (still not working):

import requests
from lxml import html

def get_news(user_define_input):
    try:
        # Send the search term as q (it was hard-coded to "test" before)
        response = requests.get(
            "https://www.google.com/search",
            params={"hl": "en", "gl": "us", "tbm": "nws", "authuser": "0", "q": user_define_input[1]},
        )
    except requests.RequestException:
        print("Error while retrieving data!")
        return
    tree = html.fromstring(response.text)
    # Article URLs: the href attribute of each headline link
    url_to_news = tree.xpath(".//div[@class='esc-lead-article-title-wrapper']/h2[@class='esc-lead-article-title']/a/@href")
    for url in url_to_news:
        print(url)
    # Snippet text and headline text
    summary_of_the_new = tree.xpath(".//div[@class='esc-lead-snippet-wrapper']/text()")
    title_of_the_new = tree.xpath(".//span[@class='titletext']/text()")
    print(summary_of_the_new)
    print(title_of_the_new)

Answer 1

As I understand it, you want the URLs of all the news items that appear when the user enters a query, right?

To do that, you need this XPath expression:

url_to_news = tree.xpath(".//div[@class='esc-lead-article-title-wrapper']/h2[@class='esc-lead-article-title']/a/@href")

It returns a list containing the news URLs.

Since it is a list, all you need to iterate over the URLs is a for loop:

for url in url_to_news:
    print(url)
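To try the selection without hitting the network, here is a small sketch that uses only the standard library's xml.etree.ElementTree on a made-up, well-formed snippet. The sample markup, the example.com URLs, and the helper name extract_urls are all invented for illustration. ElementTree's XPath subset cannot end in /@href, so the attribute is read with .get("href") instead; with lxml, the expression above yields the strings directly.

```python
import xml.etree.ElementTree as ET

# Invented sample markup that mimics the class names used above.
SAMPLE = """
<div>
  <div class="esc-lead-article-title-wrapper">
    <h2 class="esc-lead-article-title">
      <a href="https://example.com/story-1"><span class="titletext">Story one</span></a>
    </h2>
  </div>
  <div class="esc-lead-article-title-wrapper">
    <h2 class="esc-lead-article-title">
      <a href="https://example.com/story-2"><span class="titletext">Story two</span></a>
    </h2>
  </div>
</div>
"""


def extract_urls(markup, limit=5):
    """Return up to `limit` headline URLs from well-formed markup."""
    root = ET.fromstring(markup.strip())
    # Select the <a> elements, then read their href attribute.
    links = root.findall(
        ".//div[@class='esc-lead-article-title-wrapper']"
        "/h2[@class='esc-lead-article-title']/a"
    )
    return [a.get("href") for a in links[:limit]]


for url in extract_urls(SAMPLE):
    print(url)
```

The limit parameter is there because the question asks for five URLs; slicing the list is enough for that.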

Extras:

To get the news summary, you need:

summary_of_the_new = tree.xpath(".//div[@class='esc-lead-snippet-wrapper']/text()")

And finally, the title of the news would be:

title_of_the_new = tree.xpath(".//span[@class='titletext']/text()")

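After that, you can map all of that information together. If you need further help, please comment on this answer. I answered the question as I understood it.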
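One possible way to do that mapping is sketched below. It assumes the three lists come back in the same order and with one entry per story, which may not hold on the real page; the function name combine_news and the sample values are made up.

```python
def combine_news(urls, titles, summaries):
    """Pair up three parallel lists into one dict per story."""
    # zip stops at the shortest list, so stories without a match are dropped.
    return [
        {"title": title.strip(), "url": url, "summary": summary.strip()}
        for url, title, summary in zip(urls, titles, summaries)
    ]


# Hypothetical values standing in for the three xpath results.
stories = combine_news(
    ["https://example.com/story-1"],
    [" Story one "],
    ["A short summary. "],
)
print(stories)
```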

Answer 2

Check out my implementation at http://mpand.github.io/gnp/

It returns the stories and URLs as JSON objects.
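The snippet below does not reproduce gnp's interface. It is only a rough sketch of what "stories and URLs as JSON" could look like using Python's standard json module; the function name stories_to_json, the field names, and the sample story are invented.

```python
import json


def stories_to_json(stories):
    """Serialize a list of story dicts to a JSON string."""
    # ensure_ascii=False keeps non-ASCII headlines readable in the output.
    return json.dumps({"stories": stories}, indent=2, ensure_ascii=False)


# Hypothetical story data, not real output from any service.
sample = [{"title": "Story one", "url": "https://example.com/story-1"}]
print(stories_to_json(sample))
```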
