Scraping NYTimes search results with Python

Time: 2014-08-14 01:40:01

Tags: python html web-scraping

I am trying to scrape search results from NYTimes. For example, I start my scraping process with this URL:

url = "http://query.nytimes.com/search/sitesearch/?action=click&contentCollection&region=TopBar&WT.nav=searchWidget&module=SearchSubmit&pgtype=Homepage#/%22big+data%22/30days/articles/1/allauthors/oldest/"

However, the HTML I download with Python does not contain any search results. Is there a way to access the HTML as if I had opened the link in a web browser?

If I open the link in a web browser, here is part of the HTML I can see with "Inspect Element":

<div class="searchResults" id="searchResults" style="display: none;">
    <ol class="searchResultsList flush" style="display: block;">
        <li class="story noThumb">
            <div class="element2">
                <h3>
                    <a href="http://www.nytimes.com/2014/07/16/technology/apple-and-ibm-in-broad-software-deal-for-businesses.html">Apple Joins With IBM on Business Software </a>
                </h3>
                <p class="summary">The applications, Mr. Cook said, will bring “<strong>big data</strong> analytics down to the fingertips” of Apple iPhone and iPad users in corporations. “IBM can&nbsp;...</p>
                <div class="storyMeta">
                    <span class="dateline">July 15, 2014</span> - 
                    <span class="byline">By BRIAN X. CHEN and STEVE LOHR</span> - 
                    <span class="section">Technology - article</span> - 
                    <span class="printHeadline">Print Headline: "Apple Joins With IBM on Business Software"</span>
                </div>
            </div>
        </li>
        <li class="story">

The ideal output would be:

<a href="http://www.nytimes.com/2014/07/16/technology/apple-and-ibm-in-broad-software-deal-for-businesses.html">Apple Joins With IBM on Business Software </a>

Thanks!

1 answer:

Answer 0: (score: 2)

The actual request that returns the search results is an XHR. Simulate it in Python.
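
Some background on why the downloaded page is empty: everything after the `#` in the search URL is a fragment, which an HTTP client never sends to the server. The search terms live in that fragment, so presumably the page's JavaScript reads them and then issues the XHR. A minimal sketch with the standard library's `urllib.parse` (Python 3 assumed), using a shortened form of the question's URL, shows the split:

```python
from urllib.parse import urlsplit, unquote_plus

# Shortened form of the URL from the question (query string trimmed for brevity).
page_url = (
    "http://query.nytimes.com/search/sitesearch/?action=click"
    "#/%22big+data%22/30days/articles/1/allauthors/oldest/"
)

parts = urlsplit(page_url)

# What an HTTP client actually requests: scheme, host, path and query only.
sent_to_server = parts._replace(fragment="").geturl()

# What stays on the client side: the fragment holding the search terms.
client_side_only = unquote_plus(parts.fragment)

print(sent_to_server)
print(client_side_only)
```

The server only ever sees the first value, which is why its response carries no results for "big data".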

Example using requests:

import requests

# JSON endpoint that the search page calls via XHR
url = 'http://query.nytimes.com/svc/cse/v2pp/sitesearch.json'
params = {
    'query': "big data",
    'date_range_lower': '30daysago',
    'pt': 'article',
    'sort_order': 'a'
}

response = requests.get(url, params=params)
data = response.json()
# Each result is a dict; 'og:url' holds the article URL
for result in data['results']['results']:
    print(result.get('og:url'))

Prints:

http://www.nytimes.com/2014/07/15/upshot/politically-18-year-olds-look-a-lot-like-people-in-their-20s.html
http://www.nytimes.com/2014/07/15/business/vw-to-add-suv-production-to-chattanooga-plant.html
http://www.nytimes.com/2014/07/15/business/media/germany-1-world-cup-fever-1000.html
http://www.nytimes.com/2014/07/15/business/international/winding-road-ahead-for-us-europe-trade-talks.html
http://www.nytimes.com/2014/07/15/business/daily-stock-market-activity.html
http://www.nytimes.com/2014/07/14/business/international/airlines-step-up-investment-to-meet-passenger-growth.html
http://www.nytimes.com/2014/07/15/business/international/eurozone-industrial-production-drops.html
http://www.nytimes.com/2014/07/14/business/international/airline-passengers-weigh-in-with-online-reviews.html
http://www.nytimes.com/2014/07/16/technology/a-deluge-of-comment-on-net-rules.html
http://www.nytimes.com/2014/07/16/upshot/as-growth-in-health-care-spending-slows-asking-if-a-trend-will-last.html
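
To get from these URLs to the `<a href=...>` form asked for in the question, the results can be turned into anchor tags. The sketch below is an illustration, not the answerer's code. It assumes the JSON layout used above (`data['results']['results']` with an `og:url` field), and `title_key` is a hypothetical parameter, since the answer does not show which field carries the headline. It runs on a small hand-made payload, not on a live response.

```python
from html import escape


def results_to_anchors(data, title_key="title"):
    """Build <a> tags from a sitesearch-style JSON payload.

    Assumes the layout data['results']['results'] with an 'og:url' field,
    as in the answer above. `title_key` is a hypothetical field name.
    """
    anchors = []
    for result in data.get("results", {}).get("results", []):
        url = result.get("og:url")
        if not url:
            continue  # skip entries without a URL
        # Fall back to the URL itself when no headline field is present.
        text = result.get(title_key) or url
        anchors.append(
            '<a href="{}">{}</a>'.format(escape(url, quote=True), escape(text))
        )
    return anchors


# Hand-made sample payload for illustration only (not real API output).
sample = {
    "results": {
        "results": [
            {"og:url": "http://www.nytimes.com/example-a.html", "title": "Example A"},
            {"og:url": "http://www.nytimes.com/example-b.html"},
            {"title": "No URL here"},
        ]
    }
}

for anchor in results_to_anchors(sample):
    print(anchor)
```

With a real response, `results_to_anchors(response.json())` would be the equivalent call, once the actual headline field name has been checked against the returned JSON.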