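How can I extract information from this table using Python (ideally BeautifulSoup)?

Asked: 2017-01-13 01:18:40

Tags: python html web-scraping beautifulsoup

I am trying to collect information from this page: http://www.gatesfoundation.org/How-We-Work/Quick-Links/Grants-Database#q/page=2

Specifically, I am trying to use BeautifulSoup to collect the information in the table. I have the following code: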

import urllib2  # Python 2 standard library
from bs4 import BeautifulSoup

pagelink = 'http://www.gatesfoundation.org/How-We-Work/Quick-Links/Grants-Database#q/page=2'
page = urllib2.urlopen(pagelink)
soup = BeautifulSoup(page)
soup.prettify()
print soup

When I do this, the contents of the table (inside the "tablebody" tag) do not show up. Why is that? How can I extract the information from this table?

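4 Answers:

Answer 0 (score: 1)

What you are looking for does not come from that URL.

Basically, when you browse a web page manually in a modern browser, what you see usually does not come entirely from the URL you requested. The whole process is: fetch the content from the URL you originally requested -> parse the content -> load CSS / JavaScript / images (most of the time from different URLs) -> lay out the page and issue additional requests as the CSS / JavaScript asks. It may look as though all you fetched was the URL you typed into the address bar, but the browser actually does a lot of work behind the scenes to fully render a page for you.

Now, back to the page you are browsing: the table's content is actually populated by JavaScript. The browser first parses the page, then issues an additional request to fetch the content and renders it into the full page.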
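One way to confirm this for yourself is to count the table rows in the HTML you actually downloaded. The following is only a small sketch using the Python 3 standard library; the two HTML snippets are made-up stand-ins for "what urllib receives" and "what the browser ends up with", not copies of the real page, and the names RowCounter and count_body_rows are hypothetical.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count <tr> elements that appear inside a <tbody>."""

    def __init__(self):
        super().__init__()
        self.in_tbody = False
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.in_tbody = True
        elif tag == "tr" and self.in_tbody:
            self.rows += 1

    def handle_endtag(self, tag):
        if tag == "tbody":
            self.in_tbody = False


def count_body_rows(html):
    """Return how many body rows the given HTML string contains."""
    counter = RowCounter()
    counter.feed(html)
    return counter.rows


# Hypothetical HTML as served by the first request: the table body is empty.
static_html = (
    "<table><thead><tr><th>Grantee</th></tr></thead><tbody></tbody></table>"
)
# Hypothetical HTML after JavaScript has filled the table in.
rendered_html = (
    "<table><thead><tr><th>Grantee</th></tr></thead>"
    "<tbody><tr><td>Stanford University</td></tr></tbody></table>"
)

print(count_body_rows(static_html))    # 0
print(count_body_rows(rendered_html))  # 1
```

If the count is zero for the HTML you fetched but the browser shows rows, the data is arriving through a separate request.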

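You can use tools such as Fiddler or Charles to capture the whole process and analyze all the traffic to see what happens behind the scenes. In this case, it's a POST request that fetches the table's content: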

POST http://www.gatesfoundation.org/services/gfo/search.ashx HTTP/1.1
Host: www.gatesfoundation.org
Connection: keep-alive
Content-Length: 209
Accept: */*
Origin: http://www.gatesfoundation.org
X-Requested-With: XMLHttpRequest
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
Content-Type: application/json; charset=UTF-8;
Referer: http://www.gatesfoundation.org/How-We-Work/Quick-Links/Grants-Database
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.8
Cookie: gfo#lang=en; ASP.NET_SessionId=bdgjkbuyxxxcmfm40ejl2j1j; s_vnum=1641950372052%26vn%3D1; s_vi=[CS]v1|2C3C15910519363E-60000611E0003318[CE]; _vwo_uuid_v2=226610E3774AD35E29B29E7C20948349|f180edd6ae6830ab3de2432cd15b0bd4; __atuvc=3%7C2; __atuvs=58782b230157ce4a002; s_cc=true; s_nr=1484270424338; s_lv=1484270424339; s_lv_s=First%20Visit; s_invisit=true; gpv_p14=Awarded%20Grants; gpv_p19=How%20We%20Work; gpv_p21=no%20value; s_ppn=Awarded%20Grants; s_ppvl=Awarded%2520Grants%2C39%2C39%2C638%2C1366%2C638%2C1366%2C768%2C1%2CP; s_sq=%5B%5BB%5D%5D; s_ppv=Awarded%2520Grants%2C67%2C67%2C638%2C1366%2C638%2C1366%2C768%2C1%2CP

{"freeTextQuery":"","fieldQueries":"(@gfomediatype==\"Grant\")","facetsToRender":["gfocategories","gfotopics","gfoyear","gforegions"],"page":"2","resultsPerPage":"12","sortBy":"gfodate","sortDirection":"desc"}

The response is in JSON format:

{
  "topResults": [],
  "results": [
    {
      "amount": 648140,
      "categories": [
        "Global Health"
      ],
      "date": "2016-12-19T08:00:00",
      "description": "to validate biomarkers of growth stunting and environmental enteric dysfunction for the purpose of better understanding and diagnosing these related disease states",
      "grantee": "Stanford University",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "Stanford University",
      "topics": [
        "Enteric Diseases and Diarrhea"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1161946",
      "year": "2016"
    },
    {
      "amount": 550000,
      "categories": [
        "Global Development"
      ],
      "date": "2016-12-15T08:00:00",
      "description": "to provide vital life-saving and sustaining support to populations most affected by conflict in Syria",
      "grantee": "World Vision",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "World Vision",
      "topics": [
        "Emergency Response"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1169747",
      "year": "2016"
    },
    {
      "amount": 3315475,
      "categories": [
        "Global Development"
      ],
      "date": "2016-12-15T08:00:00",
      "description": "to fund activities focused on generating political will and building momentum for investment in nutrition at country level and supporting the development and implementation of the nutrition...",
      "grantee": "African Development Bank",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "African Development Bank",
      "topics": [
        "Nutrition"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1158425",
      "year": "2016"
    },
    {
      "amount": 500,
      "categories": [
        "Special Projects"
      ],
      "date": "2016-12-14T08:00:00",
      "description": "to provide for general operating support",
      "grantee": "City Club",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "City Club",
      "topics": [
        "Community Grants"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1169105",
      "year": "2016"
    },
    {
      "amount": 78522,
      "categories": [
        "Global Health"
      ],
      "date": "2016-12-12T08:00:00",
      "description": "to make the first description of specific histo-blood group antigens (HBGAs) in Zambian children and to assess their influence on immunogenicity of rotavirus vaccines.",
      "grantee": "CIDRZ",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "CIDRZ",
      "topics": [
        "Enteric Diseases and Diarrhea",
        "Vaccine Delivery",
        "Vaccine Development"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1162810",
      "year": "2016"
    },
    {
      "amount": 300000,
      "categories": [
        "US Program"
      ],
      "date": "2016-12-09T08:00:00",
      "description": "to provide matching i3 funds with the goal of building professional capacity through effective professional development for teacher leaders and principals to improve college ready outcomes...",
      "grantee": "Leading Educators Inc",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "Leading Educators Inc",
      "topics": [
        "K-12 Education"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1169456",
      "year": "2016"
    },
    {
      "amount": 85330,
      "categories": [
        "Global Health"
      ],
      "date": "2016-12-09T08:00:00",
      "description": "to collect and analyze existing data from multiple data streams from Asian and African sites to characterize early burden of rotavirus disease, which is less-well characterized than...",
      "grantee": "Emory University",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "Emory University",
      "topics": [
        "Enteric Diseases and Diarrhea",
        "Vaccine Delivery",
        "Vaccine Development"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1163272",
      "year": "2016"
    },
    {
      "amount": 13000,
      "categories": [
        "US Program"
      ],
      "date": "2016-12-08T08:00:00",
      "description": "to support LearnLaunch Across Boundaries Conference",
      "grantee": "LearnLaunch Institute",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "LearnLaunch Institute",
      "topics": [
        "K-12",
        "K-12 Education"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1169222",
      "year": "2016"
    },
    {
      "amount": 250000,
      "categories": [
        "US Program"
      ],
      "date": "2016-12-08T08:00:00",
      "description": "to improve outcomes for English Language Learners in Seattle and South King County",
      "grantee": "OneAmerica",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "OneAmerica",
      "topics": [
        "Community Grants"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1164859",
      "year": "2016"
    },
    {
      "amount": 85000,
      "categories": [
        "Global Health"
      ],
      "date": "2016-12-08T08:00:00",
      "description": "to fund cholera / enteric researchers (travel costs) to attend the 51st US-Japan Cholera Conference that they would otherwise not be able to afford to contribute to.",
      "grantee": "International Vaccine Institute",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "International Vaccine Institute",
      "topics": [
        "Enteric Diseases and Diarrhea"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1168711",
      "year": "2016"
    },
    {
      "amount": 6000,
      "categories": [
        "Special Projects"
      ],
      "date": "2016-12-07T08:00:00",
      "description": "to provide for general operating support",
      "grantee": "Center for US Global Leadership",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "Center for US Global Leadership",
      "topics": [
        "Community Grants"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1167614",
      "year": "2016"
    },
    {
      "amount": 3000000,
      "categories": [
        "US Program"
      ],
      "date": "2016-12-07T08:00:00",
      "description": "to support the Center on Education and the Workforce's research and policy agenda to better align postsecondary education and the workforce, with an emphasis on inequalities in the...",
      "grantee": "Georgetown University",
      "iconUrl": "",
      "languageCode": "en",
      "mediaType": "Grant",
      "regions": [
        ""
      ],
      "subtitle": null,
      "thumbnailAltText": "",
      "thumbnailUrl": "",
      "title": "Georgetown University",
      "topics": [
        "Postsecondary Success"
      ],
      "url": "/How-We-Work/Quick-Links/Grants-Database/Grants/2016/12/OPP1165028",
      "year": "2016"
    }
  ],
  "facets": [
    {
      "field": "gfocategories",
      "items": [
        {
          "name": "US Program",
          "count": 5859
        },
        {
          "name": "Global Development",
          "count": 4441
        },
        {
          "name": "Global Health",
          "count": 3719
        },
        {
          "name": "Communications",
          "count": 1149
        },
        {
          "name": "Global Policy & Advocacy",
          "count": 879
        },
        {
          "name": "Special Projects",
          "count": 465
        }
      ]
    },
    {
      "field": "gfotopics",
      "items": [
        {
          "name": "Community Grants",
          "count": 2393
        },
        {
          "name": "K-12 Education",
          "count": 2007
        },
        {
          "name": "Global Policy & Advocacy",
          "count": 1507
        },
        {
          "name": "Communications",
          "count": 1246
        },
        {
          "name": "Discovery and Translational Sciences",
          "count": 1227
        },
        {
          "name": "Agricultural Development",
          "count": 866
        },
        {
          "name": "K-12",
          "count": 862
        },
        {
          "name": "HIV",
          "count": 690
        },
        {
          "name": "Global Libraries",
          "count": 671
        },
        {
          "name": "Vaccine Delivery",
          "count": 655
        },
        {
          "name": "Postsecondary Success",
          "count": 645
        },
        {
          "name": "Family Health: Family Planning",
          "count": 625
        },
        {
          "name": "Family Health: Nutrition",
          "count": 530
        },
        {
          "name": "Family Health: Maternal, Newborn, and Child Health",
          "count": 433
        },
        {
          "name": "Community Relations",
          "count": 420
        },
        {
          "name": "Vaccine Development",
          "count": 393
        },
        {
          "name": "Not Available",
          "count": 383
        },
        {
          "name": "Malaria",
          "count": 377
        },
        {
          "name": "Water, Sanitation, and Hygiene",
          "count": 374
        },
        {
          "name": "Emergency Response",
          "count": 368
        },
        {
          "name": "Enteric Diseases and Diarrhea",
          "count": 359
        },
        {
          "name": "Family Interest Grants",
          "count": 313
        },
        {
          "name": "Pneumonia",
          "count": 286
        },
        {
          "name": "Nutrition",
          "count": 284
        },
        {
          "name": "Financial Services for the Poor",
          "count": 277
        },
        {
          "name": "Tuberculosis",
          "count": 277
        },
        {
          "name": "Libraries",
          "count": 262
        },
        {
          "name": "Charitable Sector Support",
          "count": 224
        },
        {
          "name": "Pacific Northwest: Family Homelessness",
          "count": 223
        },
        {
          "name": "College Ready",
          "count": 205
        },
        {
          "name": "Research & Development",
          "count": 195
        },
        {
          "name": "Polio",
          "count": 188
        },
        {
          "name": "Pacific Northwest: Early Learning",
          "count": 182
        },
        {
          "name": "Integrated Delivery",
          "count": 172
        },
        {
          "name": "Table Sponsorships",
          "count": 164
        },
        {
          "name": "Integrated Development",
          "count": 119
        },
        {
          "name": "Strategic Partnerships",
          "count": 117
        },
        {
          "name": "India",
          "count": 116
        },
        {
          "name": "Neglected Tropical Diseases",
          "count": 115
        },
        {
          "name": "Africa",
          "count": 89
        },
        {
          "name": "Special Initiatives (Active projects are now part of other strategies)",
          "count": 67
        },
        {
          "name": "Neglected and Infectious Diseases",
          "count": 66
        },
        {
          "name": "China",
          "count": 43
        },
        {
          "name": "Scholarships",
          "count": 39
        },
        {
          "name": "Tobacco",
          "count": 33
        },
        {
          "name": "Europe",
          "count": 22
        },
        {
          "name": "Special Initiatives",
          "count": 22
        },
        {
          "name": "Philanthropic Partnerships",
          "count": 17
        },
        {
          "name": "Europe Office",
          "count": 4
        }
      ]
    },
    {
      "field": "gfoyear",
      "items": [
        {
          "name": "2009 and earlier",
          "count": 6608
        },
        {
          "name": "2015",
          "count": 1652
        },
        {
          "name": "2016",
          "count": 1546
        },
        {
          "name": "2013",
          "count": 1473
        },
        {
          "name": "2014",
          "count": 1472
        },
        {
          "name": "2012",
          "count": 1260
        },
        {
          "name": "2011",
          "count": 1240
        },
        {
          "name": "2010",
          "count": 921
        },
        {
          "name": "2017",
          "count": 3
        }
      ]
    },
    {
      "field": "gforegions",
      "items": [
        {
          "name": "North America",
          "count": 5817
        },
        {
          "name": "Sub-Saharan Africa",
          "count": 1546
        },
        {
          "name": "Asia",
          "count": 1192
        },
        {
          "name": "Middle East, North Africa, and Greater Arabia",
          "count": 223
        },
        {
          "name": "South America",
          "count": 152
        },
        {
          "name": "Europe",
          "count": 130
        },
        {
          "name": "Central America and the Caribbean",
          "count": 110
        },
        {
          "name": "Australia and Oceania",
          "count": 29
        }
      ]
    }
  ],
  "totalCount": 16175
}

With the built-in json module, you can easily extract the information you need.
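As a rough sketch of that idea in Python 3 (the question's code is Python 2), the captured POST can be replayed with the standard library and the reply decoded with json. The URL, headers and payload are copied from the capture above. The function names build_payload, fetch_page, extract_rows and page_count are made up for illustration, and it is an assumption that the endpoint still exists and answers without the captured cookies.

```python
import json
import math
import urllib.request

# Endpoint taken from the captured request; it may have changed since then.
SEARCH_URL = "http://www.gatesfoundation.org/services/gfo/search.ashx"


def build_payload(page, per_page=12):
    """Build the JSON body the page sends, with the page number swapped in."""
    return {
        "freeTextQuery": "",
        "fieldQueries": '(@gfomediatype=="Grant")',
        "facetsToRender": ["gfocategories", "gfotopics", "gfoyear", "gforegions"],
        "page": str(page),
        "resultsPerPage": str(per_page),
        "sortBy": "gfodate",
        "sortDirection": "desc",
    }


def fetch_page(page):
    """POST the search request and return the decoded JSON response."""
    body = json.dumps(build_payload(page)).encode("utf-8")
    request = urllib.request.Request(
        SEARCH_URL,
        data=body,  # supplying data makes this a POST
        headers={
            "Content-Type": "application/json; charset=UTF-8",
            "X-Requested-With": "XMLHttpRequest",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_rows(response):
    """Pick a few fields out of each entry in the "results" list."""
    return [
        (grant["grantee"], grant["amount"], grant["date"][:10],
         ", ".join(grant["topics"]))
        for grant in response["results"]
    ]


def page_count(response, per_page=12):
    """Number of pages needed to cover "totalCount" results."""
    return math.ceil(response["totalCount"] / per_page)
```

Calling extract_rows(fetch_page(2)) would then give one tuple per grant on page 2, and page_count tells you how many pages to loop over if you want the whole database.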

Answer 1 (score: 1)

You can use dryscrape to get it:

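import dryscrape
from bs4 import BeautifulSoup

# dryscrape drives a headless WebKit browser, so the page's JavaScript runs
ses = dryscrape.Session()
ses.visit("http://www.gatesfoundation.org/How-We-Work/Quick-Links/Grants-Database#q/page=2")
# ses.body() is the HTML after rendering, which now contains the table rows
s = BeautifulSoup(ses.body())
s2 = s.select("table.table.push-bottom")[0]
print s2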

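Answer 2 (score: 0)

You cannot use BeautifulSoup4 the way you expect here, because the page is rendered via JavaScript.

You can use dryscrape or selenium. In my opinion dryscrape is more user-friendly, but it is not officially supported on Windows.

Also, take a look at avis' excellent answer on this question: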

https://stackoverflow.com/a/26440563/1429776

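Answer 3 (score: 0)

This page is rendered by JavaScript. requests and urllib cannot execute JS; they only fetch the HTML source. Disable JS in your browser and, as you can see, there is no table.

Use Selenium, or mimic the request this page makes.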
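For the Selenium route, a minimal Python 3 sketch might look like the following. It assumes Selenium 4 and a local Chrome installation. The CSS selector "table tbody tr" is a guess at the page structure rather than something confirmed against the site, and render_page and TableRows are hypothetical names.

```python
from html.parser import HTMLParser


def render_page(url, timeout=15):
    """Load url in Chrome, wait for table rows to appear, return the HTML."""
    # Imported here so the parsing helper below works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Assumed selector: wait until JavaScript has added at least one row.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "table tbody tr"))
        )
        return driver.page_source
    finally:
        driver.quit()


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell text.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
```

To use it, feed the result of render_page(...) into a TableRows instance and read its rows attribute. The rendered HTML can just as well be handed to BeautifulSoup, as in the dryscrape answer.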