Why can't I access the information inside the tbody?

Date: 2019-06-17 19:54:45

Tags: web-scraping beautifulsoup

[Here is the page source of the website][1]. I am web scraping with BeautifulSoup, but I cannot find any `tr` inside the `tbody`, even though the page source clearly contains them. `find_all` only returns the `tr` elements inside the `thead`.

The link I want to scrape: https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&start_year=1992&end_year=2019&group_by=year

Here is some of my code:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&start_year=1992&end_year=2019&group_by=year"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
tr = soup.find_all("tr")
print(tr)
```


  [1]: https://i.stack.imgur.com/NFwEV.png
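The behavior above can be reproduced without the live site: BeautifulSoup only parses the HTML the server returns, so rows that JavaScript inserts into the `tbody` afterwards are invisible to it. A minimal sketch (the markup below is a simplified stand-in for what the server actually sends):

```python
from bs4 import BeautifulSoup

# Simplified version of the served HTML: the thead is present,
# but the tbody stays empty until JavaScript fills it in.
html = """
<table class="js-report-builder-table">
  <thead><tr><th>Name</th><th>Organization</th></tr></thead>
  <tbody></tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")
print(len(rows))  # 1 -- only the header row exists in the raw HTML
```

This is why `find_all("tr")` on the raw response only ever sees the header row.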

2 answers:

Answer 0 (score: 1):

To fetch the table content using the selectors you see via inspect element, you can try pyppeteer, as shown below. The approach is asynchronous, so unless you can find a usable API, I would suggest going with this:

```python
import asyncio
from pyppeteer import launch

url = "https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&start_year=1992&end_year=2019&group_by=year"

async def get_table(link):
    browser = await launch(headless=False)
    [page] = await browser.pages()
    await page.goto(link)
    # wait until JavaScript has populated the table body
    await page.waitForSelector("table.js-report-builder-table tr td")
    for tr in await page.querySelectorAll("table.js-report-builder-table tr"):
        tds = [await page.evaluate('e => e.innerText', td) for td in await tr.querySelectorAll("th,td")]
        print(tds)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(get_table(url))
```

The output looks like:

```
['Name', 'Organization', 'Date', 'Location', 'Attack', 'Type of Death', 'Charge']
['Abadullah Hananzai', 'Radio Azadi,Radio Free Europe/Radio Liberty', 'April 30, 2018', 'Afghanistan', 'Killed', 'Murder', '']
['Abay Hailu', 'Agiere', 'February 9, 1998', 'Ethiopia', 'Killed', 'Dangerous Assignment', '']
['Abd al-Karim al-Ezzo', 'Freelance', 'December 21, 2012', 'Syria', 'Killed', 'Crossfire', '']
['Abdallah Bouhachek', 'Révolution et Travail', 'February 10, 1996', 'Algeria', 'Killed', 'Murder', '']
['Abdel Aziz Mahmoud Hasoun', 'Masar Press', 'September 5, 2013', 'Syria', 'Killed', 'Crossfire', '']
['Abdel Karim al-Oqda', 'Shaam News Network', 'September 19, 2012', 'Syria', 'Killed', 'Murder', '']
```
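If you want to persist the scraped rows, the standard-library `csv` module is enough. A minimal sketch, using two rows copied from the output above (the filename `journalists.csv` is arbitrary):

```python
import csv

# Rows in the shape printed by get_table(): header first, then data.
rows = [
    ['Name', 'Organization', 'Date', 'Location', 'Attack', 'Type of Death', 'Charge'],
    ['Abadullah Hananzai', 'Radio Azadi,Radio Free Europe/Radio Liberty',
     'April 30, 2018', 'Afghanistan', 'Killed', 'Murder', ''],
]

# Write them out; the csv module handles the embedded comma in the
# organization field by quoting it.
with open('journalists.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)
```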

Answer 1 (score: 0):

The data is requested from an API that returns JSON, i.e. it is added dynamically, so it does not appear in your request to the landing page. You can find the API endpoint used to fetch the info in the browser's Network tab.

You can change one of the parameters to a number larger than the expected result set, then check whether any further requests are needed.

```python
import requests

r = requests.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=1&pageSize=2000&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
```

Otherwise, you can make an initial call, check how many more requests need to be made, and change the appropriate parameter in the URL. You can see that `pageCount` is returned.

Here is the relevant part of the response for a page size of 20:

```
{'rowCount': 1343,
 'pageNum': 1,
 'pageSize': '20',
 'pageCount': 68,
```

That gives you all the information needed for a loop that collects every result.
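Such a loop can be sketched by building one URL per page from the returned `pageCount`. The base URL below is shortened and the filter parameters are omitted for brevity; only the `pageNum`/`pageSize` parameters come from the response above:

```python
# Hypothetical sketch: one request URL per page of results.
base = ('https://cpj.org/api/datamanager/reports/entries'
        '?distinct(personId)&sort=fullName'
        '&pageNum={page}&pageSize={size}')  # filter params omitted for brevity

page_count = 68  # 'pageCount' from the first response
page_size = 20   # 'pageSize' from the first response

urls = [base.format(page=p, size=page_size) for p in range(1, page_count + 1)]
print(len(urls))  # 68 -- one request per page
```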

After changing the page size to a larger number, you see the following:

```
{'rowCount': 1343,
 'pageNum': 1,
 'pageSize': '2000',
 'pageCount': 1,
```

You can convert the result into a table with pandas:

```python
import requests
import pandas as pd

r = requests.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=1&pageSize=2000&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
df = pd.DataFrame(r['data'])
print(df)
```

A sample of `df`:



Example of checking the actual count and making a further request for the remaining records:

```python
import requests
import pandas as pd

request_number = 1000
url = ('https://cpj.org/api/datamanager/reports/entries?distinct(personId)'
       '&includes=organizations,fullName,location,status,typeOfDeath,charges,'
       'startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName'
       '&pageNum={page}&pageSize={size}'
       '&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),'
       'in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)'
       '&ge(year,1992)&le(year,2019)')

with requests.Session() as s:
    r = s.get(url.format(page=1, size=request_number)).json()
    df = pd.DataFrame(r['data'])
    actual_number = r['rowCount']
    if actual_number > request_number:
        # Keep the same page size so page 2 starts exactly where page 1 ended.
        r = s.get(url.format(page=2, size=request_number)).json()
        df2 = pd.DataFrame(r['data'])
        final = pd.concat([df, df2])
    else:
        final = df
```