[Here is the source of the website][1]

I am doing web scraping with BeautifulSoup but cannot find any tr inside tbody, even though the tr rows are there when I inspect the site's source. find_all only returns the tr inside thead.
Here is some of my code:
```python
from urllib.request import urlopen  # urlopen is used below but was never imported
from bs4 import BeautifulSoup

url = "https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&start_year=1992&end_year=2019&group_by=year"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)

tr = soup.find_all("tr")
print(tr)
```
[1]: https://i.stack.imgur.com/NFwEV.png
Answer 0 (score: 1)
To grab the tabular content using the selectors you see when you inspect the element, you can give pyppeteer a try; below I show how to work with it. The following approach is asynchronous, so unless you can find any API to play with, I'd suggest you go with it:
```python
import asyncio
from pyppeteer import launch

url = "https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&start_year=1992&end_year=2019&group_by=year"

async def get_table(link):
    browser = await launch(headless=False)
    [page] = await browser.pages()
    await page.goto(link)
    # Wait until the JavaScript-rendered body rows are actually in the DOM
    await page.waitForSelector("table.js-report-builder-table tr td")
    for tr in await page.querySelectorAll("table.js-report-builder-table tr"):
        tds = [await page.evaluate('e => e.innerText', td) for td in await tr.querySelectorAll("th,td")]
        print(tds)
    await browser.close()

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(get_table(url))
```
The output is like:
```
['Name', 'Organization', 'Date', 'Location', 'Attack', 'Type of Death', 'Charge']
['Abadullah Hananzai', 'Radio Azadi,Radio Free Europe/Radio Liberty', 'April 30, 2018', 'Afghanistan', 'Killed', 'Murder', '']
['Abay Hailu', 'Agiere', 'February 9, 1998', 'Ethiopia', 'Killed', 'Dangerous Assignment', '']
['Abd al-Karim al-Ezzo', 'Freelance', 'December 21, 2012', 'Syria', 'Killed', 'Crossfire', '']
['Abdallah Bouhachek', 'Révolution et Travail', 'February 10, 1996', 'Algeria', 'Killed', 'Murder', '']
['Abdel Aziz Mahmoud Hasoun', 'Masar Press', 'September 5, 2013', 'Syria', 'Killed', 'Crossfire', '']
['Abdel Karim al-Oqda', 'Shaam News Network', 'September 19, 2012', 'Syria', 'Killed', 'Murder', '']
```
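If you'd rather end up with a DataFrame than printed lists, here is a minimal variation of the same function (a sketch, assuming pandas is installed; the selectors are the ones used above):

```python
import asyncio
import pandas as pd
from pyppeteer import launch

url = "https://cpj.org/data/killed/?status=Killed&motiveConfirmed%5B%5D=Confirmed&type%5B%5D=Journalist&start_year=1992&end_year=2019&group_by=year"

async def get_rows(link):
    browser = await launch(headless=True)  # no need to watch the browser here
    [page] = await browser.pages()
    await page.goto(link)
    await page.waitForSelector("table.js-report-builder-table tr td")
    # Collect the cell texts row by row instead of printing them
    rows = [
        [await page.evaluate('e => e.innerText', td)
         for td in await tr.querySelectorAll("th,td")]
        for tr in await page.querySelectorAll("table.js-report-builder-table tr")
    ]
    await browser.close()
    return rows

rows = asyncio.get_event_loop().run_until_complete(get_rows(url))
df = pd.DataFrame(rows[1:], columns=rows[0])  # first row is the header
print(df.head())
```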
Answer 1 (score: 0)
The data is requested from an API which returns json, i.e. it is added dynamically, so it is not present in your request to the landing page. You can find the API endpoint used to get the info in the browser's network tab.
You can alter one of the params to a number greater than the expected result set and then check whether any further requests are required.
```python
import requests

r = requests.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=1&pageSize=2000&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
```
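Since the paging metadata comes back alongside the data, you can check directly whether that single oversized request was enough; a quick look, using the keys shown in the response snippets below:

```python
# pageCount == 1 means the whole result set came back in one response;
# rowCount is the server-side total, len(r['data']) is what we actually got.
print(r['rowCount'], r['pageCount'], len(r['data']))
```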
Otherwise, you can make an initial call, see how many more requests are needed, and alter the appropriate params in the url accordingly. You can see that pageCount is returned.
Here is the relevant section of the response for a page size of 20:
```python
{'rowCount': 1343,
 'pageNum': 1,
 'pageSize': '20',
 'pageCount': 68,
```
That is all the relevant info you need for a loop to get all of the results; one such loop is sketched below.
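A minimal sketch of that loop, paging through pageNum until pageCount is exhausted (the query string is the same one used in this answer, with pageNum substituted per page and pageSize left at 20):

```python
import requests
import pandas as pd

base = ('https://cpj.org/api/datamanager/reports/entries?distinct(personId)'
        '&includes=organizations,fullName,location,status,typeOfDeath,charges,'
        'startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName'
        '&pageNum={}&pageSize=20'
        '&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),'
        'in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)'
        '&ge(year,1992)&le(year,2019)')

frames = []
with requests.Session() as s:
    first = s.get(base.format(1)).json()
    frames.append(pd.DataFrame(first['data']))
    # pageCount in the first response tells us how many pages exist in total
    for page in range(2, first['pageCount'] + 1):
        frames.append(pd.DataFrame(s.get(base.format(page)).json()['data']))
result = pd.concat(frames, ignore_index=True)
```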
After changing pageSize to a larger number, you see the following instead:
```python
{'rowCount': 1343,
 'pageNum': 1,
 'pageSize': '2000',
 'pageCount': 1,
```
You can convert the result to a table with pandas:
```python
import requests
import pandas as pd

r = requests.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=1&pageSize=2000&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
df = pd.DataFrame(r['data'])
print(df)
```
A sample of df:
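To see such a sample yourself, a quick inspection of the frame built above:

```python
print(df.shape)             # rows fetched x fields included in the query
print(df.columns.tolist())  # column names follow the API's field names
print(df.head())
```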
An example of checking the actual count and making a further request for the remaining records if required:
```python
import requests
import pandas as pd

request_number = 1000

with requests.Session() as s:
    r = s.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=1&pageSize=' + str(request_number) + '&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
    df = pd.DataFrame(r['data'])
    actual_number = r['rowCount']
    if actual_number > request_number:
        # Keep the same pageSize so pageNum=2 starts where page 1 ended
        # (assuming the usual offset = (pageNum - 1) * pageSize semantics);
        # the response then holds just the remaining rows.
        r = s.get('https://cpj.org/api/datamanager/reports/entries?distinct(personId)&includes=organizations,fullName,location,status,typeOfDeath,charges,startDisplay,mtpage,country,type,motiveConfirmed&sort=fullName&pageNum=2&pageSize=' + str(request_number) + '&in(status,%27Killed%27)&or(eq(type,%22media%20worker%22),in(motiveConfirmed,%27Confirmed%27))&in(type,%27Journalist%27)&ge(year,1992)&le(year,2019)').json()
        df2 = pd.DataFrame(r['data'])
        final = pd.concat([df, df2])
    else:
        final = df
```
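Finally, a quick sanity check that the concatenated frame holds everything the API reports (names as in the snippet above):

```python
# actual_number came back as rowCount in the first response;
# the combined frame should match it.
print(len(final), actual_number)
```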