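I am trying to use Python to get the data under the second column that carries the code "CATAC2021", where "aaaa" stands for the part of the code that follows it. These are the event IDs.
I tried the code below to access the second column of the table and retrieve the ID data from its rows, but so far I have had no success. Does anyone know where I am going wrong and how to fix it?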
from bs4 import BeautifulSoup
from urllib import request
page = request.urlopen('http://shakemapcam.ethz.ch/archive/').read()
soup = BeautifulSoup(page)
desired_table = soup.findAll('table')[2]
# Find the columns you want data from
headers = desired_table.findAll('th')
desired_columns = []
for th in headers:
    if 'CATAC2021' in th.string:
        desired_columns.append([headers.index(th), th.getText()])
# Iterate through each row grabbing the data from the desired columns
rows = desired_table.findAll('tr')
for row in rows[1:]:
    cells = row.findAll('td')
    row_name = row.findNext('th').getText()
    for column in desired_columns:
        print(cells[column[0]].text, row_name, column[1])
Answer 0: (score: 1)
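I would use pandas here to scrape the table, then a regular expression to extract the pattern (the text after the four digits and before the first /). Note that the table also has an Event ID column, so make sure you know the difference. I named mine eventId.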
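Before running this against the live table, the pattern can be checked on a single string. The following is a minimal sketch using only the standard-library `re` module. The helper name `split_name` and the named groups `prefix`, `year`, and `event_id` are illustrative choices, not part of the original answer, and the sketch assumes every entry looks like `<letters><four-digit year><letters> / lat / lon`.

```python
import re

# Assumed layout: <prefix letters><4-digit year><event id> / <lat> / <lon>
NAME_PATTERN = re.compile(r"^(?P<prefix>\D+)(?P<year>\d{4})(?P<event_id>[^\s/]+)")


def split_name(name):
    """Return (prefix, year, event_id) for one Name/Epicenter string, or None."""
    match = NAME_PATTERN.match(name.strip())
    if match is None:
        return None
    return match.group("prefix"), match.group("year"), match.group("event_id")


print(split_name("CATAC2021efod / 6.354496002 / -76.18144226"))
print(split_name("ineterloc2018acor / 12.21397209 / -86.7282486"))
```

Returning `None` for strings that do not fit the assumed layout makes unexpected rows easy to spot instead of silently producing wrong IDs.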
import pandas as pd
url = 'http://shakemapcam.ethz.ch/archive/'
# Take the last table on the page and use its first row as the header
df = pd.read_html(url, header=0)[-1]
# Group 0 is the text before the four-digit year; group 1 is the text between the year and the first " /"
parts = df['Name/Epicenter'].str.extract(r'(.*)\d{4}(.*)(\s//?.*)(//?.*)')
df['prefix'] = parts[0]
df['eventId'] = parts[1]
Output:
print(df[['Name/Epicenter','prefix','eventId']])
                                      Name/Epicenter     prefix  eventId
0         CATAC2021efod / 6.354496002 / -76.18144226      CATAC     efod
1         CATAC2021edxe / 15.67289066 / -93.40866852      CATAC     edxe
2         CATAC2021ebzg / 9.406171799 / -84.55581665      CATAC     ebzg
3         CATAC2021eayx / 14.03658199 / -92.30122375      CATAC     eayx
4         CATAC2021eayx / 14.03546429 / -92.30183411      CATAC     eayx
...                                              ...        ...      ...
1574   ineterloc2018acor / 12.21397209 / -86.7282486  ineterloc     acor
1575  ineterloc2018acor / 12.21113586 / -86.73029327  ineterloc     acor
1576  ineterloc2018acor / 12.20839691 / -86.73122406  ineterloc     acor
1577  ineterloc2018aatd / 16.59416389 / -86.35289764  ineterloc     aatd
1578  ineterloc2018aatd / 16.64553833 / -86.26078796  ineterloc     aatd

[1579 rows x 3 columns]
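To narrow the result down to the `CATAC2021` rows that the question asks about, the extracted columns can be filtered. This is a minimal sketch, not a tested scrape: it builds a small DataFrame by hand from a few of the rows shown above so that it runs without network access. The `year` column, the named-group pattern, and the variable `catac_2021_ids` are assumptions added for illustration.

```python
import pandas as pd

# Hand-copied sample of the 'Name/Epicenter' column; with the live page this
# frame would come from pd.read_html(url, header=0)[-1] instead.
df = pd.DataFrame({
    "Name/Epicenter": [
        "CATAC2021efod / 6.354496002 / -76.18144226",
        "CATAC2021edxe / 15.67289066 / -93.40866852",
        "ineterloc2018acor / 12.21397209 / -86.7282486",
    ]
})

# Named groups become column names in the frame returned by str.extract.
parts = df["Name/Epicenter"].str.extract(
    r"^(?P<prefix>\D+)(?P<year>\d{4})(?P<eventId>[^\s/]+)"
)
df = df.join(parts)

# Keep only the rows whose code starts with CATAC2021.
mask = (df["prefix"] == "CATAC") & (df["year"] == "2021")
catac_2021_ids = df.loc[mask, "eventId"].tolist()
print(catac_2021_ids)
```

Splitting the year into its own column keeps the filter explicit, so the same frame can be reused for other prefixes or years without changing the pattern.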