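I stumbled upon an excellent post on scraping with Beautiful Soup and decided to take on the task of scraping some data off the internet.
I am using flight data from Flight Radar 24 and, following what the blog post describes, trying to crawl the flight data pages automatically.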

import requests
import bs4

root_url = 'http://www.flightradar24.com'
index_url = root_url + '/data/flights/tigerair-tgw/'


def get_flight_id_urls():
    # Collect the relative links to the individual flight pages
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    return [a.attrs.get('href') for a in soup.select('div.list-group a[href^="/data"]')]


flight_id_urls = get_flight_id_urls()
for flight_id_url in flight_id_urls:
    temp_url = root_url + flight_id_url
    response = requests.get(temp_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    try:
        table = soup.find('table')
        rows = table.find_all('tr')
        for row in rows:
            flight_data = {}
            flight_data['title'] = soup.select('div#cntPagePreTitle h1')[0].get_text()
            flight_data['tr'] = row  # error here
            print(flight_data)
    except AttributeError as e:
        # soup.find('table') returned None, so the page has no table
        raise ValueError("No valid table found")

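I got as far as the table, then realized I don't know how to traverse the table's elements to get the data embedded in each column.
Does any kind soul have a clue, or even a pointer to a tutorial, so that I can read up on how to extract the data?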
P.S.: The excellent tutorial is by Miguel Grinberg.

Added:

try:
    table = soup.find('table')
    rows = table.find_all('tr')
    heads = [i.text.strip() for i in table.select('thead th')]
    for tr in table.select('tbody tr'):
        flight_data = {}
        flight_data['title'] = soup.select('div#cntPagePreTitle h1')[0].get_text()
        flight_data['From'] = tr.select('td.From')
        flight_data['To'] = tr.select('td.To')
        print(flight_data)
except AttributeError as e:
    raise ValueError("No valid table found")

I changed the last part of the code to build a data object, but I can't seem to get the data out.
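
One likely reason the attempt above comes back empty: `td.From` is a CSS class selector, so it only matches cells written as `<td class="From">`, and `select` returns a list of tags rather than text. The sketch below uses made-up markup (an assumption, not the real Flight Radar 24 page) in which the column names appear only in the header row.

```python
import bs4

# Hypothetical markup: column names live in <th>, the <td> cells have no class.
html = """
<table>
  <thead><tr><th>From</th><th>To</th></tr></thead>
  <tbody><tr><td>Singapore (SIN)</td><td>Penang (PEN)</td></tr></tbody>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")
tr = soup.select("tbody tr")[0]

# Class selector: looks for <td class="From"> and finds nothing here.
by_class = tr.select("td.From")

# Plain tag selector: returns every cell; get_text() turns each tag into a string.
cells = [td.get_text(strip=True) for td in tr.select("td")]

print(by_class)  # []
print(cells)     # ['Singapore (SIN)', 'Penang (PEN)']
```

If the real cells do carry classes, the selector would match, but the values stored would still be tag objects until `get_text()` is called on them.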
Final edit:

import requests
import bs4

root_url = 'http://www.flightradar24.com'
index_url = root_url + '/data/flights/tigerair-tgw/'


def get_flight_id_urls():
    # Collect the relative links to the individual flight pages
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    return [a.attrs.get('href') for a in soup.select('div.list-group a[href^="/data"]')]


flight_id_urls = get_flight_id_urls()
for flight_id_url in flight_id_urls:
    temp_url = root_url + flight_id_url
    response = requests.get(temp_url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    try:
        table = soup.find('table')
        # Iterate over the body rows only; the header row has no data-* attributes
        for tr in table.select('tbody tr'):
            flight_data = {}
            flight_data['flight_number'] = tr['data-flight-number']
            flight_data['from'] = tr['data-name-from']
            print(flight_data)
    except AttributeError as e:
        # soup.find('table') returned None, so the page has no table
        raise ValueError("No valid table found")

P.P.S.: Thanks to @amow for the great help :D
Answer 0 (score: 4)
Start with the table element from the HTML, held in the variable table.

# Column names come from the header cells
heads = [i.text.strip() for i in table.select('thead th')]
for tr in table.select('tbody tr'):
    # Pair each cell's text with the header of the same column
    datas = [i.text.strip() for i in tr.select('td')]
    print(dict(zip(heads, datas)))

Output

{
    u'STD': u'06:30',
    u'Status': u'Scheduled',
    u'ATD': u'-',
    u'From': u'Singapore (SIN)',
    u'STA': u'07:55',
    u'\xa0': u'',  # This is the last column and has no meaning
    u'To': u'Penang (PEN)',
    u'Aircraft': u'-',
    u'Date': u'2015-04-19'
}

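
For anyone who wants to try the header-and-cell pairing without hitting the live site, here is a minimal self-contained sketch. The inline HTML and the helper name `table_to_dicts` are assumptions made for illustration; the cell values are borrowed from the output above. It also drops columns whose header is blank, which is one way to deal with the meaningless last column.

```python
import bs4

# Hypothetical stand-in for the flight table; the last column has a blank header.
html = """
<table>
  <thead><tr><th>Date</th><th>From</th><th>To</th><th></th></tr></thead>
  <tbody>
    <tr><td>2015-04-19</td><td>Singapore (SIN)</td><td>Penang (PEN)</td><td></td></tr>
  </tbody>
</table>
"""


def table_to_dicts(table):
    """Return one dict per body row, keyed by the header text of each column."""
    heads = [th.get_text(strip=True) for th in table.select("thead th")]
    rows = []
    for tr in table.select("tbody tr"):
        cells = [td.get_text(strip=True) for td in tr.select("td")]
        # Skip columns whose header is empty, since they carry no meaning.
        rows.append({h: c for h, c in zip(heads, cells) if h})
    return rows


soup = bs4.BeautifulSoup(html, "html.parser")
flights = table_to_dicts(soup.find("table"))
print(flights)
```

Because `zip` stops at the shorter sequence, a row with fewer cells than headers simply yields a shorter dict instead of raising an error.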
If you want the data stored in the attributes of the tr tag itself, just use

tr['data-data'], tr['data-flight-number']

and so on.
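
A short sketch of that attribute access, again on made-up markup: the flight number `TR0000` and the attribute values are placeholders, and only the attribute names `data-flight-number` and `data-name-from` come from the question. Indexing a tag like a dict raises `KeyError` when the attribute is absent, while `.get()` returns `None`, which can be handy for rows that lack some attributes.

```python
import bs4

# Hypothetical row; real pages may use different attribute names or values.
html = """
<table><tbody>
  <tr data-flight-number="TR0000" data-name-from="Singapore"><td>2015-04-19</td></tr>
</tbody></table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")
tr = soup.select("tbody tr")[0]

flight = {
    "flight_number": tr["data-flight-number"],  # raises KeyError if missing
    "from": tr["data-name-from"],
    "to": tr.get("data-name-to"),               # None here, attribute is absent
}
print(flight)

# tr.attrs is a plain dict of every attribute on the tag.
print(sorted(tr.attrs))
```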