I need to pull all of the following data from the "most viewed in the last 7 days" search on https://www.airliners.net/. The search returns a list of aircraft photos. Is it possible to loop through all of them? Example from the first photo:
Aeroflot-Russian Airlines / Sukhoi SSJ-100-95-LR-100 Superjet 100 (RRJ-95LR) /
Moscow - Sheremetyevo (SVO / UUEE) / Russia - May 5, 2019 / REG: RA-89098 / MSN: 95135
In this example there are 56 pages to loop through. At the moment it takes me a whole weekend of copy-and-paste work for my aviation project, and I am hoping Python can solve this.
I have tried some web-scraping code, but I could not get it to work.
If possible, I would like to save the data to a comma-separated (CSV) file.
Answer 0: (score: 0)
This may not be 100% tested, but it should help.
# -*- coding: utf-8 -*-
import pandas
import requests
import lxml.html

data = []

with requests.Session() as session:
    loop = 1
    while True:
        response = session.get('https://www.airliners.net/search', headers={
            'authority': 'www.airliners.net',
            'upgrade-insecure-requests': '1',
            'dnt': '1',
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
            'referer': 'https://www.airliners.net/',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'es-ES,es;q=0.9,en;q=0.8',
        }, params=(
            ('dateAccepted', '7'),
            ('sortBy', 'viewCount'),
            ('page', loop),
        ))
        root = lxml.html.fromstring(response.content.strip().decode("utf-8"))
        elements = root.xpath('//div[@class="ps-v2-results ps-v2-results-display-detail photo-grid"]/div')
        # No result rows means we ran past the last page.
        if len(elements) == 0:
            break

        # Each result row
        for row in elements:
            element = row.xpath('div/div')
            if 'spacer' in row.xpath('@class')[0]:
                continue
            caption = None
            state = None
            # Columns, in order (CSS classes ps-v2-results-col-*):
            # photo, aircraft, id-numbers, location-date, photographer,
            # plus an optional trailing caption column on some rows.
            try:
                photo, aircraft, id_number, location_date, photographer = element
            except ValueError:
                photo, aircraft, id_number, location_date, photographer, caption = element

            # Photo (thumbnail URL)
            photo = photo.xpath('div[2]/div/a/img/@src')[0].strip()

            # Aircraft (airline / type)
            try:
                aircraft = aircraft.xpath('div[2]/div/div[2]/a/text()')[0].strip()
            except IndexError:
                aircraft = None

            # Registration and MSN
            try:
                reg, msn = id_number.xpath('div[2]/div/div')
                reg = reg.xpath('a/text()')[0].strip()
                msn = msn.xpath('a/text()')[0].strip()
            except ValueError:
                # Only one id column: registration without an MSN.
                try:
                    reg = id_number.xpath('div[2]/div/div/a/text()')[0].strip()
                except IndexError:
                    reg = None
                msn = None

            # Location and date: two links (country, date)
            # or three (state, country, date) for some locations.
            city, date = location_date.xpath('div[2]/div/div')
            city = city.xpath('a/text()')[0].strip()
            try:
                country, date = date.xpath('a')
            except ValueError:
                try:
                    state, country, date = date.xpath('a')
                except ValueError:
                    state, country, date = (None, None, None)
            if country is not None:
                country = country.xpath('text()')[0].strip()
                date = date.xpath('text()')[0].strip()
            if state is not None:
                state = state.xpath('text()')[0].strip()

            # Photographer
            photographer = photographer.xpath('div[2]/div/div/div/div/div[1]/a/text()')[0].strip()

            # Caption (only present on some rows)
            if caption is not None:
                caption = caption.xpath('div[2]/text()')[0].strip()

            data.append({
                'photo': photo,
                'aircraft': aircraft,
                'reg': reg,
                'msn': msn,
                'city': city,
                'date': date,
                'country': country,
                'photographer': photographer,
                'caption': caption,
                'state': state,
            })

        print('LOOP', loop)
        loop += 1

print('Total', len(data), 'items')
df = pandas.DataFrame(data)
df.to_csv('data.csv', encoding='utf-8', index=False)
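Once the run finishes, a quick way to sanity-check the output is to read data.csv back with pandas and look at the columns the loop writes (photo, aircraft, reg, msn, city, date, country, photographer, caption, state). A minimal sketch, assuming the script above ran to completion and wrote data.csv to the working directory:

import pandas as pd

# Read back the file written by the scraper above.
df = pd.read_csv('data.csv')

# Row count and column names should match what the loop collected.
print(len(df), 'rows')
print(list(df.columns))

# Example check: the most frequently photographed registrations.
print(df['reg'].value_counts().head(10))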
Log:
LOOP 1
LOOP 2
LOOP 3
...
LOOP 55
LOOP 56
Total 2009 items
CSV: