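I am working on a project involving a flight dataset. I have a data frame in the following format: it contains the flight number, carrier name, origin, destination, and the carrier delay, weather delay, NAS delay, security delay, and late aircraft delay, all in minutes.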
FL_NUM  CARRIER  ORIGIN  DEST  carr_del  weather_del  nas_del  sec_del  aircraft_del
1       AA       JFK     LAX   0         0            0        0        0
1       AS       DCA     SEA   0         0            0        0        0
1       B6       JFK     FLL   12        0            12       0        0
1       HA       LAX     HNL   405       0            5        0        0
1       VX       SFO     DCA   24        20           50       0        0
1       WN       ATL     MDW   0         0            0        0        0
1       WN       DAL     HOU   27        0            0        0        0
I built the following relationships in Neo4j using a Cypher query:
MERGE (origin:origin_airport {name: row.ORIGIN})
MERGE (destination:dest_airport {name: row.DEST})
MERGE (carrier:Carrier {name: row.UNIQUE_CARRIER})
MERGE (flight:Flight {name: row.FL_NUM})
MERGE (flight)-[:from {flnum: row.FL_NUM}]->(origin)
MERGE (flight)-[:to {flnum: row.FL_NUM}]->(destination)
MERGE (flight)-[:operated_by {carrier: row.UNIQUE_CARRIER}]->(carrier)
MERGE (origin)-[r:delayed_by]->(destination)
SET r.carr_delay=row.carr_delay, r.weather_delay=row.weather_delay,
r.nas_delay=row.nas_delay, r.sec_delay=row.sec_delay,
r.aircraft_delay=row.aircraft_delay
MERGE (flight)-[r1:delayed_by]->(origin)
SET r1.carr_delay=row.carr_delay, r1.weather_delay=row.weather_delay,
r1.nas_delay=row.nas_delay, r1.sec_delay=row.sec_delay,
r1.aircraft_delay=row.aircraft_delay
The relationships are:
1) Flight number linked to origin airport(ORIGIN)
2) Flight number linked to destination airport(DEST)
3) Flight number linked to Unique carrier
4) Origin airport linked by delay to destination airport.
Delay parameter holds the value of carrier delay, weather delay, nas delay,
security and late aircraft delay
5) Flight linked by delay to origin airport
Here again, delay parameter holds the value of carrier delay, weather delay,
nas delay, security and late aircraft delay
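Here, I want to answer the question: for the top 10 carriers, what is the leading type of delay?
I used the following code to get the top 10 carriers by number of flights.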
MATCH (f:Flight)-[:operated_by]->(c:Carrier)
WITH c, COUNT(f) AS flights
RETURN c.name,flights
ORDER BY flights DESC
LIMIT 10
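I need to take this to the next step and calculate the largest delay associated with each carrier. The delay values are given in minutes, and my query needs to work out which delay has the highest value and return the name of that delay for that particular carrier.
In the example, notice that for HA, carr_del has the highest value, so the output should be: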
Carrier  Cause of delay
HA       Carrier delay
VX       NAS delay
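Is this possible with a Cypher query in Neo4j? Or should I change the structure of the relationships?
If the above result is too complicated, is it possible to at least get the top carriers for any one specific delay, say carrier delay? Here, carrier delay has a value for every carrier, and the query should return the carriers ranked by the highest value. I know it starts something like the following, but I don't know how to finish it.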
MATCH (c)<-[:operated_by]-(:Flight)-[r1:DELAYED_BY]
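Can someone help me?
Answer 0 (score: 1)
1) I think your model has errors (it keeps redundant data and loses the information about a flight as operated by a specific carrier). It should look like this: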
MERGE (carrier:Carrier {name: row.UNIQUE_CARRIER})
MERGE (flight:Flight {name: row.FL_NUM})
MERGE (destination:Airport {name: row.DEST})
MERGE (origin:Airport {name: row.ORIGIN})
MERGE (origin)-[:from]->(flight)-[:to]->(destination)
MERGE (flight)-[:flight_details]->
  // Stores the details of this flight as operated by a specific carrier
(:FlightByCarrierDetails {
name: 'Detail of ' + flight.name + ' by ' + carrier.name,
carr_del: row.carr_del, weather_del: row.weather_del,
nas_del: row.nas_del, sec_del: row.sec_del, aircraft_del: row.aircraft_del})
-[:operated_by]->(carrier)
2) Then your first query becomes:
MATCH (f:Flight)
-[:flight_details]->(:FlightByCarrierDetails)
-[:operated_by]->(c:Carrier)
RETURN c.name as `Carrier name`, COUNT(f) AS flights
ORDER BY flights DESC LIMIT 10
3) And the search for the leading cause of delay is:
MATCH (f:Flight)
-[:flight_details]->(d:FlightByCarrierDetails)
      -[:operated_by]->(c:Carrier)
WITH c,
// reasons of delay
{carr: SUM(d.carr_del), weather: SUM(d.weather_del),
nas: SUM(d.nas_del), sec: SUM(d.sec_del),
aircraft: SUM(d.aircraft_del)} as rD
UNWIND [rD.carr, rD.weather, rD.nas, rD.sec, rD.aircraft] as delay
WITH c, rD, max(delay) as mD
RETURN c.name as `Carrier name`,
REDUCE ( acc=0, r in keys(rD) | acc + rD[r] ) as `Total delay`,
       // List comprehension instead of FILTER(), which was removed in Neo4j 4.0
       [r in keys(rD) WHERE rD[r]>=mD] as `Cause of delay`
ORDER BY `Total delay` DESC
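For the simpler fallback asked about in the question (the top carriers for one specific delay, say carrier delay), a minimal sketch under the same model could look like the following. It assumes the `FlightByCarrierDetails` nodes from step 1 exist and that `carr_del` is stored as a number; the column alias `Total carrier delay` is only illustrative.

```cypher
// Sketch only: relies on the FlightByCarrierDetails model from step 1
// and on carr_del being stored as a numeric number of minutes.
MATCH (:Flight)
      -[:flight_details]->(d:FlightByCarrierDetails)
      -[:operated_by]->(c:Carrier)
// Sum the carrier delay over every flight detail of each carrier
RETURN c.name as `Carrier name`,
       SUM(d.carr_del) as `Total carrier delay`
ORDER BY `Total carrier delay` DESC
LIMIT 10
```

If the rows were imported with LOAD CSV, the values arrive as strings, so the sum would probably need a conversion such as `SUM(toInteger(d.carr_del))`. Swapping `carr_del` for `weather_del`, `nas_del`, `sec_del` or `aircraft_del` ranks carriers by any of the other delay types.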