I have a dataset containing data about failures of a control system. The data has the following structure:
TYPE OF FAILURE (string), START DATE (dd/mm/yyyy), START TIME (hh/mm/ss), DURATION (ss), LOCALIZATION (string), WORKING TEAM (A,B,C), SHIFT (morning, afternoon, night)
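The table containing the data has 555000 rows. First, I would like to analyse whether there are repeating sequences of failures related to the START DATE parameter. Basically, I want to find something like this:
Failure 1 appears on 10 March. Failure 2 appears on 15 March. There are 5 days between them. Then failure 1 appears on 10 April and failure 2 on 15 April, again 5 days apart. Then failure 1 appears on 10 May and failure 2 on 15 May, also 5 days apart. Failure 1 may also appear on other dates, but what interests me is whether there is a stronger probability that failure 2 will appear 5 days after failure 1, and that these pairs of events (F1 -> F2) recur one month apart.
I am not sure whether my explanation is clear enough. In any case, I am looking for a suitable method or algorithm that would let me extract such sequences from data structured as described above. Can you point me to some methods? Or let's simply brainstorm together :) Any help is appreciated.
PS: I plan to implement it in C# or MATLAB (depending on which suits the method). Thanks.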
Answer 0 (score: 0)
Your file looks like a large CSV. For that case MATLAB has a good implementation in its datastore:
https://es.mathworks.com/help/matlab/import_export/what-is-a-datastore.html
There are also these tools for handling large files:
https://es.mathworks.com/help/matlab/large-files-and-big-data.html
Also take a look at working with tables in MATLAB.
In your case, you could do something like this:
Sample file airlinesmall.csv (123524 rows)
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
1987,10,21,3,642,630,735,727,PS,1503,NA,53,57,NA,8,12,LAX,SJC,308,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,26,1,1021,1020,1124,1116,PS,1550,NA,63,56,NA,8,1,SJC,BUR,296,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,23,5,2055,2035,2218,2157,PS,1589,NA,83,82,NA,21,20,SAN,SMF,480,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,23,5,1332,1320,1431,1418,PS,1655,NA,59,58,NA,13,12,BUR,SJC,296,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,22,4,629,630,746,742,PS,1702,NA,77,72,NA,4,-1,SMF,LAX,373,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,28,3,1446,1343,1547,1448,PS,1729,NA,61,65,NA,59,63,LAX,SJC,308,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,8,4,928,930,1052,1049,PS,1763,NA,84,79,NA,3,-2,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,10,6,859,900,1134,1123,PS,1800,NA,155,143,NA,11,-1,SEA,LAX,954,NA,NA,0,NA,0,NA,NA,NA,NA,NA
...
With a datastore you can work with the data as tables and select only the variables you need, for example to get the mean arrival delay:
>> ds = datastore('airlinesmall.csv','TreatAsMissing','NA');
>> ds.MissingValue = 0;
>> ds.SelectedVariableNames = 'ArrDelay';
>> data = preview(ds)
data =
ArrDelay
________
8
8
21
13
4
59
3
11
>> data % this is a table
data =
ArrDelay
________
8
8
21
13
4
59
3
11
>> sums = [];
counts = [];
while hasdata(ds)
T = read(ds); % T is a table holding one chunk; the whole file is never loaded into memory at once
sums(end+1) = sum(T.ArrDelay);
counts(end+1) = length(T.ArrDelay);
end
>> avgArrivalDelay = sum(sums)/sum(counts)
avgArrivalDelay =
6.9670
Now let's use a sample of your data. Take this file:
sample.csv
TYPE OF FAILURE, START DATE, START TIME, DURATION, LOCALIZATION, WORKING TEAM, SHIFT
failure 1, 06/01/2017, 12/13/20, 300, Area 1, A, morning
failure 2, 06/01/2017, 12/13/20, 300, Area 1, A, night
failure 3, 06/01/2017, 12/13/20, 400, Area 1, A, afternoon
failure 1, 08/01/2017, 12/13/20, 300, Area 1, A, morning
failure 2, 09/01/2017, 12/13/20, 300, Area 1, A, morning
failure 3, 09/01/2017, 12/13/20, 300, Area 1, A, night
failure 3, 09/01/2017, 14/13/20, 200, Area 1, A, morning
failure 1, 10/01/2017, 12/13/20, 300, Area 1, A, morning
failure 1, 12/01/2017, 12/13/20, 300, Area 1, A, afternoon
failure 2, 12/01/2017, 12/13/20, 500, Area 1, A, morning
failure 1, 14/01/2017, 12/13/20, 300, Area 1, A, night
You can see that failure 1 occurs every two days. Let's check this:
>> ds = tabularTextDatastore('sample.csv')
Warning: Variable names were modified to make them valid MATLAB identifiers.
ds =
TabularTextDatastore with properties:
Files: {
'/home/anquegi/learn/matlab/stackoverflow/sample.csv'
}
FileEncoding: 'UTF-8'
ReadVariableNames: true
VariableNames: {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME' ... and 4 more}
Text Format Properties:
NumHeaderLines: 0
Delimiter: ','
RowDelimiter: '\r\n'
TreatAsMissing: ''
MissingValue: NaN
Advanced Text Format Properties:
TextscanFormats: {'%q', '%q', '%q' ... and 4 more}
ExponentCharacters: 'eEdD'
CommentStyle: ''
Whitespace: ' \b\t'
MultipleDelimitersAsOne: false
Properties that control the table returned by preview, read, readall:
SelectedVariableNames: {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME' ... and 4 more}
SelectedFormats: {'%q', '%q', '%q' ... and 4 more}
ReadSize: 20000 rows
>> ds.SelectedVariableNames = {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME', 'DURATION', 'LOCALIZATION', 'WORKINGTEAM', 'SHIFT'}
ds =
TabularTextDatastore with properties:
Files: {
'/home/anquegi/learn/matlab/stackoverflow/sample.csv'
}
FileEncoding: 'UTF-8'
ReadVariableNames: true
VariableNames: {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME' ... and 4 more}
Text Format Properties:
NumHeaderLines: 0
Delimiter: ','
RowDelimiter: '\r\n'
TreatAsMissing: ''
MissingValue: NaN
Advanced Text Format Properties:
TextscanFormats: {'%q', '%q', '%q' ... and 4 more}
ExponentCharacters: 'eEdD'
CommentStyle: ''
Whitespace: ' \b\t'
MultipleDelimitersAsOne: false
Properties that control the table returned by preview, read, readall:
SelectedVariableNames: {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME' ... and 4 more}
SelectedFormats: {'%q', '%q', '%q' ... and 4 more}
ReadSize: 20000 rows
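>> reset(ds)
accum = [];
while hasdata(ds)
T = read(ds);
% select the failure 1 rows of this chunk and append their dates to the running list
isF1 = strcmp(T.TYPEOFFAILURE,'failure 1');
accum = [accum; datetime(T.STARTDATE(isF1), 'InputFormat','dd/MM/yyyy')];
end
% average spacing over all chunks, computed once after the loop
mean(diff(sort(accum)))
ans =
48:00:00
% exactly every 48 hours; from here you can try any further analysis you want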
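To get closer to the F1 -> F2 question, the same idea can be extended from one failure type to a pair. The following is only a minimal sketch, not a definitive method: the `types` and `dates` variables are hypothetical in-memory data built from the dates in the question (the year 2017 is an assumption), and the variable names `lags`, `lagCounts` and `f1Spacing` are illustrative. In practice the two columns would be accumulated from the datastore as in the loop above.

```matlab
% Hypothetical sample taken from the dates in the question (year assumed).
types = {'failure 1'; 'failure 2'; 'failure 1'; 'failure 2'; 'failure 1'; 'failure 2'};
dates = datetime({'10/03/2017'; '15/03/2017'; '10/04/2017'; '15/04/2017'; ...
                  '10/05/2017'; '15/05/2017'}, 'InputFormat', 'dd/MM/yyyy');

% Split the events by type and order them in time.
f1 = sort(dates(strcmp(types, 'failure 1')));
f2 = sort(dates(strcmp(types, 'failure 2')));

% For every failure 1, lag in days to the first failure 2 on or after it.
lags = nan(numel(f1), 1);
for k = 1:numel(f1)
    idx = find(f2 >= f1(k), 1, 'first');
    if ~isempty(idx)
        lags(k) = days(f2(idx) - f1(k));
    end
end

% Count how often each lag value occurs; a dominant value hints at a
% recurring F1 -> F2 delay.
validLags = lags(~isnan(lags));
[lagValues, ~, grp] = unique(validLags);
lagCounts = accumarray(grp, 1);

% Spacing in days between consecutive failure 1 events (the "one month" part).
f1Spacing = days(diff(f1));
```

On this hypothetical sample every lag is 5 days, and the failure 1 events are about a month apart. On the real table the lag counts will be spread out, so it would make sense to compare the frequency of the dominant lag against what random timing would produce before calling it a pattern.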