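As part of my thesis, I built a web crawler. It runs every minute and collects data, roughly 50-200 rows per iteration.
I can easily serialize the data with Parquet.NET, but appending to it seems to be impossible.
The first serialization is done in the crawler; the append is done in the Serializator class: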
public static void SaveTripsToShema(IEnumerable<SimpleTripData> data)
{
    var ct = new DataField<long>("CurrentTime");
    var lat = new DataField<double>("Latitude");
    var lng = new DataField<double>("Longitude");
    var route = new DataField<string>("RouteID");
    var trip = new DataField<string>("TripID");
    var veichle = new DataField<string>("VeichleID");
    var model = new DataField<string>("Model");
    var status = new DataField<string>("Status");
    var ms = new MemoryStream();
    ms.Position = 0;
    using (var writer = new ParquetWriter(new Schema(ct, lat, lng, route, trip, veichle, model, status), ms, append:true))
    {
        using (ParquetRowGroupWriter rg = writer.CreateRowGroup())
        {
            rg.WriteColumn(new DataColumn(ct, data.Select((x) => x.CurrentTime).ToArray<long>()));
            rg.WriteColumn(new DataColumn(lat, data.Select((x) => x.Latitude).ToArray<double>()));
            rg.WriteColumn(new DataColumn(lng, data.Select((x) => x.Longitude).ToArray<double>()));
            rg.WriteColumn(new DataColumn(route, data.Select((x) => x.RouteID).ToArray<string>()));
            rg.WriteColumn(new DataColumn(trip, data.Select((x) => x.TripID).ToArray<string>()));
            rg.WriteColumn(new DataColumn(veichle, data.Select((x) => x.VeichleID).ToArray<string>()));
            rg.WriteColumn(new DataColumn(model, data.Select((x) => x.Model).ToArray<string>()));
            rg.WriteColumn(new DataColumn(status, data.Select((x) => x.Status).ToArray<string>()));
        }
    }
    ms.CopyTo(File.OpenWrite(trip_path));
}
When the crawler runs, this method throws the following exception:
System.IO.IOException: 'An attempt was made to move the position before the beginning of the stream.'
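For anyone trying to reproduce this, here is a minimal sketch that uses only System.IO, not Parquet.NET. It rests on an assumption: that append mode first reads the footer of an existing Parquet file by seeking backwards from the end of the stream (a Parquet file ends with a 4-byte footer length followed by the 4-byte magic "PAR1"). If so, a freshly created, empty MemoryStream would fail in exactly this way. The class and method names (StreamChecks, SeekFromEndFails, CopyAfterWrite) are hypothetical and exist only for this illustration. The second helper shows a separate pitfall in the code above: CopyTo copies from the current position, so a stream that has just been written to copies nothing unless it is rewound first.

```csharp
using System;
using System.IO;

public static class StreamChecks
{
    // Returns true when seeking `bytesBack` bytes backwards from the end
    // of the stream throws an IOException (the stream is too short).
    public static bool SeekFromEndFails(Stream stream, long bytesBack)
    {
        try
        {
            stream.Seek(-bytesBack, SeekOrigin.End);
            return false;
        }
        catch (IOException)
        {
            return true;
        }
    }

    // Writes `payload` into a MemoryStream, optionally rewinds it, then
    // copies it into a second stream and returns how many bytes arrived.
    public static long CopyAfterWrite(byte[] payload, bool rewind)
    {
        using var source = new MemoryStream();
        using var target = new MemoryStream();
        source.Write(payload, 0, payload.Length);
        if (rewind)
        {
            source.Position = 0;
        }
        source.CopyTo(target);
        return target.Length;
    }

    public static void Main()
    {
        // Empty stream: there is no 8-byte footer to seek back to.
        Console.WriteLine(SeekFromEndFails(new MemoryStream(), 8));             // True
        // Stream with 16 bytes of content: the seek succeeds.
        Console.WriteLine(SeekFromEndFails(new MemoryStream(new byte[16]), 8)); // False
        // Position is at the end after writing, so nothing is copied.
        Console.WriteLine(CopyAfterWrite(new byte[4], rewind: false));          // 0
        // Rewinding first copies all four bytes.
        Console.WriteLine(CopyAfterWrite(new byte[4], rewind: true));           // 4
    }
}
```

If the assumption holds, append: true can only work on a seekable stream that already contains a complete Parquet file, so one direction worth trying is to open the target file itself for reading and writing and to request append only when the file is not empty.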