Alternative to an inner join for filtering a DataTable

Date: 2016-03-14 15:36:46

Tags: c# linq

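I have a DataTable, and for each day I want to select the first entry at which all curveIDs are present. The only way I could think of is a join, because a join only matches rows where both data sets have an entry.

This is what I have so far: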

//core data from sql (I have little control over this)
DataTable ds = new DataTable(); 
da.Fill(ds);

//creating dataset with various tables based on curveIDs I look for
System.Data.DataSet dataSet = new System.Data.DataSet();
            for (int i = 0; i < curveIds.Length; i++)
            {
                    dataSet.Tables.Add(ds.AsEnumerable().Where(x => x.Field<short>("curveID") == curveIds[i]).CopyToDataTable());
            }

//lets say I have two only and then I join them like this to match timestamps correctly
var result = from table1 in dataSet.Tables[0].AsEnumerable()
                         join table2 in dataSet.Tables[1].AsEnumerable() 
                         on table1["Timestamp"] equals table2["Timestamp"]
                         select new
                         {
                             Timestamp = (DateTime)table1["Timestamp"],
                             Spread = (double)table1["mid"] - 0.4 * (double)table2["mid"],
                             Power = (double)table1["mid"]
                         };

//lastly I do a firstordefault over the data as I only want the first timestamp where both are present (this step doesnt return the correct data)
var endres = result.OrderBy(a => a.Timestamp).GroupBy(a => a.Timestamp.ToShortDateString()).FirstOrDefault().ToList();

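This seems overly complicated. The last step also does not return one record per day at the earliest time in the morning; instead it returns many records within a single day.

In the full problem I have to do this for 4-6 curveIDs, which means I would need a variable number of joins, and that makes this approach unworkable.

The source data has the columns Timestamp, CurveID, and Mid, with one row per minute between 8 am and 4 pm on weekdays, but there is no guarantee that every curveID is actually present for every timestamp.

Let's say that on day 1 all IDs are present at 8:01 (the first time this is true, though not the only one), and on day 2 all IDs are first present at 8:03. The returned data should then be: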

Day1 8:01, spread =x, Power=y
Day2 8:03, spread =z, Power=a
...

...and so on, with only one entry per day: the first one at which all IDs are present.

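3 answers:

Answer 0 (score: 1)

If I understand correctly, you want to find, for each day, the lowest timestamp (in the DataTable you have) that contains all the "curveIDs" in your curveID list?

If so, I wrote some code that might solve it. If there are mistakes, please let me know in the comments. Working with lists is easier to follow than setting up several DataTables, so I just used your "ds" DataTable and built an independent example.

Further optimizations are possible, but they would make the code harder to understand.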

DataTable ds = new DataTable();
List<int> curveIds = new List<int>() {1,2,3,4};

public void Test()
{
    LoadDs();

    List<object> endress = new List<object>();

    //filter all timestamps, getting only the date info
    var timeStamps = ds.AsEnumerable().Select(r=> ((DateTime)r["Timestamp"]).Date).Distinct();

    //for each distinct date
    foreach (var timeStamp in timeStamps)
    {
        //find all the same timestamp (on the same day)
        var listSameTimestamp = ds.AsEnumerable().Where(r => ((DateTime)r["Timestamp"]).Date == timeStamp);

        var listIds = listSameTimestamp.Select(r => (int)r["curveID"]).Distinct();

        //ensure they all have the curveIDs you are looking for
        var haveThemAll = curveIds.Intersect(listIds).Count() == curveIds.Count();

        if (haveThemAll == false)
            continue;

        //find the lowest timestamp
        var rowFound = listSameTimestamp.OrderBy(r => (DateTime)r["Timestamp"]).FirstOrDefault();
        if (rowFound == null)
            continue;

        //create an anonymous object (could not understand your needs)
        endress.Add(new
        {
            Timestamp = (DateTime)rowFound["Timestamp"],
            Spread = (double)rowFound["mid"] - 0.4 * (double)rowFound["mid"],
            Power = (double)rowFound["mid"]
        });                   
    }


    foreach (var o in endress)
    {
        Console.WriteLine(o);
    }
}

public void LoadDs()
{
    ds = new DataTable();
    ds.Columns.Add("curveID",typeof(int));
    ds.Columns.Add("Timestamp", typeof(DateTime));
    ds.Columns.Add("mid", typeof(double));

    for (int i = 0; i < 50000; i++)
    {
        Random rand = new Random(i);
        var row = ds.NewRow();
        row["curveID"] = rand.Next(1,5);
        row["Timestamp"] = new  DateTime(2016,4, rand.Next(1,5), rand.Next(1,3), 0,0);
        row["mid"] = rand.NextDouble();

        ds.Rows.Add(row);
    }
}

This is the "main" part; the original answer also linked to the complete test code.
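The loop above checks that all curve IDs appear somewhere within a day. If the IDs have to be present at the same timestamp, as the question's example suggests, the per-day check could be narrowed to a per-timestamp check. The following is only a sketch under that assumption; `DayScan`, `FirstCompleteStamp`, and the tuple shape are hypothetical names introduced for illustration and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DayScan
{
    // Hypothetical helper: given the rows of ONE day, return the earliest
    // timestamp at which every required curve ID is present, or null if
    // no timestamp on that day is complete.
    public static DateTime? FirstCompleteStamp(
        IEnumerable<(DateTime Timestamp, int CurveId)> dayRows,
        ICollection<int> curveIds)
    {
        var required = new HashSet<int>(curveIds);

        return dayRows
            .GroupBy(r => r.Timestamp)                                  // one group per exact timestamp
            .Where(g => required.IsSubsetOf(g.Select(r => r.CurveId)))  // keep only complete timestamps
            .OrderBy(g => g.Key)                                        // earliest first
            .Select(g => (DateTime?)g.Key)
            .FirstOrDefault();                                          // null when nothing is complete
    }
}
```

Inside the `foreach`, the rows in `listSameTimestamp` could be projected into such tuples and passed to this helper; the rows at the returned timestamp are then the ones to use for the calculation.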

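Answer 1 (score: 1)

If I understand correctly:
 1. You have a table with timestamp, curveid, and mid columns.
 2. The timestamps are (at least usually) one per minute, and not every curve is guaranteed to be present.
 3. You want to compute the spread and power from the rows at the first timestamp where all required curves are present.

I'd suggest something like this: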

// I'll pretend the curveids are in this list...
List<short> curveids = new List<short>();

// (assumes ds is a DataSet; if ds is already a DataTable, use it directly)
DataTable table = ds.Tables["Your table"];

// setup mindate and maxdate of your choosing, e.g. the bounds of one day...
DateTime mindate = DateTime.MinValue;
DateTime maxdate = DateTime.MaxValue;

// first get a grouping of timestamps for the day containing the curveids
var grouping = table.AsEnumerable()
    .Where(x => curveids.Contains(x.Field<short>("curveID")) &&
                x.Field<DateTime>("Timestamp") > mindate &&
                x.Field<DateTime>("Timestamp") < maxdate)
    .GroupBy(x => x.Field<DateTime>("Timestamp"));
// this gives an IEnumerable<IGrouping<DateTime, DataRow>>
// i.e. timestamps, and group of rows for each with curveids in your selection

// Now get the minimum timestamp, where all curve ids are present..
// (Min() throws if no timestamp has all of them)
DateTime minTimestamp = grouping
    .Where(x => x.Select(y => y.Field<short>("curveID")).Distinct().Count() == curveids.Count)
    .Select(x => x.Key).Min();

// .. now can do what you wish with that...
// For example:
var resultRows = table.AsEnumerable().Where(x =>
                    x.Field<DateTime>("Timestamp") == minTimestamp &&
                    curveids.Contains(x.Field<short>("curveID")));

Now you can use resultRows to compute the spread, power, and so on according to your formula.
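As a rough illustration of that last step, the mid values at the chosen timestamp can be keyed by curve ID and then fed into the formula from the question (`Spread = mid1 - 0.4 * mid2`, `Power = mid1`). This is a sketch only: `SpreadCalc`, `Compute`, `powerId`, and `otherId` are hypothetical names, and which curve plays which role is an assumption taken from the two-table example in the question.

```csharp
using System;
using System.Collections.Generic;

public static class SpreadCalc
{
    // mids maps curve ID -> mid value for a single timestamp.
    // powerId / otherId say which curves enter the formula (assumed roles).
    public static (double Spread, double Power) Compute(
        IReadOnlyDictionary<short, double> mids, short powerId, short otherId)
    {
        double power = mids[powerId];                 // throws if the curve is missing
        double spread = power - 0.4 * mids[otherId];  // formula from the question
        return (spread, power);
    }
}
```

The dictionary could be built from the rows with `resultRows.ToDictionary(r => r.Field<short>("curveID"), r => r.Field<double>("mid"))`, keeping in mind that `ToDictionary` throws if the same curve ID appears twice at that timestamp.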

Answer 2 (score: 0)

Here is my take:

    //selecting into an object for better readability and access
    var result = dt.AsEnumerable().Select(r => new
    {
        TimeStamp = r.Field<DateTime>("TimeStamp"),
        CurveID = r.Field<short>("CurveId"),
        Mid = r.Field<double>("Mid")
    })
    // ignoring rows with different curve ID than in the list
    .Where(item => ids.Contains(item.CurveID))
    // grouping by timestamp
    .GroupBy(item => item.TimeStamp)
    // selecting only groups that have all curve Ids
    .Where(g => g.Select(i=>i.CurveID).Distinct().Count() == ids.Count)
    // grouping the groups by date
    .GroupBy(g => g.Key.Date)
    .Select(g2 =>
    {
        // getting the first timestamp group by timestamp
        var min = g2.OrderBy(i => i.Key).First();
        // getting all the Mid values
        var values = min.Select(i => i.Mid).ToList();
        // returning the desired computation
        return new
        {
            TimeStamp = min.Key,
            Spread = spread(values),
            Power = power(values)
        };
    })
    .ToList();

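My assumptions, based on the question text and the existing comments, are:

  • Timestamps must match exactly for records to count toward having all curve IDs
  • Records whose curve ID is not in the list are ignored
  • Only the Mid values at the smallest timestamp of each day that has all curve IDs matter for the final result

I must add that this is not the most efficient approach, since it makes several passes over the data: first filtering by the curve IDs in the list, then grouping by timestamp, then filtering for groups that contain every ID, then grouping by date, and finally picking the first timestamp of each day. A faster but less readable implementation would sort first and then pass through each item only once.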
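The sort-then-scan idea might look roughly like the sketch below. It is not the answerer's code: `Row`, `FirstCompleteTimestamp`, and `FirstCompletePerDay` are hypothetical names, and a plain class stands in for the DataTable rows so the example stays self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FirstCompleteTimestamp
{
    // Hypothetical row shape mirroring the Timestamp / CurveID / Mid columns.
    public sealed class Row
    {
        public DateTime Timestamp;
        public short CurveId;
        public double Mid;
    }

    // For each day, returns the rows of the earliest timestamp at which
    // every ID in requiredIds is present.
    public static List<List<Row>> FirstCompletePerDay(
        IEnumerable<Row> rows, ICollection<short> requiredIds)
    {
        var required = new HashSet<short>(requiredIds);
        var results = new List<List<Row>>();

        var current = new List<Row>();    // rows of the timestamp being scanned
        var seen = new HashSet<short>();  // distinct required IDs seen at that timestamp
        DateTime? currentStamp = null;
        DateTime? lastDayDone = null;     // day that already produced a result

        // Close the current timestamp group and keep it if it is complete
        // and its day has not produced a result yet.
        void Flush()
        {
            if (currentStamp.HasValue
                && seen.Count == required.Count
                && lastDayDone != currentStamp.Value.Date)
            {
                results.Add(current);
                lastDayDone = currentStamp.Value.Date;
            }
            current = new List<Row>();
            seen.Clear();
        }

        // One sort, then a single pass over the remaining rows.
        foreach (var row in rows.Where(r => required.Contains(r.CurveId))
                                .OrderBy(r => r.Timestamp))
        {
            if (currentStamp != row.Timestamp)
            {
                Flush();
                currentStamp = row.Timestamp;
            }
            current.Add(row);
            seen.Add(row.CurveId);
        }
        Flush(); // the last group is still pending after the loop

        return results;
    }
}
```

Each inner list holds the rows for one day's first complete timestamp, so the spread and power can be computed from it in the same way as in the query above.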