Loading a large dataset into crossfilter / dc.js

Time: 2014-03-10 13:54:19

Tags: javascript json d3.js crossfilter dc.js

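I built a crossfilter with several dimensions and groups to display the data visually using dc.js. The data being visualized is bike trip data, and every trip is loaded. Right now there are over 750,000 records. The JSON file I'm using is 70 MB, and it will only grow as I receive more data in the coming months.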

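So my question is: how can I make the data leaner so that it scales well? Right now it takes about 15 seconds to load on my internet connection, but I'm worried that it will take too long once I have too much data. I also tried to show a progress bar or spinner while the data loads, without success.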

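The columns I need are start_date, start_time, usertype, gender, tripduration, meters, age. I have shortened these fields in my JSON to start_date, start_time, u, g, dur, m, age so the file is smaller. At the top of the crossfilter there is a line chart showing the total number of trips per day. Below that are row charts for the day of the week (calculated from the data) and the month (also calculated), and pie charts for usertype, gender, and age. Below those are two bar charts for start_time (rounded down to the hour) and tripduration (rounded up to the minute).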

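The project is on GitHub: https://github.com/shaunjacobsen/divvy_explorer (the dataset is in data2.json). I tried to create a jsfiddle, but it isn't working (probably because of the data, even with only 1,000 rows loaded into the HTML inside <pre> tags): http://jsfiddle.net/QLCS2/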

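Ideally it would work so that only the data for the top chart loads first: that would load quickly, since it is just a count of trips per day. Once the user moves on to the other charts, though, progressively more data is needed to drill down into the details. Any ideas on how to get this to work?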

3 answers:

Answer 0: (score: 8)

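I'd suggest shortening all of the field names in your JSON to one character (including "start_date" and "start_time"). That should help a bit. Also, make sure compression is turned on on your server. That way the data sent to the browser is compressed automatically in transit, which can speed things up if it isn't already enabled.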
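As a rough sketch of the renaming idea: the one-character keys below (d, t, u, g, r, m, a) and the sample record are made up for illustration and are not taken from the project. A small lookup table can restore readable names after the file is parsed, so the charting code stays legible.

```javascript
// Hypothetical one-character keys; the actual letters are an assumption.
const KEY_MAP = {
  d: 'start_date',
  t: 'start_time',
  u: 'usertype',
  g: 'gender',
  r: 'tripduration',
  m: 'meters',
  a: 'age'
};

// Expand one compact record into a record with descriptive field names.
function expandRecord(compact) {
  const full = {};
  for (const shortKey of Object.keys(compact)) {
    // Keys that are not in the map are passed through unchanged.
    full[KEY_MAP[shortKey] || shortKey] = compact[shortKey];
  }
  return full;
}

// Illustrative record only, not real trip data.
const sample = { d: '2014-03-01', t: '13:54', u: 'Subscriber', g: 'M', r: 12, m: 2400, a: 31 };
console.log(expandRecord(sample));
```

Expanding every record costs time and memory, so with a very large file it may be preferable to skip this step and read the short keys directly in the dimension accessors.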

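For better responsiveness, I'd also suggest first setting up your Crossfilter (empty), all of your dimensions and groups, and all of your dc.js charts, and then using Crossfilter.add() to add more data to the Crossfilter in chunks. The easiest way to do this is to divide your data into bite-sized chunks (a few MB each) and load them one after another. So if you are using d3.json, start the next file load in the callback of the previous file load. This results in a bunch of nested callbacks, which is a bit nasty, but it should allow the user interface to stay responsive while the data is loading.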
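A minimal sketch of the serial loading described above. The helper name loadChunks, the file names, and the stub fetcher are all hypothetical. The fetcher is assumed to take (url, callback) and call callback(error, data), which is the shape of d3.json in d3 v3; newer d3 versions return a promise instead, so this would need adapting there. A small recursive helper stands in for the nested callbacks.

```javascript
// Load chunk files one after another; each load starts in the previous callback.
// fetchJson(url, callback) must call callback(error, data).
// onChunk receives each parsed chunk; onDone is called once at the end or on error.
function loadChunks(urls, fetchJson, onChunk, onDone) {
  let index = 0;

  function next() {
    if (index >= urls.length) {
      if (onDone) onDone(null);
      return;
    }
    const url = urls[index];
    index += 1;
    fetchJson(url, function (error, rows) {
      if (error) {
        if (onDone) onDone(error);
        return;
      }
      onChunk(rows);
      next();
    });
  }

  next();
}

// Stand-ins so the sketch runs without a server or any library.
const fakeFiles = {
  'trips-1.json': [{ dur: 5 }, { dur: 9 }],
  'trips-2.json': [{ dur: 12 }]
};
const fakeFetch = (url, cb) => cb(null, fakeFiles[url]);
const store = [];

loadChunks(
  Object.keys(fakeFiles),
  fakeFetch,
  rows => store.push(...rows),
  err => console.log(err ? 'failed' : 'loaded ' + store.length + ' rows') // logs "loaded 3 rows"
);
```

In a real page the idea would be to pass d3.json as the fetcher and, inside onChunk, call add(rows) on the crossfilter instance followed by dc.redrawAll(), so the charts fill in as each chunk arrives.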

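Finally, with this much data I believe you will run into performance problems in the browser, not just while loading the data. I suspect you are already seeing this, and that the 15-second pause you see is at least partly spent in the browser. You can check by profiling in your browser's developer tools. To address it, you'll need to profile, identify the performance bottlenecks, and then try to optimize them. Also, be sure to test on slower computers if those are among your audience.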

Answer 1: (score: 2)

Consider my class design. It doesn't match yours, but it illustrates my point.

public class MyDataModel
{
    public List<MyDatum> Data { get; set; }
}

public class MyDatum
{
    public long StartDate { get; set; }
    public long EndDate { get; set; }
    public int Duration { get; set; }
    public string Title { get; set; }
}

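The start and end dates are Unix timestamps, and the duration is in seconds.

Serialized to:

    {"Data":
    [{"StartDate":1441256019,"EndDate":1441257181,"Duration":451,"Title":"Rad is a cool word."},...]}

One row of data is 92 characters.

Let's start compressing! Convert the dates and times to base-60 strings. Store everything in an array of arrays of strings.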

public class MyDataModel
{
    public List<List<string>> Data { get; set; }
}

Serialized to:

    {"Data":[["1pCSrd","1pCTD1","7V","Rad is a cool word."],...]}

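One row of data is now 47 characters. moment.js is a good library for working with dates and times. It has built-in functionality to unpack the base-60 format.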
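For readers who want to see the base-60 step without a library, here is a minimal sketch. The digit alphabet (0-9, then A-Z, then a-x) is inferred from the sample values in this answer, since it reproduces them; whether any particular library uses the same digit order is not assumed. The function names are hypothetical.

```javascript
// Digit alphabet inferred from the sample values in the answer: 0-9, A-Z, a-x.
const DIGITS = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwx';

// Encode a non-negative integer as a base-60 string.
function toBase60(n) {
  if (!Number.isInteger(n) || n < 0) {
    throw new RangeError('expected a non-negative integer');
  }
  if (n === 0) return '0';
  let out = '';
  while (n > 0) {
    out = DIGITS[n % 60] + out;
    n = Math.floor(n / 60);
  }
  return out;
}

// Decode a base-60 string back to an integer.
function fromBase60(s) {
  let n = 0;
  for (const ch of s) {
    const digit = DIGITS.indexOf(ch);
    if (digit < 0) throw new RangeError('invalid base-60 digit: ' + ch);
    n = n * 60 + digit;
  }
  return n;
}

console.log(toBase60(1441256019)); // "1pCSrd"
console.log(toBase60(1441257181)); // "1pCTD1"
console.log(fromBase60('7V'));     // 451
```

A ten-digit Unix timestamp shrinks to six characters this way, which is where most of the saving per row comes from.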

Using an array of arrays makes the code less readable, so add comments to document it.

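Load only the most recent 90 days. Zoom to 30 days. When the user drags the brush on the range chart to the left, start fetching more data in 90-day chunks until the user stops dragging. Use the add method to add the data to the existing crossfilter.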
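One way the windowing could be tracked, as a sketch only. The names previousWindow, createWindowLoader, and fetchRange are hypothetical, and the way the handler gets wired to the range chart's brush or filter event is left out because it depends on the chart setup.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Given the earliest timestamp already loaded, return the next older window.
function previousWindow(earliestLoadedMs, days = 90) {
  return { from: earliestLoadedMs - days * DAY_MS, to: earliestLoadedMs };
}

// Remember what has been loaded and request older windows only when the
// left edge of the brush moves past the loaded range.
function createWindowLoader(initialEarliestMs, fetchRange) {
  let earliest = initialEarliestMs;
  return function onBrush(brushStartMs) {
    const requested = [];
    while (brushStartMs < earliest) {
      const win = previousWindow(earliest);
      fetchRange(win.from, win.to); // caller fetches the rows and adds them to the crossfilter
      requested.push(win);
      earliest = win.from;
    }
    return requested;
  };
}

// Demo: data is loaded back to 2014-01-01 and the brush is dragged
// 100 days further left, so two 90-day windows are requested.
const loadedFrom = Date.UTC(2014, 0, 1);
const onBrush = createWindowLoader(loadedFrom, (from, to) => {
  console.log('fetch', new Date(from).toISOString(), 'to', new Date(to).toISOString());
});
onBrush(loadedFrom - 100 * DAY_MS);
```

Calling the handler again with a brush position inside the loaded range requests nothing, so repeated brush events do not trigger duplicate downloads.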

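As you add more and more data, you will notice that your charts become less and less responsive. That's because you are rendering hundreds or even thousands of elements in SVG, and the browser is getting crushed. Use the d3 quantize function to group the data points into buckets. Reduce the displayed data to 50 buckets.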

Quantizing is worth the effort, and it is the only way you can create a scalable chart with a continuously growing dataset.
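To make the bucketing concrete, here is a library-free sketch using equal-width buckets, which is the same idea a d3 quantize scale applies (d3.scale.quantize in d3 v3, d3.scaleQuantize in later versions). The function names and the demo values are hypothetical.

```javascript
// Assign a value in [min, max] to one of `bucketCount` equal-width buckets.
function bucketIndex(value, min, max, bucketCount) {
  if (max <= min) return 0;
  const ratio = (value - min) / (max - min);
  const index = Math.floor(ratio * bucketCount);
  // Clamp so that value === max falls into the last bucket.
  return Math.max(0, Math.min(bucketCount - 1, index));
}

// Count how many values land in each bucket.
function histogram(values, bucketCount = 50) {
  let min = Infinity;
  let max = -Infinity;
  // A plain loop avoids spreading a very large array into Math.min/Math.max.
  for (const v of values) {
    if (v < min) min = v;
    if (v > max) max = v;
  }
  const counts = new Array(bucketCount).fill(0);
  for (const v of values) {
    counts[bucketIndex(v, min, max, bucketCount)] += 1;
  }
  return counts;
}

// Demo: 1,000 made-up durations collapse into 50 counts.
const durations = Array.from({ length: 1000 }, (_, i) => i);
const counts = histogram(durations);
console.log(counts.length); // 50
```

However many records are loaded, the chart then draws at most 50 elements.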

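Your other option is to abandon the range chart and group the data by month, day, and hour. Then add a date range picker. Since your data would be grouped by month, day, and hour, you will find that even if you rode your bike every hour of every day, your result set would never be larger than 8766 rows.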
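The 8766 figure is the number of hours in an average year (365.25 × 24), so it bounds one year of hourly rows. A sketch of the hourly pre-aggregation, with a hypothetical function name and Unix timestamps in seconds assumed as input:

```javascript
const HOUR_SECONDS = 3600;

// Collapse individual trips into one row per hour: { hourStart, trips }.
function aggregateByHour(startTimestamps) {
  const counts = new Map();
  for (const ts of startTimestamps) {
    // Round each timestamp down to the start of its hour.
    const hourStart = Math.floor(ts / HOUR_SECONDS) * HOUR_SECONDS;
    counts.set(hourStart, (counts.get(hourStart) || 0) + 1);
  }
  return Array.from(counts, ([hourStart, trips]) => ({ hourStart, trips }))
    .sort((a, b) => a.hourStart - b.hourStart);
}

// Two trips share an hour, the third falls in the next one.
console.log(aggregateByHour([1441256019, 1441256399, 1441257181]));
// [ { hourStart: 1441252800, trips: 2 }, { hourStart: 1441256400, trips: 1 } ]

console.log(365.25 * 24); // 8766
```

This kind of aggregation would normally run on the server, so the browser only ever receives the hourly rows.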

Answer 2: (score: 1)

I have seen similar data problems (working at an enterprise company), and I found a few ideas worth trying.

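  1. Your data has a regular structure, so you can put the keys in the first row and only the values in the following rows, mimicking CSV (header first, then data).
  2. Date-times can be changed to epoch numbers (you can also move the start of the epoch to 2015-01-01 and compute the real time on receipt).
  3. Use oboe.js (http://oboejs.com/) when fetching the response from the server; since the dataset is large, consider using oboe.drop during loading.
  4. Update the visualization with a JavaScript timer.
  5. Timer sample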

    var datacnt = 0;

    // Refresh the on-page counter every 300 ms while records stream in.
    var timerId = setInterval(function () {
        d3.select("#count-data-current").text(datacnt);
        // Updating the visualization should go here, something like dc.redrawAll()...
    }, 300);

    oboe("relative-or-absolute path to your data(ajax)")
        .node('CNT', function (count, path) {
            // The total record count comes first in the stream.
            d3.select("#count-data-all").text("Expecting " + count + " records");
            return oboe.drop;
        })
        .node('data.*', function (record, path) {
            // Count each record, then drop it from oboe's in-memory result.
            datacnt++;
            return oboe.drop;
        })
        .node('done', function (item, path) {
            // The trailing "done" key marks the end of the data.
            d3.select("#progress-data").text("all data loaded");
            clearInterval(timerId); // the timer was created with setInterval
            d3.select("#count-data-current").text(datacnt);
        });
    

    Data sample

    {"CNT":107498,
     "keys": ["DATACENTER","FQDN","VALUE","CONSISTENCY_RESULT","FIRST_REC_DATE","LAST_REC_DATE","ACTIVE","OBJECT_ID","OBJECT_TYPE","CONSISTENCY_MESSAGE","ID_PARAMETER"],
     "data": [
        [22,202,"4.9.416.2",0,1449655898,1453867824,-1,"","",0,45],
        [22,570,"4.9.416.2",0,1449655912,1453867884,-1,"","",0,45],
        [14,377,"2.102.453.0",-1,1449654863,1468208273,-1,"","",0,45],
        [14,406,"2.102.453.0",-1,1449654943,1468208477,-1,"","",0,45],
        [22,202,"10.2.293.0",0,1449655898,1453867824,-1,"","",0,8],
        [22,381,"10.2.293.0",0,1449655906,1453867875,-1,"","",0,8],
        [22,570,"10.2.293.0",0,1449655912,1453867884,-1,"","",0,8],
        [22,381,"1.80",0,1449655906,1453867875,-1,"","",0,41],
        [22,570,"1.80",0,1449655912,1453867885,-1,"","",0,41],
        [22,202,"4",0,1449655898,1453867824,-1,"","",0,60],
        [22,381,"4",0,1449655906,1453867875,-1,"","",0,60],
        [22,570,"4",0,1449655913,1453867885,-1,"","",0,60],
        [22,202,"A20",0,1449655898,1453867824,-1,"","",0,52],
        [22,381,"A20",0,1449655906,1453867875,-1,"","",0,52],
        [22,570,"A20",0,1449655912,1453867884,-1,"","",0,52],
        [22,202,"20140201",2,1449655898,1453867824,-1,"","",0,40],
        [22,381,"20140201",2,1449655906,1453867875,-1,"","",0,40],
        [22,570,"20140201",2,1449655912,1453867884,-1,"","",0,40],
        [22,202,"16",-4,1449655898,1453867824,-1,"","",0,58],
        [22,381,"16",-4,1449655906,1453867875,-1,"","",0,58],
        [22,570,"16",-4,1449655913,1453867885,-1,"","",0,58],
        [22,202,"512",0,1449655898,1453867824,-1,"","",0,57],
        [22,381,"512",0,1449655906,1453867875,-1,"","",0,57],
        [22,570,"512",0,1449655913,1453867885,-1,"","",0,57],
        [22,930,"I32",0,1449656143,1461122271,-1,"","",0,66],
        [22,930,"20140803",-4,1449656143,1461122271,-1,"","",0,64],
        [14,1359,"10.2.340.19",0,1449655203,1468209257,-1,"","",0,131],
        [14,567,"10.2.340.19",0,1449655185,1468209111,-1,"","",0,131],
        [22,930,"4.9.416.0",-1,1449656143,1461122271,-1,"","",0,131],
        [14,1359,"10.2.293.0",0,1449655203,1468209258,-1,"","",0,13],
        [14,567,"10.2.293.0",0,1449655185,1468209112,-1,"","",0,13],
        [22,930,"4.9.288.0",-1,1449656143,1461122271,-1,"","",0,13],
        [22,930,"4",0,1449656143,1461122271,-1,"","",0,76],
        [22,930,"96",0,1449656143,1461122271,-1,"","",0,77],
        [22,930,"4",0,1449656143,1461122271,-1,"","",0,74],
        [22,930,"VMware ESXi 5.1.0 build-2323236",0,1449656143,1461122271,-1,"","",0,17],
        [21,616,"A20",0,1449073850,1449073850,-1,"","",0,135],
        [21,616,"4",0,1449073850,1449073850,-1,"","",0,139],
        [21,616,"12",0,1449073850,1449073850,-1,"","",0,138],
        [21,616,"4",0,1449073850,1449073850,-1,"","",0,140],
        [21,616,"2",0,1449073850,1449073850,-1,"","",0,136],
        [21,616,"512",0,1449073850,1449073850,-1,"","",0,141],
        [21,616,"Microsoft Windows Server 2012 R2 Datacenter",0,1449073850,1449073850,-1,"","",0,109],
        [21,616,"4.4.5.100",0,1449073850,1449073850,-1,"","",0,97],
        [21,616,"3.2.7895.0",-1,1449073850,1449073850,-1,"","",0,56],
        [9,2029,"10.7.220.6",-4,1470362743,1478315637,1,"vmnic0","",1,8],
        [9,1918,"10.7.220.6",-4,1470362728,1478315616,1,"vmnic3","",1,8],
        [9,1918,"10.7.220.6",-4,1470362727,1478315616,1,"vmnic2","",1,8],
        [9,1918,"10.7.220.6",-4,1470362727,1478315615,1,"vmnic1","",1,8],
        [9,1918,"10.7.220.6",-4,1470362727,1478315615,1,"vmnic0","",1,8],
        [14,205,"934.5.45.0-1vmw",-50,1465996556,1468209226,-1,"","",0,47],
        [14,1155,"934.5.45.0-1vmw",-50,1465996090,1468208653,-1,"","",0,14],
        [14,963,"934.5.45.0-1vmw",-50,1465995972,1468208526,-1,"","",0,14]
     ],
     "done": true}
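    Idea 2 in the list above (moving the start of the epoch) could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, timestamps are Unix seconds, and the new epoch is taken to be 2015-01-01 00:00:00 UTC.

```javascript
// Custom epoch: 2015-01-01T00:00:00Z, expressed in Unix seconds.
const CUSTOM_EPOCH = Date.UTC(2015, 0, 1) / 1000; // 1420070400

// Server side: store seconds relative to the custom epoch.
function toOffset(unixSeconds) {
  return unixSeconds - CUSTOM_EPOCH;
}

// Client side: restore a Date when the record is received.
function fromOffset(offsetSeconds) {
  return new Date((offsetSeconds + CUSTOM_EPOCH) * 1000);
}

console.log(toOffset(1449655898)); // 29585498
console.log(fromOffset(29585498).toISOString());
```

    For timestamps like the ones in the sample this saves two digits per value, which adds up when every row carries two dates.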
    

    Example of converting keys-first data into a full array of objects

        //function to convert main data to array of objects
        function convertToArrayOfObjects(data) {
            var keys = data.shift(),
                i = 0, k = 0,
                obj = null,
                output = [];
    
            for (i = 0; i < data.length; i++) {
                obj = {};
    
                for (k = 0; k < keys.length; k++) {
                    obj[keys[k]] = data[i][k];
                }
    
                output.push(obj);
            }
    
            return output;
        }
    

    The function above works with a slightly modified version of the data, sampled here

       [["ID1","ID2","TEXT1","STATE1","DATE1","DATE2","STATE2","TEXT2","TEXT3","ID3"],
        [14,377,"2.102.453.0",-1,1449654863,1468208273,-1,"","",0,45],
        [14,406,"2.102.453.0",-1,1449654943,1468208477,-1,"","",0,45],
        [22,202,"10.2.293.0",0,1449655898,1453867824,-1,"","",0,8],
        [22,381,"10.2.293.0",0,1449655906,1453867875,-1,"","",0,8],
        [22,570,"10.2.293.0",0,1449655912,1453867884,-1,"","",0,8],
        [22,381,"1.80",0,1449655906,1453867875,-1,"","",0,41],
        [22,570,"1.80",0,1449655912,1453867885,-1,"","",0,41],
        [22,202,"4",0,1449655898,1453867824,-1,"","",0,60],
        [22,381,"4",0,1449655906,1453867875,-1,"","",0,60],
        [22,570,"4",0,1449655913,1453867885,-1,"","",0,60],
        [22,202,"A20",0,1449655898,1453867824,-1,"","",0,52]]
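    A possible variation, not part of the original answer: the function above calls data.shift(), which removes the header row from the array passed in. The sketch below (hypothetical name tableToObjects) leaves its input untouched. As in the original, only as many values are read from each row as there are keys, so any extra trailing value in a row is ignored.

```javascript
// Convert a keys-first table into objects without modifying the input.
function tableToObjects(table) {
  const keys = table[0];
  return table.slice(1).map(function (row) {
    const obj = {};
    keys.forEach(function (key, i) {
      obj[key] = row[i];
    });
    return obj;
  });
}

// First two data rows of the sample above; the header has 10 names and
// each row has 11 values, so the last value is not mapped.
const table = [
  ["ID1","ID2","TEXT1","STATE1","DATE1","DATE2","STATE2","TEXT2","TEXT3","ID3"],
  [14,377,"2.102.453.0",-1,1449654863,1468208273,-1,"","",0,45],
  [14,406,"2.102.453.0",-1,1449654943,1468208477,-1,"","",0,45]
];

const records = tableToObjects(table);
console.log(records[0].TEXT1); // "2.102.453.0"
console.log(table.length);     // 3, the header row is still there
```

    The resulting array of objects is in the shape that can be handed to a crossfilter's add method.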
    

    Also consider using memcached https://memcached.org/ or redis https://redis.io/ to cache the data on the server side; depending on the size of the data, redis may take you further.