Compute a function on a nested tibble column?

Date: 2018-06-24 09:53:33

Tags: r dataframe tibble

I have a data frame with a column of tibbles. Here is a portion of my data:

date        time        uuid                data
2018-06-23  18:25:24    0b27ea5fad61c99d    <tibble>
2018-06-23  18:25:38    0b27ea5fad61c99d    <tibble>
2018-06-23  18:26:01    0b27ea5fad61c99d    <tibble>
2018-06-23  18:26:23    0b27ea5fad61c99d    <tibble>
2018-06-23  18:26:37    0b27ea5fad61c99d    <tibble>
2018-06-23  18:27:00    0b27ea5fad61c99d    <tibble>
2018-06-23  18:27:22    0b27ea5fad61c99d    <tibble>
2018-06-23  18:27:39    0b27ea5fad61c99d    <tibble>
2018-06-23  18:28:06    0b27ea5fad61c99d    <tibble>
2018-06-23  18:28:30    0b27ea5fad61c99d    <tibble>

My data column consists of tibbles with a single character column:

contacts
5646
65748
115
498456
35135

Here is my function:

jaccard <- function(vector1, vector2) {
  return(length(intersect(vector1, vector2)) /
        length(union(vector1, vector2)))
}

My goal is to compute the Jaccard index between every two consecutive tibbles in the data column.

I tried applying jaccard to the data column, but for some reason it does not seem to work.

I know I'm pretty close; please advise.

1 answer:

Answer 0 (score: 1)

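The reason is that the jaccard function is not written to handle vector arguments. As you know, a function used inside mutate receives whole columns as vectors (in the OP's example, a vector of 10 tibbles). Because jaccard expects two individual sets rather than a vector of tibbles, the result is not what was expected.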
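The behaviour described above can be reproduced with a minimal base-R sketch. The names data_col and prev_col and the three small contact sets are hypothetical stand-ins for the data list column and its lagged copy; they are not taken from the question.

```r
jaccard <- function(vector1, vector2) {
  length(intersect(vector1, vector2)) / length(union(vector1, vector2))
}

# Hypothetical stand-in for the `data` list column: three contact sets.
data_col <- list(c(1, 2, 3), c(1, 2), c(1, 2, 3))

# Base-R stand-in for lag(data_col, 1): shift down one row, pad with NA.
prev_col <- c(list(NA), data_col[-length(data_col)])

# Called directly, jaccard() treats each list as ONE set whose elements are
# whole vectors, so it yields a single number for the entire column.
whole <- jaccard(data_col, prev_col)
length(whole)  # 1, not one value per row
```

The call compares the two lists as sets of vectors, which is why no per-row similarity comes out of it.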

The simplest fix is to vectorise the jaccard function so that it can handle vector arguments. Vectorize can be used to convert the function:

# Function 
jaccard <- function(vector1, vector2) {
  return(length(intersect(vector1, vector2)) / 
           length(union(vector1, vector2)))
}
# Vectorised version of jaccard function
jaccardV <- Vectorize(jaccard)


library(dplyr)
df %>%
  mutate(j = jaccardV(data, lag(data, 1)))

#          date     time             uuid                            data         j
# 1  2018-06-23 18:25:24 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135 0.0000000
# 2  2018-06-23 18:25:38 0b27ea5fad61c99d                     5646, 65748 0.4000000
# 3  2018-06-23 18:26:01 0b27ea5fad61c99d                5646, 65748, 115 0.6666667
# 4  2018-06-23 18:26:23 0b27ea5fad61c99d                            5646 0.3333333
# 5  2018-06-23 18:26:37 0b27ea5fad61c99d                     5646, 65748 0.5000000
# 6  2018-06-23 18:27:00 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135 0.4000000
# 7  2018-06-23 18:27:22 0b27ea5fad61c99d                     5646, 65748 0.4000000
# 8  2018-06-23 18:27:39 0b27ea5fad61c99d                5646, 65748, 115 0.6666667
# 9  2018-06-23 18:28:06 0b27ea5fad61c99d                            5646 0.3333333
# 10 2018-06-23 18:28:30 0b27ea5fad61c99d                     5646, 65748 0.5000000
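Vectorize is a wrapper around mapply, so roughly the same pairing can be sketched in base R without dplyr. The following is an illustrative sketch, not part of the original answer: the variable names contacts and previous are assumptions, and the NA padding imitates what lag produces for the first row.

```r
jaccard <- function(vector1, vector2) {
  length(intersect(vector1, vector2)) / length(union(vector1, vector2))
}

# The five contact sets from the answer's sample data, as a plain list.
contacts <- list(
  c(5646, 65748, 115, 498456, 35135),
  c(5646, 65748),
  c(5646, 65748, 115),
  5646,
  c(5646, 65748)
)

# Previous row's set for each row; the first row has no predecessor.
previous <- c(list(NA), contacts[-length(contacts)])

# mapply() calls jaccard() once per pair (contacts[[i]], previous[[i]]).
j <- mapply(jaccard, contacts, previous)
round(j, 7)
# 0.0000000 0.4000000 0.6666667 0.3333333 0.5000000
```

These values match rows 1 to 5 of the j column above. The first value is 0 because the first set shares nothing with the NA placeholder.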

Data:

df <- read.table(text="
date        time        uuid                
2018-06-23  18:25:24    0b27ea5fad61c99d    
2018-06-23  18:25:38    0b27ea5fad61c99d    
2018-06-23  18:26:01    0b27ea5fad61c99d    
2018-06-23  18:26:23    0b27ea5fad61c99d    
2018-06-23  18:26:37    0b27ea5fad61c99d    
2018-06-23  18:27:00    0b27ea5fad61c99d    
2018-06-23  18:27:22    0b27ea5fad61c99d    
2018-06-23  18:27:39    0b27ea5fad61c99d    
2018-06-23  18:28:06    0b27ea5fad61c99d    
2018-06-23  18:28:30    0b27ea5fad61c99d",
header = TRUE, stringsAsFactors = FALSE)

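library(dplyr)  # re-exports tibble()

t1 <- tibble(contacts = c(5646,65748,115,498456,35135))
t2 <- tibble(contacts = c(5646,65748))
t3 <- tibble(contacts = c(5646,65748,115))
t4 <- tibble(contacts = c(5646))
t5 <- tibble(contacts = c(5646,65748))

# c() flattens the five one-column tibbles into a list of five numeric vectors,
# which the assignment recycles to fill all 10 rows.
df$data <- c(t1,t2,t3,t4,t5)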

df
#          date     time             uuid                            data
# 1  2018-06-23 18:25:24 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135
# 2  2018-06-23 18:25:38 0b27ea5fad61c99d                     5646, 65748
# 3  2018-06-23 18:26:01 0b27ea5fad61c99d                5646, 65748, 115
# 4  2018-06-23 18:26:23 0b27ea5fad61c99d                            5646
# 5  2018-06-23 18:26:37 0b27ea5fad61c99d                     5646, 65748
# 6  2018-06-23 18:27:00 0b27ea5fad61c99d 5646, 65748, 115, 498456, 35135
# 7  2018-06-23 18:27:22 0b27ea5fad61c99d                     5646, 65748
# 8  2018-06-23 18:27:39 0b27ea5fad61c99d                5646, 65748, 115
# 9  2018-06-23 18:28:06 0b27ea5fad61c99d                            5646
# 10 2018-06-23 18:28:30 0b27ea5fad61c99d                     5646, 65748