Question

I am trying to work with frequent sequences in R (SPADE). I have the following data set:

d1 <- c(1:10)
d2 <- c("nut", "bolt", "screw")
data <- data.frame(expand.grid(d1,d2))
data$status <- sample(c("a","b","c"), size = nrow(data), replace = TRUE)
colnames(data) <- c("day", "widget", "status")

   day widget status
1    1    nut      c
2    2    nut      b
3    3    nut      b
4    4    nut      b
5    5    nut      a
6    6    nut      a
7    7    nut      b
8    8    nut      c
9    9    nut      c
10  10    nut      b
11   1   bolt      a
12   2   bolt      b
...

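I can't get the data into a format that seems to work with the various available packages. I think the basic problem is that most packages expect sequences to be associated with an identity and an event. In my case, neither exists.

I would like to answer questions such as:

If on any given day widget [bolt] has status "a" and widget [screw] has status "c", and on the next day widget [screw] has status "b", then on the third day widget [nut] is likely to be "a".

So there is no identity or transaction/event to use. Am I overcomplicating this problem? Or is there a package that fits it well? So far I have tried arulesSequences and TraMineR.

Thanks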

Answer 1

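I think you'll find this kind of problem is easiest to solve by reshaping the data from long to wide and then implementing a logical test. For example: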

# reshape from long to wide (value.var is named explicitly so dcast does not have to guess it)
data2 <- reshape2::dcast(data, day ~ widget, value.var = "status")

# get the next row's value for "nut"
data2$next_nut <- dplyr::lead(data2$nut)

# implement your test
data2$bolt == "a" & data2$screw == "c" & data2$next_nut == "a"
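
The test above is a simplified version of the rule in the question: it leaves out the next-day screw condition and looks at nut one day ahead rather than two. The following is a minimal base-R sketch of the full rule, under stated assumptions: the toy data are hand-picked for illustration (not the poster's random sample), and `shift_up` is a hypothetical helper that plays the role of `dplyr::lead`.

```r
# Hypothetical toy data in the same long format as the question
# (statuses are hand-picked here, not taken from the original post).
data <- data.frame(
  day    = rep(1:6, times = 3),
  widget = rep(c("nut", "bolt", "screw"), each = 6),
  status = c("c", "b", "a", "b", "a", "c",   # nut
             "a", "b", "a", "c", "b", "a",   # bolt
             "c", "b", "c", "b", "a", "a"),  # screw
  stringsAsFactors = FALSE
)

# Long to wide with base R: one row per day, one column per widget
wide <- reshape(data, idvar = "day", timevar = "widget", direction = "wide")
names(wide) <- sub("^status\\.", "", names(wide))
wide <- wide[order(wide$day), ]

# Hypothetical helper: shift a vector up by k positions, padding the end with NA
shift_up <- function(x, k) c(x[-seq_len(k)], rep(NA, k))

wide$screw_next <- shift_up(wide$screw, 1)  # screw on day t + 1
wide$nut_in_two <- shift_up(wide$nut, 2)    # nut on day t + 2

# Days on which the antecedent of the rule holds
antecedent <- wide$bolt == "a" & wide$screw == "c" & wide$screw_next == "b"
rule_days <- wide$day[which(antecedent)]

# Nut status two days after each matching day
wide[wide$day %in% rule_days, c("day", "nut_in_two")]
```

Wrapping the condition in `which()` drops the trailing days for which the shifted columns are `NA`, so incomplete windows at the end of the series are never counted as matches.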

Answer 2

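The key here is to reshape the data set according to your goal. You must make sure every row contains all the input information (your conditions) and the target variable (what you are trying to find).

Based on the problem you describe:

The input information is "the widget [bolt] value on a given day, the widget [screw] value on the same day, and the widget [screw] value on the following day", so you need to make sure every row of the new data set carries this information.

The target information is "the widget [nut] value on the third day".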

# for reproducibility reasons
set.seed(16)  

# example dataset
d1 <- c(1:100)
d2 <- c("nut", "bolt", "screw")
data <- data.frame(expand.grid(d1,d2))
data$status <- sample(c("a","b","c"), size = nrow(data), replace = TRUE)
colnames(data) <- c("day", "widget", "status")

library(tidyverse)

data %>% 
  spread(widget, status) %>%             # reshape data
  mutate(screw_next_1 = lead(screw),     # add screw next day
         nut_next_2 = lead(nut, 2)) %>%  # add nut 2 days after (target variable)
  filter(bolt == "a" & screw == "c" & screw_next_1 == "b") # get rows that satisfy your criteria

#   day nut bolt screw screw_next_1 nut_next_2
# 1   8   c    a     c            b          a
# 2  19   c    a     c            b          c
# 3  62   c    a     c            b          c
# 4  97   c    a     c            b          b

With a simple calculation you can say that, based on this data and given the condition, the probability of nut = a on the third day is 1/4.
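
As a small sketch of that calculation, the `nut_next_2` values below are copied by hand from the filtered output above; in practice one would take them from the column of the filtered data frame. The variable name `p_a` is only illustrative.

```r
# Outcomes copied from the nut_next_2 column of the filtered output above
nut_next_2 <- c("a", "c", "c", "b")

# Share of matching days that are followed by nut == "a" two days later
p_a <- mean(nut_next_2 == "a")
p_a
# 0.25

# Full distribution of the target variable among the matching days
prop.table(table(nut_next_2))
```

With only four matching days out of 100 the estimate is very rough, so it is worth checking the number of matches before reading much into the proportion.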

Answer 3

I'm not sure what you want to do. If you want to use TraMineR, you can input your data as follows, assuming the widgets are the sequence IDs:

library(TraMineR)

## Transforming into the STS form expected by seqdef()
sts.data <- seqformat(data, from="SPELL", to="STS", id="widget", 
                      begin="day", end="day", status="status",
                      limit=10)

## Setting position names and sequence names
names(sts.data) <- paste0("d",rep(1:10))
rownames(sts.data) <- d2
sts.data
#       d1 d2 d3 d4 d5 d6 d7 d8 d9 d10
# nut    b  a  b  b  b  a  c  a  a   a
# bolt   c  b  a  b  a  c  b  a  c   c
# screw  a  b  a  a  c  c  b  b  b   c

## Creating the state sequence object
sseq <- seqdef(sts.data)

## Plotting the sequences
seqiplot(sseq, ytlab="id", ncol=3)

Pattern sequences and event problem in R sequences

3 answers: