
Question

I have clickstream data, obtained from Google Analytics in R, with the following columns: UserID, SessionID, TimeStamp, PagePath, PageViews

UserID    SessionID    TimeStamp   PagePath             PageViews
1           1.1          12:01      google.com             1
1           1.1          12:03      google.com/products    1
1           1.1          12:06      google.com/info        1
1           1.1          12:08      google.com/purchase    1
2           2.1          09:07      google.com             1
2           2.1          09:13      google.com/info        1

The user ID repeats throughout the data set, because a single user ID can have multiple page paths, timestamps, and even session IDs.

What I want is a package or some other way to put this into a data frame that can be used with the clickstream package in R, so that the result looks like this:

UserID    PagePathBrokenOut
1         google.com,products,info
2         google.com,info

Which function or package in R can accomplish this? Building the result by hand is not an option, because there are literally thousands of user IDs and paths.

I have explored a few functions, but have not had much luck so far. There must be a way to condense each UserID into a single row and then display its page paths.

I have tried several approaches, including `c(`, but none of them has worked so far.

Again, any way to collapse the repeated user IDs in one column into a single row each, with the individual paths listed, would be great.

I also tried using `data.frame`, but that did not work.

dplyr

Answer 1

Two solutions:

`dplyr`

library(dplyr)
dat %>%
  mutate(Page = gsub("/.*", "", PagePath),
         Path = trimws(gsub("^/|/$", "", gsub("^[^/]*", "", PagePath)))) %>%
  group_by(UserID, Page) %>%
  summarize(PagePathBrokenOut = paste(c(Page[1], Filter(nzchar, Path)), collapse = ",")) %>%
  ungroup()
# # A tibble: 2 x 3
#   UserID Page       PagePathBrokenOut                
#    <int> <chr>      <chr>                            
# 1      1 google.com google.com,products,info,purchase
# 2      2 google.com google.com,info
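
To make the string handling easier to follow, here is a minimal base R sketch that walks through the same idea one step at a time on three paths from the question. The variable names (`paths`, `tail_part`, `segment`, `broken_out`) are illustrative and not part of the answer, and the sketch strips only a leading slash, which is all this sample data needs.

```r
# Three sample paths taken from the question's data
paths <- c("google.com", "google.com/products", "google.com/info")

# Step 1: keep everything before the first slash (the site)
page <- gsub("/.*", "", paths)

# Step 2: drop the site, leaving "" or "/segment"
tail_part <- gsub("^[^/]*", "", paths)

# Step 3: strip the leading slash and any surrounding whitespace
segment <- trimws(sub("^/", "", tail_part))

# Step 4: join the site with the non-empty segments
broken_out <- paste(c(page[1], Filter(nzchar, segment)), collapse = ",")
print(broken_out)
# [1] "google.com,products,info"
```

`Filter(nzchar, ...)` is what keeps the bare `google.com` row from adding an empty entry to the result.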

`data.table`

(Note: I use the magrittr package only to break the call up into a pipeline for readability. It is not required.)

library(data.table)
library(magrittr)
datDT <- as.data.table(dat)
datDT %>%
  .[, c("Page", "Path") := .(gsub("/.*", "", PagePath),
                             trimws(gsub("^/|/$", "", gsub("^[^/]*", "", PagePath)))), ] %>%
  .[, .(PagePathBrokenOut = paste(c(Page[1], Filter(nzchar, Path)), collapse = ",")),
    by = c("UserID", "Page")]
#    UserID       Page                 PagePathBrokenOut
# 1:      1 google.com google.com,products,info,purchase
# 2:      2 google.com                   google.com,info

Data:


dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
UserID    SessionID    TimeStamp   PagePath             PageViews
1           1.1          12:01      google.com             1
1           1.1          12:03      google.com/products    1
1           1.1          12:06      google.com/info        1
1           1.1          12:08      google.com/purchase    1 
2           2.1          09:07      google.com             1
2           2.1          09:13      google.com/info        1 ")
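
If neither package is available, the same collapse can be sketched in base R with `split()` and `vapply()`. This is only an illustration under the same assumptions as the sample data (one site per user, at most one path segment after it); `collapse_paths` is a hypothetical helper name, not a function from any package.

```r
# Sample data from the question
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
UserID    SessionID    TimeStamp   PagePath             PageViews
1           1.1          12:01      google.com             1
1           1.1          12:03      google.com/products    1
1           1.1          12:06      google.com/info        1
1           1.1          12:08      google.com/purchase    1
2           2.1          09:07      google.com             1
2           2.1          09:13      google.com/info        1 ")

# Hypothetical helper: turn one user's paths into "site,segment,segment"
collapse_paths <- function(paths) {
  site <- sub("/.*", "", paths[1])       # text before the first slash
  rest <- sub("^[^/]*/?", "", paths)     # text after the first slash, "" if none
  paste(c(site, rest[nzchar(rest)]), collapse = ",")
}

# One character vector of paths per user, in the original row order
paths_by_user <- split(dat$PagePath, dat$UserID)

res <- data.frame(
  UserID = as.integer(names(paths_by_user)),
  PagePathBrokenOut = vapply(paths_by_user, collapse_paths, character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(res)
```

With thousands of users the grouped `dplyr` or `data.table` versions are likely to be more convenient, but this version has no dependencies.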

Answer 2

Here is another approach that may be useful in some cases (the data frame is called `df` here):

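# Load the packages that provide %>%, select() and spread()
library(dplyr)
library(tidyr)

# Remove the columns that are not required, then spread to one column per PagePath
df2 <- df %>%
  select(UserID, PagePath, PageViews) %>%
  spread(PagePath, PageViews)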

# Temporary vector to store the UserIDs, which are then removed from df2
UserIDTemp <- df2$UserID
df2$UserID <- NULL

#Populate data frame with URLs instead of page views. NAs will be generated
w <- which(!is.na(df2), arr.ind = TRUE)
df2[w] <- names(df2)[w[, "col"]]

#Paste/concatenate all paths into a single string
df_args <- c(df2, sep = ", ")
pastedPaths <-  do.call(paste, df_args)

#Create data frame with UserIDs and paths
PagePaths <- data.frame(UserIDTemp, pastedPaths)

data.frame(UserIDTemp,pastedPaths)
# UserIDTemp                                                  pastedPaths
# 1 google.com, google.com/info, google.com/products, google.com/purchase
# 2                                   google.com, google.com/info, NA, NA
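
The `NA` entries in the output above come from the spread step: user 2 never visited two of the pages, so those cells stay empty and get pasted as `NA`. A minimal sketch of one way to avoid that, assuming the same `df` as in the question, is to paste the paths per user directly instead of spreading first. This is a swapped-in technique, not this answer's own method, and it keeps the paths in visit order rather than alphabetical order.

```r
library(dplyr)

# Sample data from the question, named df as in this answer
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
UserID    SessionID    TimeStamp   PagePath             PageViews
1           1.1          12:01      google.com             1
1           1.1          12:03      google.com/products    1
1           1.1          12:06      google.com/info        1
1           1.1          12:08      google.com/purchase    1
2           2.1          09:07      google.com             1
2           2.1          09:13      google.com/info        1 ")

# One row per user, containing only the paths that user actually visited
PagePaths <- df %>%
  group_by(UserID) %>%
  summarise(pastedPaths = paste(PagePath, collapse = ", ")) %>%
  as.data.frame()

print(PagePaths)
```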

How to summarize duplicate IDs in one column into a single row and display the result
