Selecting based on vector elements

Time: 2019-05-10 23:13:56

Tags: r dataframe

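I'm very new to this, so apologies in advance. I have two vectors: a character vector of account names (30) and a character vector of product names (30). I also have a data frame with three columns (customer name, product name, and revenue), but it covers far more than those 30 accounts and products.

Ultimately, I need a 30x30 data frame whose rows are the products in the product-name vector, whose columns are the accounts in the account-name vector, and whose values are the revenue associated with the account in that column and the product in that row.

I think I need a nested loop? But I don't know how to use one to populate the data frame properly.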

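account <- c("a", "b")            # etc., 30 names in total
product <- c("prod_a", "prod_b")  # etc., 30 names in total

for (i in seq_along(account)) {
    # the inner loop needs its own index, otherwise it overwrites i
    for (j in seq_along(product)) {
        # .....
    }
}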

Honestly, I'm just lost, haha.

1 answer:

Answer 0: (score: 0)

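I think I know what you're trying to do here. I suspect you have a good reason for wanting this 30x30 cross-table structure, but I'd also like to take the opportunity to encourage "tidy" data for analysis. The criteria for data to count as "tidy" can be summarized in three main points:

  1. Each variable forms a column.

  2. Each observation forms a row.

  3. Each type of observational unit forms a table.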
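As a small illustration of those three criteria, here is a sketch with made-up values (the names `wide` and `long` and all the numbers are invented for this example). The same revenues are stored first as a cross-table and then in tidy form, where each row is one account/product observation.

```r
# cross-table ("wide") layout: one row per account, one column per product
wide <- data.frame(
    accounts = c("a", "b"),
    prod_a   = c(10, 30),
    prod_b   = c(20, 40),
    stringsAsFactors = FALSE
)

# tidy ("long") layout: one column per variable, one row per observation
long <- data.frame(
    accounts = rep(wide$accounts, times = 2),
    products = rep(c("prod_a", "prod_b"), each = nrow(wide)),
    revenue  = c(wide$prod_a, wide$prod_b),
    stringsAsFactors = FALSE
)

long
```

The question's three-column data frame (customer, product, revenue) already has the shape of `long`.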

That said, below is my attempt to interpret and demonstrate what I think you want to achieve.

library(tidyr)

# set up some fake data to better explain
set.seed(1)  # make the random revenues reproducible
account_vec <- paste0(letters, 1:26)
product_vec <- paste0(as.character(101:126), LETTERS)
revenue_vec <- rnorm(26*26)

# take every combination of accounts and products to set up our fake data
df <- expand.grid(account_vec, product_vec)
names(df) <- c("accounts", "products")
df$revenue <- revenue_vec

# if this is what your data looks like currently, I would consider this fairly "tidy"


# now let's pretend there's some data we need to filter out
df <- rbind(df,
    data.frame(
        accounts = paste0("bad_account", 1:3),
        products = paste0("bad_product", 1:3),
        revenue = rnorm(3)
    )
)


# filter to just what is included in our "accounts" and "products" vectors
df <- df[df$accounts %in% account_vec, ]
df <- df[df$products %in% product_vec, ]


# spread out the products so they occupy the column values
df2 <- df %>% tidyr::spread(key="products", value="revenue")

# if you aren't familiar with the "%>%" pipe operator, the above
# line of code is equivalent to this one below:
# df2 <- tidyr::spread(df, key="products", value="revenue")

# now we have accounts as rows, products as columns, and revenues at the intersection

# we can go one step further by making the accounts our row names if we want
row.names(df2) <- df2$accounts
df2$accounts <- NULL

# now the accounts are in the row name and not in a column on their own
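The question asked for products as rows and accounts as columns, and wondered about a nested loop. The sketch below uses only base R and made-up data: the names `sales`, `kept`, `result`, and `result2` are placeholders, and three accounts and products stand in for the thirty. As an assumption of this sketch, revenues for a repeated account/product pair are summed and absent pairs become 0 (whereas `spread()` above leaves them as `NA`). It fills the matrix with a nested loop and then builds the same table with `xtabs()`.

```r
# made-up stand-ins for the two 30-element vectors
account <- c("a", "b", "c")
product <- c("prod_a", "prod_b", "prod_c")

# made-up long data, including an account and a product that should be ignored
sales <- data.frame(
    accounts = c("a", "a", "b", "c", "zzz", "b"),
    products = c("prod_a", "prod_b", "prod_a", "prod_c", "prod_a", "other"),
    revenue  = c(10, 20, 30, 40, 50, 60),
    stringsAsFactors = FALSE
)

# 1) nested loop: rows are products, columns are accounts
result <- matrix(0, nrow = length(product), ncol = length(account),
                 dimnames = list(product, account))
for (i in seq_along(product)) {
    for (j in seq_along(account)) {
        hit <- sales$products == product[i] & sales$accounts == account[j]
        result[i, j] <- sum(sales$revenue[hit])  # 0 when the pair is absent
    }
}

# 2) same table without a loop: filter first, then let the factor levels
#    fix the row and column order
kept <- sales[sales$accounts %in% account & sales$products %in% product, ]
kept$products <- factor(kept$products, levels = product)
kept$accounts <- factor(kept$accounts, levels = account)
result2 <- xtabs(revenue ~ products + accounts, data = kept)

result
as.data.frame.matrix(result2)  # convert to a data frame if one is needed
```

The loop is fine for a 30x30 table, but the `xtabs()` version is shorter and avoids the index bookkeeping.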