Create new data frame rows based on a column in another data frame

Date: 2016-04-24 19:40:17

Tags: r dataframe

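I have two data frames. In the first (df A), the first column is a list of items. In the second (df B), the first column contains items from that list, but some rows hold several items. For every item in the first column of df B that also appears in df A, I want to create a new row.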

DF A

dfA
  Index  X
1  1    alpha
2  2    beta
3  3    gamma
4  4    delta

DF B

dfB
  list    X  
1  1 4    alpha
2  3 2 1  beta
3  4 1 2  gamma
4  3      delta

Desired

dfC
  Index   x
1  1     alpha
2  4     alpha
3  3     beta
4  2     beta
5  1     beta
6  4     gamma
7  1     gamma
8  2     gamma
9  3     delta

The actual data I am using: DF A

dput(head(allwines))
structure(list(Wine = c("Albariño", "Aligoté", "Amarone", "Arneis", 
"Asti Spumante", "Auslese"), Description = c("Spanish white wine grape that makes crisp, refreshing, and light-bodied wines.", 
"White wine grape grown in Burgundy making medium-bodied, crisp, dry wines with spicy character.", 
"From Italy’s Veneto Region a strong, dry, long- lived red, made from a blend of partially dried red grapes.", 
"A light-bodied dry wine the Piedmont Region of Italy", "From the Piedmont Region of Italy, A semidry sparkling wine produced from the Moscato di Canelli grape in the village of Asti", 
"German white wine from grapes that are very ripe and thus high in sugar"
)), .Names = c("Wine", "Description"), row.names = c(NA, 6L), class = "data.frame")

DF B

> dput(head(cheesePairing))
structure(list(Wine = c("Cabernet Sauvignon\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Pinot Noir\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Sauvignon Blanc\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Zinfandel", 
"Chianti\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Pinot Noir\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Sangiovese", 
"Chardonnay", "Bardolino\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Malbec\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Riesling\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Rioja\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Sauvignon Blanc", 
"Tempranillo", "Asti Spumante"), Cheese = c("Abbaye De Belloc Cheese", 
"Ardrahan cheese", "Asadero cheese", "Asiago cheese", "Azeitao", 
"Baby Swiss Cheese"), Suggestions = c("Pair with apples,  sliced pears OR  a sampling of olives and thin sliced salami.  Pass around slices of baguette.", 
"Serve with a substantial wheat cracker and apples or grapes.", 
"Rajas (blistered chile strips) fresh corn tortillas", "Table water crackers, raw nuts (almond, walnuts)", 
"Nutty brown bread, grapes", "Server with dried fruits, whole grain, nutty breads, nuts"
)), .Names = c("Wine", "Cheese", "Suggestions"), row.names = c(NA, 
6L), class = "data.frame")

2 answers:

Answer 0 (score: 2)

Building on Curt's answer, I think I have found a more efficient solution, assuming I have interpreted your goal correctly.

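My commented code is below. You should be able to run it as is and get the desired dfC. One thing to note: I am assuming (based on your actual data) that the delimiter in dfB$Index is "\r\n".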

# set up fake data
dfA<-data.frame(Index=c('1','2','3','4'), X=c('alpha','beta','gamma','delta'))
dfB<-data.frame(Index=c('1 \r\n 4','3 \r\n 2 \r\n 1','4 \r\n 1 \r\n 2','3'), X=c('alpha','beta','gamma','delta'))

dfA$Index<-as.character(dfA$Index)
dfA$X<-as.character(dfA$X)
dfB$Index<-as.character(dfB$Index)
dfB$X<-as.character(dfB$X)


dfB_index_parsed<-strsplit(x=dfB$Index,"\r\n") # split Index of dfB by delimiter "\r\n" and store in a list
names(dfB_index_parsed)<-dfB$X
dfB_split_num<-lapply(dfB_index_parsed, length) # find the number of splits per row of dfB and store in a list
dfB_split_num_vec<-do.call('c', dfB_split_num) # convert number of splits above from list to vector

g<-do.call('c',dfB_index_parsed) # store all split values in a single vector
g<-trimws(g) # remove leading/trailing whitespace left after the split (spaces inside multi-word values are kept)
names(g)<-rep(names(dfB_split_num_vec), dfB_split_num_vec ) # associate each split Index from dfB with X from dfB
g<-g[g %in% dfA$Index] # check which dfB$Index are in dfA$Index

dfC<-data.frame(Index=g, X=names(g)) # construct data.frame
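The same split-and-repeat idea can be sketched against data shaped like the real cheesePairing column, where multi-word wine names are separated by "\r\n" plus runs of spaces. This is only an illustration under assumptions: cheese_demo and the helper split_wines are made-up names, and the sample strings are shortened versions of the ones in the question.

```r
# Hypothetical miniature of cheesePairing (names and values are illustrative)
cheese_demo <- data.frame(
  Wine = c("Chianti\r\n      \r\n      Pinot Noir\r\n    \r\n    Sangiovese",
           "Chardonnay"),
  Cheese = c("Ardrahan cheese", "Asadero cheese"),
  stringsAsFactors = FALSE
)

# Split one string on line breaks, trim each piece, and drop empty pieces
split_wines <- function(x) {
  pieces <- trimws(strsplit(x, "[\r\n]+")[[1]])
  pieces[nzchar(pieces)]
}

parts <- lapply(cheese_demo$Wine, split_wines)

# One row per wine; each cheese is repeated once per wine found in its row
long <- data.frame(
  Wine = unlist(parts),
  Cheese = rep(cheese_demo$Cheese, lengths(parts)),
  stringsAsFactors = FALSE
)
long
```

To keep only wines that also appear in the first data frame, a filter such as long[long$Wine %in% allwines$Wine, ] could follow, as in the `%in%` step above.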

Answer 1 (score: 0)

First, let me offer a functional answer to your question. I doubt it is very efficient, but it does work.

# construct toy data
dfA <- data.frame(index = 1:4, X = letters[1:4])

dfB <- data.frame(X = letters[1:4])
dfB$list_elements <- list(c(1, 4), c(3, 2, 1), c(4, 1, 2), c(3))

# define function that provides solution

unlist_merge_df <- function(listed_df, reference_df){
    # reference_df assumed to have columns "X" and "index"
    # listed_df assumed to have column "list_elements"
    df_out <- data.frame(index = c(), X = c())
    my_list <- listed_df$list_elements
    for(idx in seq_along(my_list)){  # seq_along() also handles an empty list safely
        df_out <- rbind(df_out, 
                        data.frame(index = my_list[[idx]], 
                                   X = listed_df[idx, 'X'])
                        )
    }
    return(df_out)
}

# call the function
dfC <- unlist_merge_df(dfB, dfA)

# show output in human and R-parseable formats
dfC

dput(dfC)

The output is:

index   X
1   1   a
2   4   a
3   3   b
4   2   b
5   1   b
6   4   c
7   1   c
8   2   c
9   3   d

structure(list(index = c(1, 4, 3, 2, 1, 4, 1, 2, 3), X = structure(c(1L, 
1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L), .Label = c("a", "b", "c", "d"
), class = "factor")), .Names = c("index", "X"), row.names = c(NA, 
9L), class = "data.frame")

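Second, let me say that the situation you are in is not ideal. If you can avoid it, you probably should. Either skip data frames entirely and work only with lists, or (if you can) avoid building the data frame with a list column at all and construct the desired output directly.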
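As a rough sketch of what constructing the output directly might look like, the loop with rbind can be replaced by unlist() and rep() with lengths(). This reuses the toy dfB from above; dfC_direct is a name made up for this illustration, and stringsAsFactors = FALSE is an added assumption so that X stays a character column.

```r
# Toy data with a list column, as in the example above
dfB <- data.frame(X = letters[1:4], stringsAsFactors = FALSE)
dfB$list_elements <- list(c(1, 4), c(3, 2, 1), c(4, 1, 2), c(3))

# Flatten the list column and repeat each X once per element in its row
dfC_direct <- data.frame(
  index = unlist(dfB$list_elements),
  X = rep(dfB$X, lengths(dfB$list_elements)),
  stringsAsFactors = FALSE
)
dfC_direct
```

This avoids growing a data frame inside a loop, which tends to get slow as the number of rows increases.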