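I am currently trying to reformat a document library into a descending hierarchical format. The docFrom column contains the higher-level document, subDoc contains the lower-level document, and the Parent column gives the number of levels down the document sits, where 1 is the top-level document. The data in Docs are all strings and currently look like the sample below. The only difference is that in the real data subDoc contains all unique strings, which the dummy data does not show; think of them as the actual names of networks, shows, and episodes.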
docFrom    subDoc     Parent
NA         Network 1  1
Network 1  TvShow 1   2
Network 1  TvShow 2   2
Network 1  TvShow 3   2
Network 1  TvShow 4   2
TvShow 1   Episode 1  3
TvShow 1   Episode 2  3
TvShow 2   Episode 1  3
TvShow 2   Episode 2  3
TvShow 3   Episode 1  3
TvShow 1   Episode 2  3
For visualization purposes, I would like to convert it to:
1          2         3
Network 1  TvShow 1  Episode 1
Network 1  TvShow 1  Episode 2
Network 1  TvShow 2  Episode 1
Network 1  TvShow 2  Episode 2
Network 1  TvShow 3  Episode 1
Network 1  TvShow 3  Episode 2
Using
df <- reshape(Docs, idvar = "docFrom", timevar = "Parent", direction = "wide")
did not work, and neither did
df <- spread(Docs, Parent, subDoc)
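I have searched for a solution but could not find any example that mirrors this situation. Is there a function that can reshape a data frame like this?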
Answer 0 (score: 1)
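We will solve this with a combination of base R and the sqldf package. We can use the Parent column to split the data into three data frames, and then join the two resulting data frames where Parent is 2 or 3 on the TV show name.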
textFile <- "docFrom |subDoc |Parent
NA |Network 1|1
Network 1|TvShow 1 |2
Network 1|TvShow 2 |2
Network 1|TvShow 3 |2
Network 1|TvShow 4 |2
TvShow 1 |Episode 1|3
TvShow 1 |Episode 2|3
TvShow 2 |Episode 1|3
TvShow 2 |Episode 2|3
TvShow 3 |Episode 1|3
TvShow 1 |Episode 2|3"
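# strip.white = TRUE trims the padding around each field so the join keys match exactly
data <- read.csv(text = textFile, sep = "|", stringsAsFactors = FALSE, strip.white = TRUE)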
splitVar <- as.factor(data$Parent)
groupedData <- split(data,splitVar)
# second frame in list contains networks & shows
shows <- groupedData[[2]][-3]
colnames(shows) <- c("Network","Show")
# third frame in list contains shows and episodes
episodes <- groupedData[[3]][-3]
colnames(episodes) <- c("Show","Episode")
# use sqldf to join shows with episodes, since the shows data frame
# also includes the network names
library(sqldf)
sqlstmt <- "select s.Network, e.Show, e.Episode from shows s, episodes e where s.Show = e.Show"
result <- sqldf(sqlstmt)
result
...and the output:
> result
Network Show Episode
1 Network 1 TvShow 1 Episode 1
2 Network 1 TvShow 1 Episode 2
3 Network 1 TvShow 1 Episode 2
4 Network 1 TvShow 2 Episode 1
5 Network 1 TvShow 2 Episode 2
6 Network 1 TvShow 3 Episode 1
>
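We can use dplyr::inner_join() to perform the same join that we did with sqldf(). Once the incoming data has been split into separate data frames by the value of Parent, the shows and episodes data frames have been extracted from the list, and their columns have been renamed, we join the two data frames as follows.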
# dplyr version
library(dplyr)
shows %>% inner_join(episodes, by = "Show")
...and the output:
> shows %>% inner_join(episodes, by = "Show")
Network Show Episode
1 Network 1 TvShow 1 Episode 1
2 Network 1 TvShow 1 Episode 2
3 Network 1 TvShow 1 Episode 2
4 Network 1 TvShow 2 Episode 1
5 Network 1 TvShow 2 Episode 2
6 Network 1 TvShow 3 Episode 1
>
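If neither sqldf nor dplyr is available, the same inner join can probably be done with base R's merge(). The sketch below is an assumption rather than part of the original answer: it rebuilds shows and episodes by hand from the sample data (including the repeated TvShow 1 / Episode 2 row) so that it runs on its own.

```r
# Hand-built copies of the shows and episodes data frames from the answer above
shows <- data.frame(
  Network = "Network 1",
  Show    = paste("TvShow", 1:4),
  stringsAsFactors = FALSE
)
episodes <- data.frame(
  Show    = c("TvShow 1", "TvShow 1", "TvShow 2", "TvShow 2", "TvShow 3", "TvShow 1"),
  Episode = c("Episode 1", "Episode 2", "Episode 1", "Episode 2", "Episode 1", "Episode 2"),
  stringsAsFactors = FALSE
)

# merge() performs an inner join by default; reorder the columns to
# Network, Show, Episode to match the layout requested in the question
merged <- merge(shows, episodes, by = "Show")[, c("Network", "Show", "Episode")]
merged
```

Because the join is an inner join, TvShow 4 (which has no episodes) is dropped; passing all.x = TRUE to merge() would keep it with an NA episode.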
Answer 1 (score: 0)
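My best guess is to split Docs into two separate sets, TvShows and Episodes, for example TvShows = filter(Docs, stringr::str_detect(subDoc, "TvShow")), then drop the Parent column, adjust the column names, and full_join the two.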
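The suggestion above can be sketched roughly as follows. This is a minimal illustration, not the answerer's own code: the Docs data frame is retyped from the question's sample, and the str_detect() patterns only work because the dummy values literally contain "TvShow" and "Episode".

```r
library(dplyr)
library(stringr)

# Sample data retyped from the question (the repeated last row is kept as-is)
Docs <- data.frame(
  docFrom = c(NA, rep("Network 1", 4), "TvShow 1", "TvShow 1",
              "TvShow 2", "TvShow 2", "TvShow 3", "TvShow 1"),
  subDoc  = c("Network 1", "TvShow 1", "TvShow 2", "TvShow 3", "TvShow 4",
              "Episode 1", "Episode 2", "Episode 1", "Episode 2",
              "Episode 1", "Episode 2"),
  Parent  = c(1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  stringsAsFactors = FALSE
)

# Rows whose subDoc is a show: docFrom holds the network name
TvShows <- Docs %>%
  filter(str_detect(subDoc, "TvShow")) %>%
  select(Network = docFrom, Show = subDoc)

# Rows whose subDoc is an episode: docFrom holds the show name
Episodes <- Docs %>%
  filter(str_detect(subDoc, "Episode")) %>%
  select(Show = docFrom, Episode = subDoc)

# full_join keeps shows without episodes; their Episode is NA
result <- full_join(TvShows, Episodes, by = "Show")
result
```

Since the question says the real subDoc values are arbitrary names, filtering on Parent == 2 and Parent == 3 instead of matching text would likely be the safer choice on the actual data.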