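I'm cleaning up a giant dataset that resulted from using tabulizer() on a PDF.
The columns were delineated properly, but I have many rows where the original cell was large and tabulizer read it in as several rows, with all cells blank except for the large cell. I need to collapse the data frame so that these rows "collapse" into the lowest complete row.
As you can see, the columns in which these "extra rows" occur differ by row (in one case it's species, in others it's area.of.operation). I'd like to collapse them into complete rows, so that row 1 stays the same, row 2 is really rows 2:6 collapsed, row 7 stays the same, and so on. I don't even know if R is the best tool for this, but I would love a dplyr solution. Sample data frame below.
Thanks.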
mydata <- structure(list(X = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 11L, 12L, 13L,
17L), target.species = structure(c(4L, 1L, 1L, 1L, 1L, 5L, 4L,
1L, 1L, 2L, 3L), .Label = c("", "hake", "hake, southern", "rosefish",
"squid, cuttlefish,"), class = "factor"), gear = structure(c(2L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 3L, 2L), .Label = c("", "trawl, bottom",
"trawl, midwater"), class = "factor"), number.boats = structure(c(2L,
1L, 1L, 1L, 1L, 3L, 5L, 1L, 1L, 4L, 4L), .Label = c("", "18 vessels",
"98 refrigerated high", "none provided", "seas vessels"), class = "factor"),
company = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L,
2L, 2L), .Label = c("", "not applicable"), class = "factor"),
area.of.operation = structure(c(2L, 1L, 1L, 1L, 3L, 4L, 2L,
3L, 4L, 2L, 5L), .Label = c("", "above provinces", "annual fishery; EEZ",
"concentrated around", "deepwater coastal"), class = "factor"),
species = structure(c(6L, 3L, 4L, 5L, 9L, 8L, 7L, 9L, 8L,
1L, 2L), .Label = c("Fur seal", "none provided", "otter",
"otter, river", "porpoise", "seal", "Seal", "South American Sea lion,",
"spectacled porpoise,"), class = "factor"), estimates = structure(c(2L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 2L), .Label = c("", "none"
), class = "factor")), class = "data.frame", row.names = c(NA,
-11L))
Answer 0 (score: 0)
The older cumsum-split-ting strategy, paste-ing each column with collapse = "," and then sub-ing out the extra commas, gets you most of the way there:
t( as.data.frame( # transpose because of the column oriented nature of R's apply returns
lapply( split(mydata, cumsum( mydata$target.species != "")),
function(d){ sub(",.*,", ",", lapply( d, paste, collapse=","))})))
[,1] [,2] [,3] [,4] [,5] [,6]
X1 "1,5" "rosefish," "trawl," "18 vessels," "not applicable," "above provinces,annual fishery; EEZ"
X2 "6" "squid," "" "98 refrigerated high" "" "concentrated around"
X3 "7,12" "rosefish," "trawl," "seas vessels," "not applicable," "above provinces,concentrated around"
X4 "13" "hake" "trawl, midwater" "none provided" "not applicable" "above provinces"
X5 "17" "hake, southern" "trawl, bottom" "none provided" "not applicable" "deepwater coastal"
[,7] [,8]
X1 "seal," "none,"
X2 "South American Sea lion," ""
X3 "Seal," "none,"
X4 "Fur seal" "none"
X5 "none provided" "none"
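For the dplyr form the question asks for, the same cumsum key can serve as a grouping variable. What follows is a minimal sketch under stated assumptions, not a definitive solution. It runs on a small made-up data frame (toy) rather than mydata. It assumes that a non-empty id column marks a complete row, and it numbers the groups from the bottom up so that fragment rows join the next complete row below them. The names toy, grp and collapse_cell are hypothetical.

```r
library(dplyr)

# Toy data (made up for illustration): rows with id == "" are fragments
# that belong to the next complete row below them.
toy <- data.frame(
  id   = c("a", "", "", "b", "c"),
  note = c("x", "p", "q", "r", "s"),
  stringsAsFactors = FALSE
)

# Paste the non-empty pieces of one column together, separated by a space.
collapse_cell <- function(x) {
  x <- as.character(x)  # also handles factor columns
  paste(x[x != ""], collapse = " ")
}

collapsed <- toy %>%
  # Count complete rows from the bottom up, so each complete row shares a
  # group with the fragment rows directly above it.
  mutate(grp = rev(cumsum(rev(id != "")))) %>%
  group_by(grp) %>%
  summarise(across(everything(), collapse_cell), .groups = "drop") %>%
  # A larger grp value means the group sits nearer the top of the table.
  arrange(desc(grp)) %>%
  select(-grp)

print(collapsed)
```

Dropping the two rev() calls gives the top-down grouping used in the split() call above, where fragments attach to the complete row above them instead. For mydata, the column that marks a complete row would have to be chosen by inspection; that choice is an assumption of this sketch.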