Spreading and merging a stacked table

Time: 2017-10-29 07:06:07

Tags: r dataframe merge tidyr

I downloaded the SKOS Schema tables from the W3C to prepare for a vocabulary mapping task. Here is a sample of the data, built with "dput":

> dput(skosc)
structure(list(X1 = c("skos:Collection", "URI:", "Definition:", 
"Label:", "Disjoint classes:", "skos:Concept", "URI:", "Definition:", 
"Label:", "Disjoint classes:", "skos:ConceptScheme", "URI:", 
"Definition:", "Label:", "Disjoint classes:", "skos:OrderedCollection", 
"URI:", "Definition:", "Label:", "Super-classes:"), X2 = c("skos:Collection", 
"http://www.w3.org/2004/02/skos/core#Collection", "Section 9. \r\n      Concept Collections", 
"Collection", "skos:Conceptskos:ConceptScheme", "skos:Concept", 
"http://www.w3.org/2004/02/skos/core#Concept", "Section 3. The \r\n      skos:Concept Class", 
"Concept", "skos:Collectionskos:ConceptScheme", "skos:ConceptScheme", 
"http://www.w3.org/2004/02/skos/core#ConceptScheme", "Section 4. \r\n      Concept Schemes", 
"Concept Scheme", "skos:Collectionskos:Concept", "skos:OrderedCollection", 
"http://www.w3.org/2004/02/skos/core#OrderedCollection", "Section 9. \r\n      Concept Collections", 
"Ordered Collection", "skos:Collection")), .Names = c("X1", "X2"
), class = "data.frame", row.names = c(NA, -20L))

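Besides the subheading of each small table (e.g. "skos:Collection", "skos:Concept", etc.), there is one more oddity in this stacked table that we have to watch for: the row labels are not all the same either. For example, row 20 in the sample is labelled "Super-classes:", where the small tables above it have "Disjoint classes:". My plan is to split this stacked table and transform it as follows: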

Before:

[image: the original table before processing]

After:

[image: what the table should look like after processing]

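Both "dplyr" and "tidyr" are good at manipulating tables, so I chose the "spread" function, which turns a table from long and narrow into short and wide. Unfortunately, it failed: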

> skosns<-"http://www.w3.org/2009/08/skos-reference/skos.html"
> require(rvest)
Loading required package: rvest
Loading required package: xml2
> skospg<-read_html(skosns, encoding = "UTF-8", options = c("RECOVER", "NOERROR", "NSCLEAN"))
> skosnd<-html_nodes(skospg, "table")
> skosc<-html_table(skosnd[[1]], header = FALSE, trim = TRUE, fill = FALSE, dec = ".")
> skosp<-html_table(skosnd[[2]], header = FALSE, trim = TRUE, fill = FALSE, dec = ".")
> require(tidyr)
Loading required package: tidyr
> spread(skosc, key = X1, value = X2)
Error: Duplicate identifiers for rows (3, 8, 13, 18), (5, 10, 15), (4, 9, 14, 19), (2, 7, 12, 17)

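The error message does not tell me much about the cause, but I guess the odd row may be what triggers it. Can we ignore the differences between the small tables and simply spread the identical labels into separate columns?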
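One way to read the "Duplicate identifiers" message is that the same key value occurs more than once in X1, so spread cannot tell which row each value belongs to. The sketch below checks this with base R on a hypothetical two-block subset of the data (the name `toy` is made up for illustration and is not part of the original session):

```r
# Hypothetical subset of the stacked table: two small blocks only
toy <- data.frame(
  X1 = c("skos:Collection", "URI:", "Label:",
         "skos:Concept", "URI:", "Label:"),
  X2 = c("skos:Collection",
         "http://www.w3.org/2004/02/skos/core#Collection",
         "Collection",
         "skos:Concept",
         "http://www.w3.org/2004/02/skos/core#Concept",
         "Concept"),
  stringsAsFactors = FALSE
)

# Count how often each key appears in the key column
key_counts <- table(toy$X1)

# Keys that appear more than once have no unique row identifier
dup_keys <- names(key_counts)[key_counts > 1]
print(dup_keys)
```

In this subset the repeated keys are "URI:" and "Label:", which matches the kind of row groups listed in the error. The odd "Super-classes:" row appears only once, so under this reading it is probably not the cause.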

  • Question updated:

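The code akrun posted in the comments was very helpful. I learned that when a key has more than one value in a column, we need to group the rows and mutate the structure first; after that the data frame can be spread. Thanks, akrun! Now for the last step: remove the vocabulary-name columns (e.g. "skos:Collection") and move their values into the corresponding rows. Writing functions is a weak spot of mine, though, so it is no surprise that the program failed: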

> require(rvest)
> skospg<-read_html(skosns, encoding = "UTF-8", options = c("RECOVER", "NOERROR", "NSCLEAN"))
> skosnd<-html_nodes(skospg, "table")
> skosc<-html_table(skosnd[[1]], header = FALSE, trim = TRUE, fill = FALSE, dec = ".")
> require(dplyr)
> skosc_g<-group_by(skosc, X1)
> skosc_m<-mutate(skosc_g, n = row_number())
> require(tidyr)
Loading required package: tidyr
> skosc_t<-spread(skosc_m, key = X1, value = X2)
> vocn<-select_all(skosc_t, funs(colnames=grep("[[:alpha:]]+:[[:alpha:]]+")))
Error in grep("[[:alpha:]]+:[[:alpha:]]+") : 
  argument "x" is missing, with no default
> merge.data.frame(vocn, skosc_t, by=c("Collection", "Concept", "ConceptScheme"))
Error in as.data.frame(x) : object 'vocn' not found
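
The first error above seems to come from calling grep() with only a pattern and no "x" argument to search in; the second follows because vocn was never created. As a minimal sketch, assuming the aim is to pick out the columns whose names look like vocabulary terms, base R can match the pattern against the column names directly. The data frame `toy_wide` is a hypothetical stand-in for skosc_t:

```r
# Hypothetical stand-in for the spread result (not the real skosc_t)
toy_wide <- data.frame(n = 1:2, stringsAsFactors = FALSE)
toy_wide[["skos:Collection"]] <- c("skos:Collection", NA)
toy_wide[["URI:"]] <- c("http://www.w3.org/2004/02/skos/core#Collection",
                        "http://www.w3.org/2004/02/skos/core#Concept")
toy_wide[["skos:Concept"]] <- c(NA, "skos:Concept")

# grep() needs the vector to search (x); here that is the column names.
# The anchors ^ and $ keep labels such as "URI:" from matching.
voc_cols <- grep("^[[:alpha:]]+:[[:alpha:]]+$", names(toy_wide), value = TRUE)

# Keep only the vocabulary-name columns; drop = FALSE keeps a data frame
vocn <- toy_wide[, voc_cols, drop = FALSE]
print(voc_cols)
```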

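The plan for this step is to extract the columns whose values are vocabulary names {skosc_t[c(5,6,7,8),]} and then merge them with the data frame from which those columns have already been removed {skosc_t[c(2,3,4,9,10),]}:

[image: the whole skosc_t data frame]

What is the right way to do this? Thanks a lot.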
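
One possible direction, offered as a sketch and not a confirmed solution, is to skip the extract-and-merge step: tag every row with the vocabulary name of its block before reshaping, so that the name becomes the row identifier. The code below uses base R only and a reduced two-block copy of the dput sample (with the line breaks inside the definitions collapsed). It assumes that a header row is one where X1 and X2 are identical and that the table starts with a header row; `term`, `block`, and `wide` are names introduced here for illustration.

```r
# Reduced copy of the sample data: first and last block only
skosc <- data.frame(
  X1 = c("skos:Collection", "URI:", "Definition:", "Label:",
         "Disjoint classes:",
         "skos:OrderedCollection", "URI:", "Definition:", "Label:",
         "Super-classes:"),
  X2 = c("skos:Collection",
         "http://www.w3.org/2004/02/skos/core#Collection",
         "Section 9. Concept Collections", "Collection",
         "skos:Conceptskos:ConceptScheme",
         "skos:OrderedCollection",
         "http://www.w3.org/2004/02/skos/core#OrderedCollection",
         "Section 9. Concept Collections", "Ordered Collection",
         "skos:Collection"),
  stringsAsFactors = FALSE
)

# Assumption: a header row repeats the vocabulary name in both columns
is_header <- skosc$X1 == skosc$X2

# A running count of header rows gives each small table its own id
block <- cumsum(is_header)

# Carry each block's vocabulary name down to all of its rows
skosc$term <- skosc$X1[is_header][block]

# Drop the header rows; what remains is (term, label, value) triples
body <- skosc[!is_header, ]
fields <- unique(body$X1)
terms <- unique(body$term)

# Build one row per term and one column per label; a label that a
# block lacks (e.g. "Super-classes:") is left as NA for that term
wide <- data.frame(term = terms, stringsAsFactors = FALSE)
for (f in fields) {
  rows <- body[body$X1 == f, ]
  wide[[f]] <- rows$X2[match(terms, rows$term)]
}
print(wide)
```

Because each (term, label) pair is unique after this tagging, the same `term` column could presumably also serve as the identifier for a spread call on X1 and X2, without the extra row_number() column.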

0 answers:

No answers