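I have a single field holding semantic tags and semantic tag types. Each tag type/tag pair is comma-separated, and within each pair the tag type and the tag are separated by a colon (see below).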
ID | Semantic Tags
1 | Person:mitch mcconnell, Person:ashley judd, Position:senator
2 | Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics
3 | Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican
4 | Person:ashley judd, topicname:politics
5 | URL:www.huffingtonpost.com, Company:usa today, Person:chuck todd, Company:msnbc
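I want to split each tag type (the term before the colon) and tag (the term after the colon) into two separate fields: "Tag Type" and "Tag". The resulting file should look like this: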
ID | Tag Type | Tag
1 | Person | mitch McConnell
1 | Person | ashley judd
1 | Position | senator
2 | Person | mitch McConnell
2 | Position | senator
2 | State | kentucky
Here's the code I have so far...
tag<-strsplit(as.character(emtable$Symantic.Tags),",")
tagtype<-strsplit(as.character(tag),":")
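But after that, I'm lost! I believe I need to use lapply or sapply, but I'm not sure where it goes...
My apologies if this has been answered elsewhere on the site in another form. I'm new to R and this is still a bit complicated for me.
Thanks in advance for anyone's help.
Answer 0 (score: 4)
Here's another (slightly different) approach: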
## dat <- readLines(n=5)
## Person:mitch mcconnell, Person:ashley judd, Position:senator
## Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics
## Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican
## Person:ashley judd, topicname:politics
## URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info
dat3 <- lapply(strsplit(dat, ","), function(x) gsub("^\\s+|\\s+$", "", x))
#dat3 <- lapply(dat3, function(x) x[grepl("Person|Position", x)]) #optional: keep only Person/Position tags
dat3 <- lapply(dat3, strsplit, ":(?!/)", perl=TRUE) #break on : not followed by /
dat3 <- data.frame(ID=rep(seq_along(dat3), sapply(dat3, length)),
    do.call(rbind, lapply(dat3, function(x) do.call(rbind, x)))
)
colnames(dat3)[-1] <- c("Tag Type", "Tag")
## ID Tag Type Tag
## 1 1 Person mitch mcconnell
## 2 1 Person ashley judd
## 3 1 Position senator
## 4 2 Person mitch mcconnell
## 5 2 Position senator
## 6 2 ProvinceOrState kentucky
## 7 2 topicname politics
## 8 3 Person mitch mcconnell
## 9 3 Person ashley judd
## 10 3 Organization senate
## 11 3 Organization republican
## 12 4 Person ashley judd
## 13 4 topicname politics
## 14 5 URL www.huffingtonpost.com
## 15 5 URL http://www.regular-expressions.info
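To see why the split pattern is `":(?!/)"` rather than a plain `":"`, here is a minimal sketch on a single string. The variable names `x`, `plain` and `lookahead` are only for illustration and do not appear in the answer.

```r
# Illustrative string containing a URL with its own colon
x <- "URL:http://www.regular-expressions.info"

# Splitting on every colon also breaks the URL apart
plain <- strsplit(x, ":")[[1]]

# Splitting only on a colon NOT followed by "/" keeps the URL whole
# (the negative lookahead needs perl = TRUE)
lookahead <- strsplit(x, ":(?!/)", perl = TRUE)[[1]]

print(plain)
print(lookahead)
```

With the plain pattern the tag would end up spread over more than two pieces, which would break the later step that binds every pair into a two-column matrix.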
Thorough explanation:
## dat <- readLines(n=5)
## Person:mitch mcconnell, Person:ashley judd, Position:senator
## Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics
## Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican
## Person:ashley judd, topicname:politics
## URL:www.huffingtonpost.com, URL:http://www.regular-expressions.info
dat3 <- lapply(strsplit(dat, ","), function(x) gsub("^\\s+|\\s+$", "", x))
#dat3 <- lapply(dat3, function(x) x[grepl("Person|Position", x)]) #optional: keep only Person/Position tags
dat3 <- lapply(dat3, strsplit, ":(?!/)", perl=TRUE) #break on : not followed by /
# Let the explanation begin...
# Here I have a short list of the variables from the rows
# of the original dataframe; in this case the row numbers:
seq_along(dat3) #row variables
# then I use sapply and length to figure out how long the
# split variables in each row (now a list) are
sapply(dat3, length) #n times
# this tells me how many times to repeat the short list of
# variables. This is because I stretch the dat3 list to a vector
# Here I rep the row variables n times
rep(seq_along(dat3), sapply(dat3, length))
# better assign that for later:
ID <- rep(seq_along(dat3), sapply(dat3, length))
#============================================
# Now to explain the next chunk...
# I take dat3
dat3
# Each of the 5 elements in the list is itself a list of
# character vectors of length 2 (Tag Type, Tag).
# For instance, here's element 5: a list of two
# character vectors of length 2
## [[5]]
## [[5]][[1]]
## [1] "URL" "www.huffingtonpost.com"
##
## [[5]][[2]]
## [1] "URL" "http://www.regular-expressions.info"
# Use str to look at this structure:
dat3[[5]]
str(dat3[[5]])
## List of 2
## $ : chr [1:2] "URL" "www.huffingtonpost.com"
## $ : chr [1:2] "URL" "http://www.regular-expressions.info"
# I use lapply (list apply) to apply an anonymous function:
# function(x) do.call(rbind, x)
#
# to each of the 5 elements. This basically glues the list
# of vectors together to make a matrix. Observe just on element 5:
do.call(rbind, dat3[[5]])
## [,1] [,2]
## [1,] "URL" "www.huffingtonpost.com"
## [2,] "URL" "http://www.regular-expressions.info"
# We use lapply to do that to all elements:
lapply(dat3, function(x) do.call(rbind, x))
# We then call do.call(rbind, ...) on this list to get a
# single matrix
do.call(rbind, lapply(dat3, function(x) do.call(rbind, x)))
# Let's assign that for later:
the_mat <- do.call(rbind, lapply(dat3, function(x) do.call(rbind, x)))
#============================================
# Now we put it all together with data.frame:
data.frame(ID, the_mat)
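The steps walked through above can be collected into one small function. This is only a sketch: `split_tags()` and the column names `TagType` and `Tag` are hypothetical names chosen here, and it assumes every pair contains exactly one separating colon. It uses `trimws()` and `lengths()`, which need R 3.2.0 or later, in place of the `gsub()` and `sapply(..., length)` calls in the answer.

```r
# split_tags() is a hypothetical helper name, not part of base R
split_tags <- function(tags) {
  # 1. split each row on commas and trim surrounding whitespace
  pairs <- lapply(strsplit(as.character(tags), ","), trimws)
  # 2. split each "type:tag" pair on a colon that is not followed by "/"
  parts <- lapply(pairs, strsplit, ":(?!/)", perl = TRUE)
  # 3. repeat each row number once per pair found in that row
  id <- rep(seq_along(parts), lengths(parts))
  # 4. stack the length-2 vectors into one two-column matrix
  mat <- do.call(rbind, lapply(parts, function(x) do.call(rbind, x)))
  data.frame(ID = id, TagType = mat[, 1], Tag = mat[, 2],
             stringsAsFactors = FALSE)
}

# Usage on two made-up rows in the same format as the question's data
out <- split_tags(c("Person:ashley judd, topicname:politics",
                    "URL:www.huffingtonpost.com, Company:msnbc"))
print(out)
```

Passing the tag column of the original data frame (for example `emtable$Symantic.Tags` from the question) should work the same way, since the input is coerced with `as.character()` first.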
Answer 1 (score: 3)
DF
## ID Semantic.Tags
## 1 1 Person:mitch mcconnell, Person:ashley judd, Position:senator
## 2 2 Person:mitch mcconnell, Position:senator, ProvinceOrState:kentucky, topicname:politics
## 3 3 Person:mitch mcconnell, Person:ashley judd, Organization:senate, Organization:republican
## 4 4 Person:ashley judd, topicname:politics
## 5 5 URL:www.huffingtonpost.com, Company:usa today, Person:chuck todd, Company:msnbc
ll <- lapply(strsplit(DF$Semantic.Tags, ","), strsplit, split = ":")
f <- function(x) do.call(rbind, x)
f(lapply(ll, f))
## [,1] [,2]
## [1,] " Person" "mitch mcconnell"
## [2,] " Person" "ashley judd"
## [3,] " Position" "senator"
## [4,] " Person" "mitch mcconnell"
## [5,] " Position" "senator"
## [6,] " ProvinceOrState" "kentucky"
## [7,] " topicname" "politics "
## [8,] " Person" "mitch mcconnell"
## [9,] " Person" "ashley judd"
## [10,] " Organization" "senate"
## [11,] " Organization" "republican "
## [12,] " Person" "ashley judd"
## [13,] " topicname" "politics"
## [14,] " URL" "www.huffingtonpost.com"
## [15,] " Company" "usa today"
## [16,] " Person" "chuck todd"
## [17,] " Company" "msnbc"
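The matrix above still carries stray spaces around some values and has no ID column. One possible way to finish it off is sketched below, assuming R 3.2.0 or later for `trimws()` and `lengths()`. `DF_demo`, `m` and `res` are hypothetical names, and `DF_demo` is a made-up two-row stand-in for `DF`.

```r
# Made-up stand-in for DF, with stray spaces around some values
DF_demo <- data.frame(
  ID = 1:2,
  Semantic.Tags = c("Person:ashley judd, topicname:politics ",
                    " URL:www.huffingtonpost.com, Company:usa today"),
  stringsAsFactors = FALSE
)

# Same double split as above; as.character() guards against factor columns
ll <- lapply(strsplit(as.character(DF_demo$Semantic.Tags), ","),
             strsplit, split = ":")
f <- function(x) do.call(rbind, x)
m <- f(lapply(ll, f))

# Repeat each ID once per tag pair and trim whitespace column by column
res <- data.frame(ID = rep(DF_demo$ID, lengths(ll)),
                  Tag.Type = trimws(m[, 1]),
                  Tag = trimws(m[, 2]),
                  stringsAsFactors = FALSE)
print(res)
```

Note that splitting on a plain `":"` would break tags that contain a colon themselves, such as URLs starting with `http://`; the sample data here has none.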