I have the following data.frame:
ARTICLE <- c("I'M ARTICLE #1","I'M ARTICLE #2","I'M ARTICLE #3","I'M ARTICLE #4")
SUBJECT.1 <- c("POLLUTION", "ACQUIRED", "INSIDER TRADING", "MERGERS & ACQUISITIONS")
SUBJECT.2 <- c("FRAUD", "POLLUTION & DAMAGES", "FRAUD & INSIDER TRADING", "OIL SPILLS")
SUBJECT.3 <- c("OIL", "BIOFUELS", "OIL SPILLS & WASTE", "EMISSIONS")
mydf <- data.frame(ARTICLE, SUBJECT.1, SUBJECT.2, SUBJECT.3)
mydf
#          ARTICLE              SUBJECT.1               SUBJECT.2          SUBJECT.3
# 1 I'M ARTICLE #1              POLLUTION                   FRAUD                OIL
# 2 I'M ARTICLE #2               ACQUIRED     POLLUTION & DAMAGES           BIOFUELS
# 3 I'M ARTICLE #3        INSIDER TRADING FRAUD & INSIDER TRADING OIL SPILLS & WASTE
# 4 I'M ARTICLE #4 MERGERS & ACQUISITIONS              OIL SPILLS          EMISSIONS
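I would like to group some of the subjects and create dummy-variable columns. I want 4 columns named POLLUTION, OILSPILLS, MERGERS and FRAUD. A column should contain 1 only if certain words, or parts of words, appear in at least one of the three SUBJECT columns, and 0 otherwise: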
# POLLUTION: if the words "POLLUTION", "EMISSION", "WASTE" appear in one or more of the columns
# OILSPILLS: if the word "OIL SPILL" appears in one or more of the columns
# MERGERS: if the words "MERGER", "ACQUI" appear in one or more of the columns
# FRAUD: if the words "FRAUD", "CRIME" appear in one or more of the columns
The output should look like this:
#          ARTICLE              SUBJECT.1               SUBJECT.2          SUBJECT.3 POLLUTION OILSPILLS MERGERS FRAUD
# 1 I'M ARTICLE #1              POLLUTION                   FRAUD                OIL         1         0       0     1
# 2 I'M ARTICLE #2               ACQUIRED     POLLUTION & DAMAGES           BIOFUELS         1         0       1     0
# 3 I'M ARTICLE #3        INSIDER TRADING FRAUD & INSIDER TRADING OIL SPILLS & WASTE         1         1       0     1
# 4 I'M ARTICLE #4 MERGERS & ACQUISITIONS              OIL SPILLS          EMISSIONS         1         1       1     0
Since I have no idea how to approach this, I haven't really been able to try anything.
Thanks!
Answer 0 (score: 3)
Slightly modified from BondedDust's answer:
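# One regular expression per dummy column; the names become the column names
vec1 <- c(POLLUTION="POLLUTION|EMISSION|WASTE", OILSPILLS="OIL SPILL",
          MERGERS="MERGER|ACQUI", FRAUD="FRAUD|CRIME")
# For each pattern, test every row of the SUBJECT columns; +0 turns TRUE/FALSE into 1/0
dummies <- sapply(vec1, function(x) apply(mydf[,-1], 1, function(y) any(grepl(x, y)))) + 0
cbind(mydf, dummies)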
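As a possible variation (not part of the original answer), the inner apply() can be avoided by pasting the SUBJECT columns into one string per row and calling grepl() once per pattern. This is only a minimal sketch: the names `subjects` and `dummies` and the " | " separator are illustrative choices. The separator is there so that a pattern such as "OIL SPILL" cannot match across two neighbouring columns.

```r
# Example data from the question.
mydf <- data.frame(
  ARTICLE   = c("I'M ARTICLE #1", "I'M ARTICLE #2", "I'M ARTICLE #3", "I'M ARTICLE #4"),
  SUBJECT.1 = c("POLLUTION", "ACQUIRED", "INSIDER TRADING", "MERGERS & ACQUISITIONS"),
  SUBJECT.2 = c("FRAUD", "POLLUTION & DAMAGES", "FRAUD & INSIDER TRADING", "OIL SPILLS"),
  SUBJECT.3 = c("OIL", "BIOFUELS", "OIL SPILLS & WASTE", "EMISSIONS"),
  stringsAsFactors = FALSE
)

# One regular expression per dummy column (same patterns as in the answer).
vec1 <- c(POLLUTION = "POLLUTION|EMISSION|WASTE", OILSPILLS = "OIL SPILL",
          MERGERS = "MERGER|ACQUI", FRAUD = "FRAUD|CRIME")

# Collapse the SUBJECT columns into a single string per row.
subjects <- do.call(paste, c(mydf[-1], sep = " | "))

# grepl() is vectorised over the rows, so one call per pattern is enough.
dummies <- sapply(vec1, function(p) as.integer(grepl(p, subjects)))

result <- cbind(mydf, dummies)
result
```

Because `vec1` is a named vector, sapply() carries the names over, so the new columns arrive already labelled POLLUTION, OILSPILLS, MERGERS and FRAUD.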
Answer 1 (score: 2)
# t() is needed because apply() returns one column per input row
t( apply(mydf[-1], 1, function(x) as.numeric( c(
    any( grepl("POLLUTION|EMISSION|WASTE", x) ),
    any( grepl("OIL\\sSPILL", x) ),
    any( grepl("MERGER|ACQUI", x) ),
    any( grepl("FRAUD|CRIME", x) ) )
  ) )
)
#-------
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    1
[2,]    1    0    1    0
[3,]    1    1    0    1
[4,]    1    1    1    0
cbind(mydf, .Last.value)
         ARTICLE              SUBJECT.1               SUBJECT.2
1 I'M ARTICLE #1              POLLUTION                   FRAUD
2 I'M ARTICLE #2               ACQUIRED     POLLUTION & DAMAGES
3 I'M ARTICLE #3        INSIDER TRADING FRAUD & INSIDER TRADING
4 I'M ARTICLE #4 MERGERS & ACQUISITIONS              OIL SPILLS
           SUBJECT.3 1 2 3 4
1                OIL 1 0 0 1
2           BIOFUELS 1 0 1 0
3 OIL SPILLS & WASTE 1 1 0 1
4          EMISSIONS 1 1 1 0
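There may be more elegant approaches, but this one seemed "obvious" enough that this small brain could get it down "on paper". Naming the columns seems trivial and can be left as an "exercise for the reader".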
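The naming step left as an exercise might look like the following minimal sketch. It assumes the fourth test is meant to be "FRAUD|CRIME", as the question specifies; `res` and `out` are illustrative names, not part of the original answer.

```r
# Example data from the question.
mydf <- data.frame(
  ARTICLE   = c("I'M ARTICLE #1", "I'M ARTICLE #2", "I'M ARTICLE #3", "I'M ARTICLE #4"),
  SUBJECT.1 = c("POLLUTION", "ACQUIRED", "INSIDER TRADING", "MERGERS & ACQUISITIONS"),
  SUBJECT.2 = c("FRAUD", "POLLUTION & DAMAGES", "FRAUD & INSIDER TRADING", "OIL SPILLS"),
  SUBJECT.3 = c("OIL", "BIOFUELS", "OIL SPILLS & WASTE", "EMISSIONS"),
  stringsAsFactors = FALSE
)

# One any(grepl()) test per group, evaluated row by row;
# t() turns the per-row results back into rows.
res <- t(apply(mydf[-1], 1, function(x) as.numeric(c(
  any(grepl("POLLUTION|EMISSION|WASTE", x)),
  any(grepl("OIL\\sSPILL", x)),
  any(grepl("MERGER|ACQUI", x)),
  any(grepl("FRAUD|CRIME", x))
))))

# Name the dummy columns before binding them to the original data.
colnames(res) <- c("POLLUTION", "OILSPILLS", "MERGERS", "FRAUD")
out <- cbind(mydf, res)
out
```

Assigning the matrix to a variable instead of relying on `.Last.value` also lets the code run unchanged inside a script, where `.Last.value` is not updated.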