Create vectors containing 0 or 1 based on subjects in R

Time: 2014-06-21 04:49:21

Tags: r vector dataframe

I have the following data.frame:

ARTICLE   <- c("I'M ARTICLE #1","I'M ARTICLE #2","I'M ARTICLE #3","I'M ARTICLE #4")
SUBJECT.1 <- c("POLLUTION", "ACQUIRED", "INSIDER TRADING", "MERGERS & ACQUISITIONS")
SUBJECT.2 <- c("FRAUD", "POLLUTION & DAMAGES", "FRAUD & INSIDER TRADING", "OIL SPILLS")
SUBJECT.3 <- c("OIL", "BIOFUELS", "OIL SPILLS & WASTE", "EMISSIONS")

mydf <- data.frame(ARTICLE, SUBJECT.1, SUBJECT.2, SUBJECT.3)
mydf

#          ARTICLE              SUBJECT.1               SUBJECT.2          SUBJECT.3
# 1 I'M ARTICLE #1              POLLUTION                   FRAUD                OIL
# 2 I'M ARTICLE #2               ACQUIRED     POLLUTION & DAMAGES           BIOFUELS
# 3 I'M ARTICLE #3        INSIDER TRADING FRAUD & INSIDER TRADING OIL SPILLS & WASTE
# 4 I'M ARTICLE #4 MERGERS & ACQUISITIONS              OIL SPILLS          EMISSIONS

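I want to group some of the subjects and create columns of dummy variables. I want four columns named POLLUTION, OILSPILLS, MERGERS, and FRAUD. A column should contain 1 only when certain words or word fragments appear in any of the three SUBJECT columns, and 0 otherwise: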

# POLLUTION: if the words "POLLUTION", "EMISSION", "WASTE" appear in one or more of the columns
# OILSPILLS: if the word "OIL SPILL" appears in one or more of the columns
# MERGERS: if the words "MERGER", "ACQUI" appear in one or more of the columns
# FRAUD: if the words "FRAUD", "CRIME" appear in one or more of the columns

The output should look like this:

#          ARTICLE              SUBJECT.1               SUBJECT.2          SUBJECT.3 POLLUTION OILSPILLS MERGERS FRAUD
# 1 I'M ARTICLE #1              POLLUTION                   FRAUD                OIL         1         0       0     1
# 2 I'M ARTICLE #2               ACQUIRED     POLLUTION & DAMAGES           BIOFUELS         1         0       1     0
# 3 I'M ARTICLE #3        INSIDER TRADING FRAUD & INSIDER TRADING OIL SPILLS & WASTE         1         1       0     1
# 4 I'M ARTICLE #4 MERGERS & ACQUISITIONS              OIL SPILLS          EMISSIONS         1         1       1     0

Since I don't know how to approach this, I haven't really been able to try anything.

Thanks!

2 answers:

Answer 0: (score: 3)

Slightly modified from BondedDust's answer:

vec1 <- c(POLLUTION="POLLUTION|EMISSION|WASTE", OILSPILLS="OIL SPILL",
        MERGERS="MERGER|ACQUI", FRAUD="FRAUD|CRIME")

sapply(vec1, function(x) apply(mydf[,-1],1, function(y) any(grepl(x, y))))+0
cbind(mydf, .Last.value)
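
A minimal self-contained sketch of the same approach, for anyone running it outside the console. `.Last.value` is updated by R's top-level evaluation, so it typically does not hold the matrix when the code runs inside a function or via `source()`. The names `flags` and `result` are assumptions introduced here for illustration, not part of the original answer.

```r
# Rebuild the example data from the question
mydf <- data.frame(
  ARTICLE   = c("I'M ARTICLE #1", "I'M ARTICLE #2", "I'M ARTICLE #3", "I'M ARTICLE #4"),
  SUBJECT.1 = c("POLLUTION", "ACQUIRED", "INSIDER TRADING", "MERGERS & ACQUISITIONS"),
  SUBJECT.2 = c("FRAUD", "POLLUTION & DAMAGES", "FRAUD & INSIDER TRADING", "OIL SPILLS"),
  SUBJECT.3 = c("OIL", "BIOFUELS", "OIL SPILLS & WASTE", "EMISSIONS"),
  stringsAsFactors = FALSE
)

# One regular expression per dummy column; the names become the column names
vec1 <- c(POLLUTION = "POLLUTION|EMISSION|WASTE", OILSPILLS = "OIL SPILL",
          MERGERS = "MERGER|ACQUI", FRAUD = "FRAUD|CRIME")

# For each pattern, check every row: TRUE if any SUBJECT column matches
# (`flags` is a hypothetical name for the intermediate logical matrix)
flags <- sapply(vec1, function(pat) {
  apply(mydf[, -1], 1, function(row) any(grepl(pat, row)))
})

# Adding 0 coerces the logical matrix to 0/1 before binding it to the data
result <- cbind(mydf, flags + 0)
result
```

Keeping the patterns in a named vector means a new subject group only needs one more entry in `vec1`.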

Answer 1: (score: 2)

t( apply(mydf[-1], 1, function(x) as.numeric( c(
   # need the t() to change columns to rows
   any( grepl("POLLUTION|EMISSION|WASTE", x) ),
   any( grepl("OIL\\sSPILL", x) ),
   any( grepl("MERGER|ACQUI", x) ),
   any( grepl("FRAUD|CRIME", x) ) )   # fourth pattern is FRAUD, per the question
  ) )
 )
 #-------
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    1
[2,]    1    0    1    0
[3,]    1    1    0    1
[4,]    1    1    1    0

 cbind(mydf, .Last.value)
         ARTICLE              SUBJECT.1               SUBJECT.2
1 I'M ARTICLE #1              POLLUTION                   FRAUD
2 I'M ARTICLE #2               ACQUIRED     POLLUTION & DAMAGES
3 I'M ARTICLE #3        INSIDER TRADING FRAUD & INSIDER TRADING
4 I'M ARTICLE #4 MERGERS & ACQUISITIONS              OIL SPILLS
           SUBJECT.3 1 2 3 4
1                OIL 1 0 0 1
2           BIOFUELS 1 0 1 0
3 OIL SPILLS & WASTE 1 1 0 1
4          EMISSIONS 1 1 1 0

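There may be more elegant ways, but this one seemed "obvious" enough that this little brain could get it down on "paper". Naming the columns seems trivial and can be left as an "exercise for the reader".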
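
One possible take on that exercise, as a rough sketch: store the matrix in a variable, set its `colnames`, then bind it to the data. The variable names `flags` and `named` are assumptions introduced here, and the fourth pattern is `"FRAUD|CRIME"`, following the rules stated in the question.

```r
# Rebuild the example data from the question
mydf <- data.frame(
  ARTICLE   = c("I'M ARTICLE #1", "I'M ARTICLE #2", "I'M ARTICLE #3", "I'M ARTICLE #4"),
  SUBJECT.1 = c("POLLUTION", "ACQUIRED", "INSIDER TRADING", "MERGERS & ACQUISITIONS"),
  SUBJECT.2 = c("FRAUD", "POLLUTION & DAMAGES", "FRAUD & INSIDER TRADING", "OIL SPILLS"),
  SUBJECT.3 = c("OIL", "BIOFUELS", "OIL SPILLS & WASTE", "EMISSIONS"),
  stringsAsFactors = FALSE
)

# apply() returns one column per input row, so t() puts articles back in rows
# (`flags` is a hypothetical name for the resulting 0/1 matrix)
flags <- t(apply(mydf[-1], 1, function(x) as.numeric(c(
  any(grepl("POLLUTION|EMISSION|WASTE", x)),
  any(grepl("OIL\\sSPILL", x)),
  any(grepl("MERGER|ACQUI", x)),
  any(grepl("FRAUD|CRIME", x))
))))

# Name the columns in the same order as the patterns above
colnames(flags) <- c("POLLUTION", "OILSPILLS", "MERGERS", "FRAUD")

named <- cbind(mydf, flags)
named
```

The order of the names has to match the order of the `grepl()` calls, which is the main thing to watch when adding or removing a pattern.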