In my data.frame, each row has three SUBJECT columns. I want an additional column that holds exactly one subject per row. First, this is what my data looks like:
DATE <- c("1","2","3","4","5","6","7","1","2","3","4","5","6","7")
COMP <- c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
CLASS <- c("positive", "negative", "aneutral", "positive", "positive", "negative", "aneutral", "positive", "negative", "negative", "positive", "aneutral", "aneutral", "aneutral")
SUBJECT.1 <- c("LITIGATION","LAYOFF","POLLUTION","CHEMICAL DISASTER","PRESS RELEASE","PEOPLE","EMISSIONS","ENERGY","WASTE MANAGEMENT","EMPLOYEES","MANAGEMENT","PRESS RELEASE","HOTELS","POLLUTION")
SUBJECT.2 <- c("POLLUTION","EMPLOYEES","NUCLEAR","FUELS","STOCK OPTION PLAN","EXECUTIVES","CO2","SOLAR","POLLUTION","EXECUTIVES","PRESS RELEASE","CELEBRITIES","CELEBRITIES","LITIGATION")
SUBJECT.3 <- c("ENVIRONMENT","JOB REDUCTIONS","POWER PLANTS","POLLUTION","EMPLOYEES","FRAUD","CLIMATE CHANGE","SUSTAINABILITY","HAZARDOUS WASTE","BONUS PAY","LITIGATION","EMISSIONS","SCANDALS","SCANDALS")
CONTROLVAR <- c("11","13","13","14","13","14","12","11","13","13","14","13","14","12")
mydf <- data.frame(DATE, COMP, RET, CLASS, SUBJECT.1, SUBJECT.2, SUBJECT.3, CONTROLVAR, stringsAsFactors=FALSE)
mydf
#    DATE COMP   RET    CLASS         SUBJECT.1         SUBJECT.2       SUBJECT.3 CONTROLVAR
#  1    1    A -2.00 positive        LITIGATION         POLLUTION     ENVIRONMENT         11
#  2    2    A  1.10 negative            LAYOFF         EMPLOYEES  JOB REDUCTIONS         13
#  3    3    A  3.00 aneutral         POLLUTION           NUCLEAR    POWER PLANTS         13
#  4    4    A  1.40 positive CHEMICAL DISASTER             FUELS       POLLUTION         14
#  5    5    A -0.20 positive     PRESS RELEASE STOCK OPTION PLAN       EMPLOYEES         13
#  6    6    A  0.60 negative            PEOPLE        EXECUTIVES           FRAUD         14
#  7    7    A  0.10 aneutral         EMISSIONS               CO2  CLIMATE CHANGE         12
#  8    1    B -0.21 positive            ENERGY             SOLAR  SUSTAINABILITY         11
#  9    2    B -1.20 negative  WASTE MANAGEMENT         POLLUTION HAZARDOUS WASTE         13
# 10    3    B  0.90 negative         EMPLOYEES        EXECUTIVES       BONUS PAY         13
# 11    4    B  0.30 positive        MANAGEMENT     PRESS RELEASE      LITIGATION         14
# 12    5    B -0.10 aneutral     PRESS RELEASE       CELEBRITIES       EMISSIONS         13
# 13    6    B  0.30 aneutral            HOTELS       CELEBRITIES        SCANDALS         14
# 14    7    B -0.12 aneutral         POLLUTION        LITIGATION        SCANDALS         12
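Since I later want to include the subject as (mutually exclusive) dummy variables in a regression, I want a single column SUBJECT with exactly one subject per row. I want to focus on the subjects LITIGATION, POLLUTION and LAYOFF.
I want to check each SUBJECT column from left to right for the following word parts: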
c("LITIGAT", "CLAIM", "SUIT", "JUDG") -> LITIGATION
c("POLLUT", "WAST", "EMISSION") -> POLLUTION
c("LAYOFF") -> LAYOFF
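If the first column contains one of the word parts for LITIGATION, POLLUTION or LAYOFF, that subject is taken. If the first column holds a different subject, I check the second column, and so on. If none of the three subject columns contains any of the word parts for LITIGATION, POLLUTION or LAYOFF, the subject should be called OTHER.
The output should look like this: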
#    DATE COMP   RET    CLASS         SUBJECT.1         SUBJECT.2       SUBJECT.3    SUBJECT CONTROLVAR
#  1    1    A -2.00 positive        LITIGATION         POLLUTION     ENVIRONMENT LITIGATION         11
#  2    2    A  1.10 negative            LAYOFF         EMPLOYEES  JOB REDUCTIONS     LAYOFF         13
#  3    3    A  3.00 aneutral         POLLUTION           NUCLEAR    POWER PLANTS  POLLUTION         13
#  4    4    A  1.40 positive CHEMICAL DISASTER             FUELS       POLLUTION  POLLUTION         14
#  5    5    A -0.20 positive     PRESS RELEASE STOCK OPTION PLAN       EMPLOYEES      OTHER         13
#  6    6    A  0.60 negative            PEOPLE        EXECUTIVES           FRAUD      OTHER         14
#  7    7    A  0.10 aneutral         EMISSIONS               CO2  CLIMATE CHANGE  POLLUTION         12
#  8    1    B -0.21 positive            ENERGY             SOLAR  SUSTAINABILITY      OTHER         11
#  9    2    B -1.20 negative  WASTE MANAGEMENT         POLLUTION HAZARDOUS WASTE  POLLUTION         13
# 10    3    B  0.90 negative         EMPLOYEES        EXECUTIVES       BONUS PAY      OTHER         13
# 11    4    B  0.30 positive        MANAGEMENT     PRESS RELEASE      LITIGATION LITIGATION         14
# 12    5    B -0.10 aneutral     PRESS RELEASE       CELEBRITIES       EMISSIONS  POLLUTION         13
# 13    6    B  0.30 aneutral            HOTELS       CELEBRITIES        SCANDALS      OTHER         14
# 14    7    B -0.12 aneutral         POLLUTION        LITIGATION        SCANDALS  POLLUTION         12
Answer 0 (score: 1)
Try:
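# For each subject, grep the pasted SUBJECT columns for its word parts;
# stack() returns the matching row numbers (values) and the subject name (ind)
dat <- stack(sapply(c("LITIGATION", "POLLUTION", "LAYOFF"),
                    function(x) grep(paste(get(x), collapse = "|"),
                                     as.character(interaction(mydf[, 5:7], sep = " ")))))
# Add the rows that matched nothing (ind is NA for those)
dat2 <- merge(dat, data.frame(values = seq_len(nrow(mydf))), all = TRUE)
dat2N <- dat2[!duplicated(dat2$values), ]  # keep only the first match per row
dat2N$ind <- as.character(dat2N$ind)
dat2N$ind[is.na(dat2N$ind)] <- "OTHER"     # rows without a match become "OTHER"
transform(mydf, SUBJECT = dat2N$ind)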
You have already created three vectors (code pasted from the question):
c("LITIGAT", "CLAIM", "SUIT", "JUDG") -> LITIGATION
c("POLLUT", "WAST", "EMISSION") -> POLLUTION
c("LAYOFF") -> LAYOFF
So running the following code gives:
lapply(c("LITIGATION","POLLUTION", "LAYOFF"), function(x) get(x)) # get() looks up an object by its name
#[[1]]
#[1] "LITIGAT" "CLAIM" "SUIT" "JUDG"
#[[2]]
#[1] "POLLUT" "WAST" "EMISSION"
#[[3]]
#[1] "LAYOFF"
Then I pasted the components of each vector into a single string separated by "|", to use as the grep pattern:
sapply(c("LITIGATION","POLLUTION", "LAYOFF"), function(x) paste(get(x),collapse="|"))
# LITIGATION POLLUTION LAYOFF
#"LITIGAT|CLAIM|SUIT|JUDG" "POLLUT|WAST|EMISSION" "LAYOFF"
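To see why the collapsed string works as a pattern, here is a minimal sketch: in a regular expression "|" means "or", so the pasted string matches any value that contains at least one of the word parts. The three example values are made up for illustration ("LAWSUITS" does not occur in the question's data).

```r
# Word parts copied from the question
LITIGATION <- c("LITIGAT", "CLAIM", "SUIT", "JUDG")

# Collapse them into one alternation pattern
pattern <- paste(LITIGATION, collapse = "|")
pattern

# Hypothetical example values, only for illustration
examples <- c("LITIGATION", "LAWSUITS", "PRESS RELEASE")

# TRUE wherever at least one word part occurs in the value
hits <- grepl(pattern, examples)
hits
```

grepl returns one logical per element, whereas grep (used in the answer) returns the positions of the matching elements.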
as.character(interaction(mydf[,5:7],sep=" ")) # paste the relevant columns together row by row
#[1] "LITIGATION POLLUTION ENVIRONMENT"
#[2] "LAYOFF EMPLOYEES JOB REDUCTIONS"
#[3] "POLLUTION NUCLEAR POWER PLANTS"
#[4] "CHEMICAL DISASTER FUELS POLLUTION"
#[5] "PRESS RELEASE STOCK OPTION PLAN EMPLOYEES"
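As a side note, the same row-wise strings can probably be built more directly with paste. The sketch below compares both routes on the first three rows of the question's SUBJECT columns; the data frame name subj is just a placeholder for this example.

```r
# First three rows of the SUBJECT columns from the question
subj <- data.frame(
  SUBJECT.1 = c("LITIGATION", "LAYOFF", "POLLUTION"),
  SUBJECT.2 = c("POLLUTION", "EMPLOYEES", "NUCLEAR"),
  SUBJECT.3 = c("ENVIRONMENT", "JOB REDUCTIONS", "POWER PLANTS"),
  stringsAsFactors = FALSE
)

# Route used in the answer: build a factor of combinations, then convert back
via_interaction <- as.character(interaction(subj, sep = " "))

# Alternative: paste the columns directly (sep defaults to a single space)
via_paste <- do.call(paste, subj)

via_paste
identical(via_interaction, via_paste)
```

Pasting directly avoids creating factor levels for every possible combination, which may matter on larger data.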
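I searched for the patterns in the combined rows using grep. I used stack to get the row index together with the object name, i.e. LITIGATION or POLLUTION. Then I merged that dataset with the row numbers. You could also use ?match. Because some rows map to more than one object, I selected the first value per row using duplicated. Finally, I changed the ind column from factor to character and set the NA values to "OTHER".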
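One caveat: because the approach above searches the pasted row as a whole, a row that matches more than one group is labelled according to the order of the groups rather than the leftmost SUBJECT column. Row 14 (POLLUTION in SUBJECT.1, LITIGATION in SUBJECT.2) is such a case and may come out as LITIGATION instead of the POLLUTION the question expects. The sketch below is one possible way to follow the left-to-right rule literally. The helper names classify_one and classify_row are made up for this example, and it assumes that within a single cell the first pattern listed wins.

```r
# Word parts per subject, copied from the question
patterns <- c(
  LITIGATION = "LITIGAT|CLAIM|SUIT|JUDG",
  POLLUTION  = "POLLUT|WAST|EMISSION",
  LAYOFF     = "LAYOFF"
)

# Classify one subject string; NA when no word part matches
classify_one <- function(x) {
  hit <- names(patterns)[vapply(patterns, grepl, logical(1), x = x)]
  if (length(hit)) hit[1] else NA_character_
}

# Walk the cells of one row from left to right and keep the first match
classify_row <- function(...) {
  labels <- vapply(c(...), classify_one, character(1))
  labels <- labels[!is.na(labels)]
  if (length(labels)) labels[[1]] else "OTHER"
}

# Rows 1, 4, 5 and 14 of the question's SUBJECT columns
subj <- data.frame(
  SUBJECT.1 = c("LITIGATION", "CHEMICAL DISASTER", "PRESS RELEASE", "POLLUTION"),
  SUBJECT.2 = c("POLLUTION", "FUELS", "STOCK OPTION PLAN", "LITIGATION"),
  SUBJECT.3 = c("ENVIRONMENT", "POLLUTION", "EMPLOYEES", "SCANDALS"),
  stringsAsFactors = FALSE
)

subj$SUBJECT <- unname(mapply(classify_row,
                              subj$SUBJECT.1, subj$SUBJECT.2, subj$SUBJECT.3))
subj$SUBJECT
# [1] "LITIGATION" "POLLUTION"  "OTHER"      "POLLUTION"
```

On the full data the same call would be applied to mydf$SUBJECT.1, mydf$SUBJECT.2 and mydf$SUBJECT.3.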