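I hope someone can help me, because my current approach using grepl isn't getting me anywhere.
I have several categories (stored as character). I now want to build variables that take different values for the different categories.
The data looks like this: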
category
Candidate Biography
Candidate Biography
Candidate Biography
Candidate Biography, Campaign Finance
Justice, Candidate Biography, Economy
Candidate Biography, Jobs
Economy, Education, Candidate Biography
Economy, Civil Rights, Candidate Biography
Now I want to create new variables that take different values depending on the category, as shown below:
category                                    CandBio  Economy  CivilRights  Family
Candidate Biography                         1        0        0            0
Candidate Biography                         1        0        0            0
Candidate Biography                         1        0        0            0
Candidate Biography, Campaign Finance       0.5      0.5      0            0
Justice, Candidate Biography, Economy       0.33     0.33     0.33         0
Candidate Biography, Jobs                   0.5      0.5      0            0
Economy, Education, Candidate Biography     0.33     0.33     0            0.33
Economy, Civil Rights, Candidate Biography  0.33     0.33     0.33         0
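Each category has a specific weight for each variable (and can load on several variables). For example, "Candidate Biography, Campaign Finance" loads 0.5 on CandBio and 0.5 on Economy. The same categories recur across many observations in the dataset. (In the real data there are 49k observations with 120 distinct categories that need to be collapsed into 10 variables, such as CandBio, Economy and CivilRights from the example.)
I first tried combining ifelse and grepl, but I realised that grepl is very sensitive to order and, depending on how I construct the ifelse chain, it misclassifies categories. I also tried collecting all category strings that share the same weight into a vector and passing that vector to grepl, but that did not work either.
So I am looking for a solution that lets me assign weights to the variables based on the category text.
I hope I have described my problem clearly. Any help would be much appreciated. Thanks a lot in advance!
Edit: this is what I have tried so far, without success: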
clintontvad$CandidateBiography <- ifelse(grepl("Candidate Biography", clintontvad$subjects), 1,
  ifelse(grepl("Candidate Biography, Marriage, Gays and Lesbians, Civil Rights, Immigration, Trade, Energy, Workers", clintontvad$subjects), 0.125,
  ifelse(grepl("Candidate Biography, Terrorism, Islam, Foreign Policy, Nuclear, Iran", clintontvad$subjects), 0.17,
  ifelse(grepl("Children, Candidate Biography, Families, Education, Debt, Economy, Jobs", clintontvad$subjects), 0.17,
  ifelse(grepl("Candidate Biography, Children, Education, Health Care, Women", clintontvad$subjects), 0.2,
  ifelse(grepl("Candidate Biography, Civil Rights, Islam, Gays and Lesbians, Women", clintontvad$subjects), 0.2,
  ifelse(grepl("Candidate Biography, Economy, Election, Children, Families", clintontvad$subjects), 0.2,
  ifelse(grepl("Children, Education, Women, Economy, Families", clintontvad$subjects), 0.2,
  ifelse(grepl("Job Accomplishments, Abortion, Women, Health Care, Climate Change, Marriage", clintontvad$subjects), 0.2,
  ifelse(grepl("Women, Civil Rights, Gays and Lesbians, Foreign Policy, Canddate Biography", clintontvad$subjects), 0.25,
  ifelse(grepl("Poverty, Health Care, Candidate Biography, Terrorism", clintontvad$subjects), 0.25,
  ifelse(grepl("Job Accomplishments, Foreign Policy, Health Care, Children", clintontvad$subjects), 0.25,
  ifelse(grepl("Foreign Policy, Terrorism, Candidate Biography", clintontvad$subjects), 0.25,
  ifelse(grepl("Ethics, Terrorism, Candidate Biography", clintontvad$subjects), 0.25, 0))))))))))))))
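A small sketch of why a chain like this tends to return 1 for every row that mentions Candidate Biography: grepl does substring matching, and a nested ifelse stops at the first condition that is TRUE. The vector `subjects` below is made-up example data standing in for `clintontvad$subjects`, so treat it as an illustration only.

```r
# Made-up example data (assumption; stands in for clintontvad$subjects)
subjects <- c("Candidate Biography",
              "Candidate Biography, Jobs",
              "Ethics, Terrorism, Candidate Biography")

# The short pattern is a substring of every longer label, so it matches all rows
first_test <- grepl("Candidate Biography", subjects, fixed = TRUE)

# Nested ifelse returns the value belonging to the first TRUE condition,
# so the more specific second test is never reached for these rows
weight <- ifelse(first_test, 1,
                 ifelse(grepl("Ethics, Terrorism, Candidate Biography", subjects, fixed = TRUE),
                        0.25, 0))

print(first_test)
print(weight)
```

Reordering the tests from most to least specific would avoid this particular problem, but it still needs one hand-written pattern per combination.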
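Answer 0 (score: 1)
If I understand your example correctly, the weights of the new variables depend on the number of categories in each row. In that case you can use a two-step approach: first create the new variables, then divide by the number of matched categories.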
d <- data.frame(category = c("Candidate Biography", "Candidate Biography", "Candidate Biography",
"Candidate Biography, Campaign Finance",
"Justice, Candidate Biography, Economy", "Candidate Biography, Jobs",
"Economy, Education, Candidate Biography",
"Economy, Civil Rights, Candidate Biography"))
# create a list with all your new variables and their respective categories
categories <- list(
CandBio = c("Candidate Biography"),
Economy = c("Campaign Finance", "Economy", "Jobs"),
CivilRights = c("Justice", "Civil Rights"),
Family = c("Education")
)
# create the new variables
for (i in seq_along(categories)) {
d[names(categories)[i]] <- grepl(paste0(categories[[i]], collapse = "|"), d[, "category"])
}
# divide by number of matched categories
d[, -1] <- d[, -1]/rowSums(d[, -1])
d
                                    category   CandBio   Economy CivilRights    Family
1                        Candidate Biography 1.0000000 0.0000000   0.0000000 0.0000000
2                        Candidate Biography 1.0000000 0.0000000   0.0000000 0.0000000
3                        Candidate Biography 1.0000000 0.0000000   0.0000000 0.0000000
4      Candidate Biography, Campaign Finance 0.5000000 0.5000000   0.0000000 0.0000000
5      Justice, Candidate Biography, Economy 0.3333333 0.3333333   0.3333333 0.0000000
6                  Candidate Biography, Jobs 0.5000000 0.5000000   0.0000000 0.0000000
7    Economy, Education, Candidate Biography 0.3333333 0.3333333   0.0000000 0.3333333
8 Economy, Civil Rights, Candidate Biography 0.3333333 0.3333333   0.3333333 0.0000000
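One caveat with the regex approach above: grepl matches substrings, so a label such as "Economy" would also match inside a longer, different label, and a row with no matching label gives 0/0 = NaN. A minimal sketch of an exact-match variant follows, assuming the labels are separated by commas. The `lookup` vector and the small data frame are hypothetical and only illustrate the idea.

```r
# Small made-up data frame in the same shape as the question's data
d <- data.frame(category = c("Candidate Biography",
                             "Candidate Biography, Campaign Finance",
                             "Justice, Candidate Biography, Economy",
                             "Economy, Jobs"),
                stringsAsFactors = FALSE)

# Hypothetical lookup: raw category label -> target variable
lookup <- c("Candidate Biography" = "CandBio",
            "Campaign Finance"    = "Economy",
            "Economy"             = "Economy",
            "Jobs"                = "Economy",
            "Justice"             = "CivilRights",
            "Civil Rights"        = "CivilRights",
            "Education"           = "Family")
vars <- unique(unname(lookup))

# Split each row into its individual labels and trim surrounding whitespace
labels <- lapply(strsplit(d$category, ",", fixed = TRUE), trimws)

# For each row: which target variables occur at least once (exact label match)
hits <- t(vapply(labels,
                 function(l) vars %in% lookup[l],
                 logical(length(vars))))
colnames(hits) <- vars

# Divide by the number of matched variables; rows without any match stay 0
n <- rowSums(hits)
weights <- hits / ifelse(n == 0, 1, n)
d <- cbind(d, weights)
print(d)
```

With 120 distinct labels, the lookup could also be kept in a two-column table and read from a file, which may be easier to maintain than patterns written into the code.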
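Answer 1 (score: 0)
If I understand correctly, here is one approach. You need a matching vector for each category, and you need to pay close attention to case and to any special characters. But this should get you started. Let me know if you have any questions. Also, in hindsight, I called too many things "category", but you should get the idea: category1, category2 and category3 refer to whatever makes up your broader groups (e.g. Economy and CivilRights). Finally, if this is slow, using functions from stringi instead of grepl will probably be a lot faster. If this base solution is too slow, I can post an edit.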
# Example dataframe
df <- data.frame(category = c("cat 1a",
"cat 1a",
"cat 1a",
"cat 1a, cat 2a",
"cat 3a, cat 1a, cat 2b",
"cat 1a, cat 2c"),
stringsAsFactors = FALSE)
# Create a list with strings split based on the comma
string_list <- strsplit(df$category, split = ",", fixed = TRUE)
# Pre defined categories
category1 <- c("cat 1a", "cat 1b", "cat 1c")
category2 <- c("cat 2a", "cat 2b", "cat 2c")
category3 <- c("cat 3a", "cat 3b", "cat 3c")
# Create new columns based on your categories
df$Category_1 <- sapply(1:length(string_list) , function (x) any(grepl(paste(category1, collapse = "|"), unlist(string_list[x]))) /
length(unlist(string_list[x])))
df$Category_2 <- sapply(1:length(string_list) , function (x) any(grepl(paste(category2, collapse = "|"), unlist(string_list[x]))) /
length(unlist(string_list[x])))
df$Category_3 <- sapply(1:length(string_list) , function (x) any(grepl(paste(category3, collapse = "|"), unlist(string_list[x]))) /
length(unlist(string_list[x])))
df
                category Category_1 Category_2 Category_3
1                 cat 1a  1.0000000  0.0000000  0.0000000
2                 cat 1a  1.0000000  0.0000000  0.0000000
3                 cat 1a  1.0000000  0.0000000  0.0000000
4         cat 1a, cat 2a  0.5000000  0.5000000  0.0000000
5 cat 3a, cat 1a, cat 2b  0.3333333  0.3333333  0.3333333
6         cat 1a, cat 2c  0.5000000  0.5000000  0.0000000
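The three near-identical sapply calls above could be folded into one loop over a named list, which may scale better to ten groups. This is only a sketch: the list `groups` and the helper `group_weight` are hypothetical names, and the data frame is a shortened copy of the example. It keeps the behaviour of dividing by the number of items in the row, which differs from dividing by the number of matched groups whenever two items in a row belong to the same group.

```r
# Shortened example data
df <- data.frame(category = c("cat 1a",
                              "cat 1a, cat 2a",
                              "cat 3a, cat 1a, cat 2b"),
                 stringsAsFactors = FALSE)

# Hypothetical named list of groups; the names become the new column names
groups <- list(Category_1 = c("cat 1a", "cat 1b", "cat 1c"),
               Category_2 = c("cat 2a", "cat 2b", "cat 2c"),
               Category_3 = c("cat 3a", "cat 3b", "cat 3c"))

# Split each row on the comma
string_list <- strsplit(df$category, split = ",", fixed = TRUE)

# Hypothetical helper: 1 / (number of items in the row) if any item
# matches one of the group's members, otherwise 0
group_weight <- function(items, members) {
  pattern <- paste(members, collapse = "|")
  any(grepl(pattern, items)) / length(items)
}

# One new column per group
for (g in names(groups)) {
  df[[g]] <- vapply(string_list, group_weight, numeric(1), members = groups[[g]])
}
print(df)
```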
Edit: using the data kindly provided by @Gilean0709 (and stringi, to make it faster), here is an update:
# Example dataframe
df <- data.frame(category = c("Candidate Biography", "Candidate Biography", "Candidate Biography",
"Candidate Biography, Campaign Finance",
"Justice, Candidate Biography, Economy", "Candidate Biography, Jobs",
"Economy, Education, Candidate Biography",
"Economy, Civil Rights, Candidate Biography"), stringsAsFactors = FALSE)
# Create a list with strings split based on the comma
string_list <- strsplit(df$category, split = ",", fixed = TRUE)
library(stringi)
# Pre defined categories
CandBio <- paste(c("Candidate Biography"), collapse = "|")
Economy <- paste(c("Campaign Finance", "Economy", "Jobs"), collapse = "|")
CivilRights <- paste(c("Justice", "Civil Rights"), collapse = "|")
Family <- paste(c("Education"), collapse = "|")
# Create new columns based on your categories
df$CandBio <- sapply(1:length(string_list), function (x) any(stri_detect_regex(unlist(string_list[x]), CandBio)) /
length(unlist(string_list[x])))
df$Economy <- sapply(1:length(string_list), function (x) any(stri_detect_regex(unlist(string_list[x]), Economy)) /
length(unlist(string_list[x])))
df$CivilRights <- sapply(1:length(string_list), function (x) any(stri_detect_regex(unlist(string_list[x]), CivilRights)) /
length(unlist(string_list[x])))
df$Family <- sapply(1:length(string_list), function (x) any(stri_detect_regex(unlist(string_list[x]), Family)) /
length(unlist(string_list[x])))
library(dplyr)  # provides %>% and mutate_if
df %>%
mutate_if(is.numeric, round, digits = 2)
                                    category CandBio Economy CivilRights Family
1                        Candidate Biography    1.00    0.00        0.00   0.00
2                        Candidate Biography    1.00    0.00        0.00   0.00
3                        Candidate Biography    1.00    0.00        0.00   0.00
4      Candidate Biography, Campaign Finance    0.50    0.50        0.00   0.00
5      Justice, Candidate Biography, Economy    0.33    0.33        0.33   0.00
6                  Candidate Biography, Jobs    0.50    0.50        0.00   0.00
7    Economy, Education, Candidate Biography    0.33    0.33        0.00   0.33
8 Economy, Civil Rights, Candidate Biography    0.33    0.33        0.33   0.00