I am trying to count term occurrences in a factor of strings.
I have a factor called Names. A sample:
[1] GP - Hyperion Planning Upgrade
[2] Application Support Renewal
[3] Oracle EBS upgrade 11.5 to 12.1
[4] Bluemix Services
[5] Cognos 11 Upgrade
I also have a vector of dictionary terms called terms:
[1] "IBM" "Oracle" "SQL Server" "Cognos" "Azure"
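I need R to build a data frame from the terms vector that gives, for each term, the total number of entries in the Names factor that contain it. For example: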
        term count
1        IBM     3
2     Oracle     6
3 SQL Server     0
4     Cognos     2
5      Azure     9
Note: a term may appear more than once within a single name. That still counts as one.
Could anyone point me to an example of how to do this? Thanks.
Answer 0 (score: 1)
You can try the following (with the vector Names changed slightly, and assuming you want case-insensitive matching):
# input
Names <- as.character(Names)
Names
#[1] "IBM GP - Hyperion IBM Planning Upgrade IBM"
#[2] "Application Support Renewal"
#[3] "Oracle EBS upgrade 11.5 to 12.1"
#[4] "Bluemix Services IBM"
#[5] "Cognos 11 Upgrade"
terms <- c("IBM", "Oracle", "SQL Server", "Cognos", "Azure")
vgrepl <- Vectorize(grepl, 'pattern', SIMPLIFY = TRUE)
df <- +(vgrepl(tolower(terms), tolower(Names))) # case insensitive
df
#     ibm oracle sql server cognos azure
#[1,]   1      0          0      0     0
#[2,]   0      0          0      0     0
#[3,]   0      1          0      0     0
#[4,]   1      0          0      0     0
#[5,]   0      0          0      1     0
colSums(df)
#       ibm     oracle sql server     cognos      azure
#         2          1          0          1          0
data.frame(count=colSums(df))
#           count
#ibm            2
#oracle         1
#sql server     0
#cognos         1
#azure          0
df <- data.frame(count=colSums(df))
df <- cbind.data.frame(terms=rownames(df), df)
df
#                terms count
#ibm               ibm     2
#oracle         oracle     1
#sql server sql server     0
#cognos         cognos     1
#azure           azure     0
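The same idea can be written more compactly. The following is only a sketch: count_terms is a hypothetical helper name, and fixed = TRUE is an assumption that the terms should be treated as literal substrings rather than regular expressions. Because grepl returns one TRUE/FALSE per name, a term that is repeated inside a single name is still counted once.

```r
# Sample data, modified as in the answer above
Names <- factor(c(
  "IBM GP - Hyperion IBM Planning Upgrade IBM",
  "Application Support Renewal",
  "Oracle EBS upgrade 11.5 to 12.1",
  "Bluemix Services IBM",
  "Cognos 11 Upgrade"))
terms <- c("IBM", "Oracle", "SQL Server", "Cognos", "Azure")

# count_terms is a hypothetical helper, not part of any package
count_terms <- function(terms, x) {
  x <- tolower(as.character(x))          # case-insensitive comparison
  counts <- vapply(
    tolower(terms),
    # number of names that contain the term at least once
    function(term) sum(grepl(term, x, fixed = TRUE)),
    integer(1))
  # keep the original spelling of the terms in the result
  data.frame(term = terms, count = unname(counts), stringsAsFactors = FALSE)
}

result <- count_terms(terms, Names)
result
```

This keeps the original capitalization in the term column and avoids the extra row names, at the cost of giving up regular-expression patterns.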
Answer 1 (score: 1)
Here is an example that avoids regular expressions in favor of match:
names <- c(
"GP - Hyperion Planning Upgrade",
"Application Support Renewal",
"Oracle EBS upgrade 11.5 to 12.1",
"Bluemix Services",
"Cognos 11 Upgrade")
terms <- tolower(c("IBM", "Oracle", "SQL Server", "Cognos", "Azure"))
## Split your names on whitespace and match each token to the terms
counts <- lapply(strsplit(tolower(names), "\\s+"), match, terms, 0)
## index the terms using the match indices and table it
table(terms[unlist(counts)])
# cognos oracle
#      1      1
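Two caveats with the token approach: table() lists only the terms that matched at least once, and splitting on whitespace means a multi-word term such as "SQL Server" can never equal a single token. A sketch that keeps the zero counts and counts each term at most once per name, assuming single-token matching is acceptable (project_names, lower_terms, and result are illustrative names):

```r
project_names <- c(
  "GP - Hyperion Planning Upgrade",
  "Application Support Renewal",
  "Oracle EBS upgrade 11.5 to 12.1",
  "Bluemix Services",
  "Cognos 11 Upgrade")
terms <- c("IBM", "Oracle", "SQL Server", "Cognos", "Azure")
lower_terms <- tolower(terms)

# Split each name into lower-case tokens on whitespace
tokens <- strsplit(tolower(project_names), "\\s+")

# For each name, keep every term that appears among its tokens;
# %in% yields one TRUE/FALSE per term, so repeats within a name count once
hits <- unlist(lapply(tokens, function(tok) lower_terms[lower_terms %in% tok]))

# A factor with all terms as levels makes table() report zero counts too
counts <- table(factor(hits, levels = lower_terms))

result <- data.frame(term = terms, count = as.vector(counts),
                     stringsAsFactors = FALSE)
result
```

If multi-word terms matter, a substring match per term (for example with grepl) is the safer choice.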