I have a data frame that looks like this:
SentenceID IA_ID label dt indx IA_TYPE count
1 1 This 271 1 non_target 4
1 2 is 98 2 non_target 2
1 3 an 159 3 non_target 2
1 4 example 319 4 non_target 7
1 5 of 284 5 non_target 2
1 6 a 235 6 non_target 1
1 7 data 218 7 non_target 4
1 8 file. 303 8 non_target 5
1 9 The 173 9 non_target 3
1 10 goal 387 10 target 4
1 11 is 155 11 non_target 2
1 12 to 278 12 non_target 2
1 13 extract 97 13 non_target 7
1 14 content 248 14 non_target 7
1 15 from 273 15 non_target 4
1 16 specific 225 16 non_target 8
1 17 cells 119 17 non_target 5
1 18 in 207 18 non_target 2
1 19 this 199 19 non_target 4
1 20 column. 93 20 non_target 7
2 1 The 206 21 non_target 3
2 2 cells 195 22 non_target 5
2 3 to 220 23 non_target 2
2 4 be 247 24 non_target 2
2 5 extracted 368 25 target 9
2 6 for 213 26 non_target 3
2 7 each 215 27 non_target 4
2 8 sentence 386 28 non_target 8
2 9 are 186 29 non_target 3
2 10 identified 137 30 non_target 10
2 11 by 154 31 non_target 2
2 12 an 101 32 non_target 2
2 13 ID 197 33 non_target 2
2 14 number 297 34 non_target 6
2 15 in 344 35 non_target 2
2 16 the 333 36 non_target 3
2 17 second 386 37 non_target 6
2 18 column. 346 38 non_target 7
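and so on, with the value of "SentenceID" (the first column) increasing every few rows whenever a new sentence begins. I am able to get the number of characters in each word (i.e., in each cell of the "label" column), as well as the total number of characters in each sentence: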
data$count <- with(data, nchar(as.character(label)))
sentence.count <- (sqldf("SELECT SentenceID, sum(count) as sentChar FROM data GROUP BY SentenceID"))
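However, that sentence.count does not include the spaces, which I need. Basically, I need to add "n-1" to it, where "n" is the total number of words in a sentence, or the total number of rows for each SentenceID (minus 1 because there is no space to count after the last word). But I can't seem to figure out the syntax for it. All the options I have found seem to work only when dealing with a single string (i.e., if all the words in "label" were joined with spaces), rather than a series of strings in successive cells of a data frame column. Any ideas?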
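Answer 0 (score: 2)
where "n" is the total number of words in a sentence, or the total number of rows for each SentenceID
Shouldn't a small modification of your SQL call do it? For example, this returns the row count per sentence alongside the character sum:
sentence.count <- sqldf("SELECT SentenceID, count(count), sum(count) as sentChar
FROM data GROUP BY SentenceID")
Or even, with the spaces added directly: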
sentence.count <- sqldf("SELECT SentenceID, sum(count)+count(count)-1 as sentChar
FROM data GROUP BY SentenceID")
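The arithmetic behind that query (word characters plus n-1 spaces) can be checked without sqldf. The following is a minimal base R sketch, assuming a small made-up data frame with the same `SentenceID` and `label` columns; the names `toy`, `sent_char` and `joined` are hypothetical and used only for illustration.

```r
# Toy data: two short sentences, one word per row (made-up values).
toy <- data.frame(
  SentenceID = c(1, 1, 1, 2, 2),
  label = c("This", "is", "an", "The", "cells"),
  stringsAsFactors = FALSE
)

# Characters per word, as in the question.
toy$count <- nchar(as.character(toy$label))

# Per sentence: sum of word characters plus (n - 1) separating spaces.
sent_char <- tapply(toy$count, toy$SentenceID,
                    function(x) sum(x) + length(x) - 1)

# Cross-check: join the words with single spaces and measure the string.
joined <- tapply(toy$label, toy$SentenceID,
                 function(w) nchar(paste(w, collapse = " ")))

# Both approaches should agree for every sentence.
stopifnot(all(sent_char == joined))
print(sent_char)
```

Here "This is an" has 4 + 2 + 2 word characters plus 2 spaces, and the joined string has the same length, which is what the `sum(count)+count(count)-1` expression computes per group.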
Answer 1 (score: 1)
Using data.table:
{{1}}
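A data.table version could look roughly like the sketch below. This is not necessarily the answerer's original code; it assumes the data.table package is installed and uses a made-up table `dt_toy` with the same `SentenceID` and `label` columns as the question.

```r
library(data.table)

# Toy data: two short sentences, one word per row (made-up values).
dt_toy <- data.table(
  SentenceID = c(1, 1, 1, 2, 2),
  label = c("This", "is", "an", "The", "cells")
)

# One row per sentence: word characters plus (.N - 1) spaces,
# where .N is the number of rows (words) in the current group.
dt_count <- dt_toy[, .(sentChar = sum(nchar(as.character(label))) + .N - 1),
                   by = SentenceID]
print(dt_count)
```

Grouping with `by = SentenceID` plays the same role as `GROUP BY SentenceID` in the SQL version.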
Answer 2 (score: 0)
We can also use dplyr:
library(dplyr)
data %>%
  group_by(SentenceID) %>%
  # as.character() guards against label being a factor, as in the question
  mutate(sentence.count = sum(nchar(as.character(label))) + n() - 1)
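Note that `mutate()` keeps every word row and repeats the sentence total on each of them. If one row per sentence is wanted, as in the sqldf result, `summarise()` can be used instead. A minimal sketch, assuming dplyr is installed and using a made-up data frame `words_toy`:

```r
library(dplyr)

# Toy data: two short sentences, one word per row (made-up values).
words_toy <- data.frame(
  SentenceID = c(1, 1, 1, 2, 2),
  label = c("This", "is", "an", "The", "cells"),
  stringsAsFactors = FALSE
)

# Collapse to one row per sentence: word characters plus (n() - 1) spaces.
per_sentence <- words_toy %>%
  group_by(SentenceID) %>%
  summarise(sentChar = sum(nchar(as.character(label))) + n() - 1)

print(per_sentence)
```

Which form is preferable depends on whether the total is needed next to each word (for example, to compute a word's share of the sentence) or only as a per-sentence summary.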