R - Count the characters of strings in different cells, but including spaces

Asked: 2016-08-31 14:41:26

Tags: r string dataframe char

I have a data frame that looks like this:

SentenceID IA_ID      label  dt indx    IA_TYPE count
1     1       This 271    1 non_target     4
1     2         is  98    2 non_target     2
1     3         an 159    3 non_target     2
1     4    example 319    4 non_target     7
1     5         of 284    5 non_target     2
1     6          a 235    6 non_target     1
1     7       data 218    7 non_target     4
1     8      file. 303    8 non_target     5
1     9        The 173    9 non_target     3
1    10       goal 387   10     target     4
1    11         is 155   11 non_target     2
1    12         to 278   12 non_target     2
1    13    extract  97   13 non_target     7
1    14    content 248   14 non_target     7
1    15       from 273   15 non_target     4
1    16   specific 225   16 non_target     8
1    17      cells 119   17 non_target     5
1    18         in 207   18 non_target     2
1    19       this 199   19 non_target     4
1    20    column.  93   20 non_target     7
2     1        The 206   21 non_target     3
2     2      cells 195   22 non_target     5
2     3         to 220   23 non_target     2
2     4         be 247   24 non_target     2
2     5  extracted 368   25     target     9
2     6        for 213   26 non_target     3
2     7       each 215   27 non_target     4
2     8   sentence 386   28 non_target     8
2     9        are 186   29 non_target     3
2    10 identified 137   30 non_target    10
2    11         by 154   31 non_target     2
2    12         an 101   32 non_target     2
2    13         ID 197   33 non_target     2
2    14     number 297   34 non_target     6
2    15         in 344   35 non_target     2
2    16        the 333   36 non_target     3
2    17     second 386   37 non_target     6
2    18    column. 346   38 non_target     7

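And so on, with the value of "SentenceID" (the first column) increasing every few rows, whenever a new sentence starts. I was able to get the number of characters in each word (i.e. each cell in the "label" column) as well as the total number of characters in each sentence: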

library(sqldf)
# characters per word, then summed per sentence (spaces not yet included)
data$count <- with(data, nchar(as.character(label)))
sentence.count <- sqldf("SELECT SentenceID, sum(count) as sentChar FROM data GROUP BY SentenceID")

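However, that sentence.count does not include the spaces, which I need. Basically, I need to add "n-1" to it, where "n" is the total number of words in a sentence, or the total number of rows per SentenceID (minus 1 because there is no space to count after the last word). But I can't seem to figure out the syntax for it. All the options I have found would work if I were dealing with a single string (i.e. if all the words in "label" were joined with spaces), rather than a series of strings in successive cells of a data frame column. Any ideas?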

3 Answers:

Answer 0: (score: 2)

  

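"where "n" is the total number of words in a sentence, or the total number of rows per SentenceID"

Shouldn't that be possible with a small modification to your SQL call, like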
 sentence.count <- sqldf("SELECT SentenceID, count(count), sum(count) as sentChar 
                          FROM data GROUP BY SentenceID")

or even

 sentence.count <- sqldf("SELECT SentenceID, sum(count)+count(count)-1 as sentChar 
                          FROM data GROUP BY SentenceID")
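
As a quick sanity check of the "n - 1" adjustment, here is a minimal base R sketch on made-up data. The data frame `toy` and the variable names are hypothetical and not part of the question. It compares "characters plus words minus one" against the length of each sentence after its words are actually pasted together with single spaces.

```r
# Hypothetical toy data: two short sentences, one word per row
toy <- data.frame(
  SentenceID = c(1, 1, 1, 2, 2),
  label = c("This", "is", "an", "The", "cells"),
  stringsAsFactors = FALSE
)

# Characters per sentence, without spaces
chars <- tapply(nchar(toy$label), toy$SentenceID, sum)
# Words (rows) per sentence
words <- tapply(toy$label, toy$SentenceID, length)
# One space between each pair of adjacent words: n - 1 spaces
with_spaces <- chars + words - 1

# Cross-check: join the words with spaces and measure the result
joined <- tapply(toy$label, toy$SentenceID, paste, collapse = " ")
stopifnot(all(with_spaces == nchar(joined)))
```

Both routes should agree as long as every word is separated by exactly one space.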

Answer 1: (score: 1)

Using data.table

{{1}}
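
For illustration only, one way the same per-sentence calculation might be written with data.table is sketched below. This is not necessarily the code the answer's author posted; the table `toy` and the column name `sentChar` are assumptions chosen to mirror the question. It relies on data.table's `.N`, which holds the number of rows in each group.

```r
library(data.table)

# Hypothetical toy data: two short sentences, one word per row
toy <- data.table(
  SentenceID = c(1, 1, 1, 2, 2),
  label = c("This", "is", "an", "The", "cells")
)

# Per sentence: total characters plus (number of words - 1) spaces
sentence.count <- toy[
  ,
  .(sentChar = sum(nchar(as.character(label))) + .N - 1L),
  by = SentenceID
]
```

This returns one row per SentenceID, in the same shape as the sqldf result in the question.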

Answer 2: (score: 0)

We can also use dplyr

library(dplyr)
data %>%
    group_by(SentenceID) %>%
    # as.character() guards against label being stored as a factor
    mutate(sentence.count = sum(nchar(as.character(label))) + n() - 1)