Question

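I'm trying to do some text processing and need to recode the words of sentences so that the target word is identified in a specific way in a new variable. For example, given a data frame that looks like this...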

subj <- c("1", "1", "1", "2", "2", "2", "2", "2")
condition <- c("A", "A", "A", "B", "B", "B", "B", "B")
sentence <- c("1", "1", "1", "2", "2", "2", "2", "2")
word <- c("I", "like", "dogs.", "We", "don't", "like", "this", "song.")
d <- data.frame(subj,condition, sentence, word)

 subj condition sentence  word
 1         A        1     I
 1         A        1     like
 1         A        1     dogs.
 2         B        2     We
 2         B        2     don't
 2         B        2     like
 2         B        2     this
 2         B        2     song.

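I need to create a new column in which every instance of the target word (in this case, where d$word == "like") is marked 0, all words before "like" in the sentence block count down, and all words after "like" count up. Each subject has multiple sentences, and the sentences vary by condition, so the loop needs to account for the instance of the target word in each sentence for each subject. The end result should look something like this.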

 subj condition sentence  word   position
 1         A        1     I        -1
 1         A        1     like      0
 1         A        1     dogs.     1
 2         B        2     We       -2
 2         B        2     don't    -1
 2         B        2     like      0
 2         B        2     this      1
 2         B        2     song.     2

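Sorry if the question is poorly worded, I hope it makes sense! Note that the target is not in the same position (relative to the start of the sentence) in every sentence. I'm pretty new to R and can figure out how to increment or decrement, but not how to do both within each sentence block. Any suggestions for the best way to go about this? Many thanks!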

Answer 1

You can add an index and then use it to compute relative positions. With data.table it is easy to break this down by sentence.

library(data.table)
DT <- data.table(indx=1:nrow(d), d, key="indx")

DT[, position:=(indx - indx[word=="like"]), by=sentence]

# Results
DT
#    indx subj condition sentence  word position
# 1:    1    1         A        1     I       -1
# 2:    2    1         A        1  like        0
# 3:    3    1         A        1 dogs.        1
# 4:    4    2         B        2    We       -2
# 5:    5    2         B        2 don't       -1
# 6:    6    2         B        2  like        0
# 7:    7    2         B        2  this        1
# 8:    8    2         B        2 song.        2

UPDATE:

If the grammar in your text is not clean, you may want to use grepl instead of ==

DT[, position:=(indx - indx[grepl("like", word)]), by=sentence]
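One thing to watch with grepl("like", word): it is a substring match, so it would also flag tokens such as "likes" or "dislike". Below is a minimal base R sketch of the difference. The word list and the stricter pattern are assumptions for illustration, not part of the original answer.

```r
# Hypothetical tokens to compare the two matching strategies
words <- c("like", "like.", "likes", "dislike", "Like")

# Plain substring match, as in the update above (case-sensitive)
loose <- grepl("like", words)
# TRUE TRUE TRUE TRUE FALSE

# Assumed stricter pattern: the whole token must be "like",
# optionally followed by punctuation, ignoring case
strict <- grepl("^like[[:punct:]]*$", words, ignore.case = TRUE)
# TRUE TRUE FALSE FALSE TRUE
```

If this fits your data, the stricter pattern could be dropped into the grepl call in place of "like".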

Answer 2

I think that in text processing it is wise to avoid letting text entries become factors. In this case I used as.character, but I would recommend setting options(stringsAsFactors=FALSE);

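# ave() writes the results back into its character input, so convert the offsets to integer
d$position <- with( d, as.integer( ave(as.character(word), sentence, 
                               FUN=function(x) seq_along(x) - which(x=="like") ) ) )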
> d
  subj condition sentence  word position
1    1         A        1     I       -1
2    1         A        1  like        0
3    1         A        1 dogs.        1
4    2         B        2    We       -2
5    2         B        2 don't       -1
6    2         B        2  like        0
7    2         B        2  this        1
8    2         B        2 song.        2
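The one-liner above assumes every sentence contains "like" exactly once. If a sentence has no target word, which() returns an empty vector and the subtraction no longer yields one offset per word. A minimal sketch for that case follows. The helper name relative_position and the third sentence are hypothetical, and returning NA is just one possible convention.

```r
# Toy data: the question's two sentences plus a hypothetical third one without "like"
d <- data.frame(
  sentence = c("1", "1", "1", "2", "2", "2", "2", "2", "3", "3"),
  word = c("I", "like", "dogs.", "We", "don't", "like", "this", "song.",
           "Hello", "there."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: offsets relative to the first occurrence of `target`,
# or NA for every word when the target is absent
relative_position <- function(words, target) {
  hit <- which(words == target)
  if (length(hit) == 0) return(rep(NA_integer_, length(words)))
  seq_along(words) - hit[1]
}

# Group the row indices (not the words) by sentence, so ave() returns integers
d$position <- ave(seq_len(nrow(d)), d$sentence,
                  FUN = function(i) relative_position(d$word[i], "like"))
d$position
# -1  0  1 -2 -1  0  1  2 NA NA
```

Passing row indices to ave() instead of the words keeps the new column numeric, which is handy if you later want to filter or plot by position.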

Answer 3

The customary solution using plyr:

 library(plyr)
 ddply(d, .(subj, condition, sentence), transform, 
   position = seq_along(word) - which(word == 'like'))
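Grouping by subj as well as sentence matters if sentence numbers restart for each subject, which the question hints at ("each subject has multiple sentences"). For anyone who wants the same grouping without extra packages, here is a rough base R sketch using ave() with two grouping variables. The data below is a hypothetical variant of the question's example in which both subjects have a sentence "1".

```r
# Hypothetical variant: sentence ids restart at "1" for each subject
d <- data.frame(
  subj = c("1", "1", "1", "2", "2", "2", "2", "2"),
  sentence = c("1", "1", "1", "1", "1", "1", "1", "1"),
  word = c("I", "like", "dogs.", "We", "don't", "like", "this", "song."),
  stringsAsFactors = FALSE
)

# Group row indices by subject AND sentence, then subtract the target's
# within-group index from each word's within-group index
d$position <- ave(seq_len(nrow(d)), d$subj, d$sentence,
                  FUN = function(i) seq_along(i) - which(d$word[i] == "like"))
d$position
# -1  0  1 -2 -1  0  1  2
```

Grouping by sentence alone would lump all eight words into one block here, with two "like" hits in it.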

Recoding a target variable in R
