Extract text from a variable in a data frame and create new vectors

Time: 2016-05-17 16:24:05

Tags: r replace dataframe

I have a database from a questionnaire. It contains some long, complex text strings that I also need to use as variables after my analysis.

An example of the kind of data frame I am analyzing:

cnt  <-as.factor(c("Country 1", "Country 2", "Country 3", "Country 1", "Country 2", "Country 3" ))
bnk  <-as.factor(c("bank 1", "bank 2", "bank 3", "bank 1", "bank 2", "bank 3" ))
qst  <-as.factor(c(" Q.1 - some long question?", " Q.1 - some long question?", " Q.1 - some long question?", "Q.27 <U+FFFD> another long question?","Q.27 <U+FFFD> another long question?","Q.27 <U+FFFD> another long question?" ))
ans  <-as.numeric(c(1,1,2,1,2,3))
df   <-data.frame(cnt, bnk, qst,ans)
names(df) <- c("Country", "Institute", "Question", "Answer")
head(df)

  Country Institute                             Question Answer
1 Country 1    bank 1            Q.1 - some long question?      1
2 Country 2    bank 2            Q.1 - some long question?      1
3 Country 3    bank 3            Q.1 - some long question?      2
4 Country 1    bank 1 Q.27 <U+FFFD> another long question?      1
5 Country 2    bank 2 Q.27 <U+FFFD> another long question?      2
6 Country 3    bank 3 Q.27 <U+FFFD> another long question?      3

As you can see in the variable "Question", there is a pattern regardless of what the question is: every text starts with Q.number.

For reference, there are 49 distinct questions.
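
Before extracting anything, it may be worth confirming that every question really follows the Q.number pattern. The following is only a minimal sketch: it rebuilds the example Question values from the post as a standalone factor (the name has_prefix is an illustrative assumption) and checks the levels with grepl.

```r
# Standalone copy of the example Question column from the post.
qst <- factor(c(rep(" Q.1 - some long question?", 3),
                rep("Q.27 <U+FFFD> another long question?", 3)))

# TRUE for each level that starts, after optional spaces, with "Q.<digits>".
has_prefix <- grepl("^ *Q\\.[0-9]+", levels(qst))
all(has_prefix)

# Number of distinct questions (49 in the real data, 2 in this toy example).
nlevels(qst)
```

Any level for which has_prefix is FALSE would need to be handled separately before applying a regex-based extraction.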

I would like to do a couple of things (or steps) here:

  1. First, I want to create a new vector that I can use to index the questions. For example, my data frame would look like this:

    df <- mutate(df, qs = c("q1", "q1", "q1", "q27", "q27", "q27"))

      Country Institute                             Question Answer qs
    1 Country 1    bank 1            Q.1 - some long question?      1 q1
    2 Country 2    bank 2            Q.1 - some long question?      1 q1
    3 Country 3    bank 3            Q.1 - some long question?      2 q1
    4 Country 1    bank 1 Q.27 <U+FFFD> another long question?      1 q27
    5 Country 2    bank 2 Q.27 <U+FFFD> another long question?      2 q27
    6 Country 3    bank 3 Q.27 <U+FFFD> another long question?      3 q27
    
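  2. Then I want to create a new vector like the one in step 1, but with an index that contains only the number. I also want an additional vector, treated as a factor, to use as labels: the part of each question that excludes the "Q.". For this, I think I need to search the variable "Question" and extract the relevant part.

    The final data frame should therefore look like this: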

      Country Institute                             Question Answer qs qs_inx                 labels
      1 Country 1    bank 1            Q.1 - some long question?      1 q1      1   some long question? 
      2 Country 2    bank 2            Q.1 - some long question?      1 q1      1    some long question?
      3 Country 3    bank 3            Q.1 - some long question?      2 q1      1    some long question?
      4 Country 1    bank 1 Q.27 <U+FFFD> another long question?      1 q2      2 another long question?
      5 Country 2    bank 2 Q.27 <U+FFFD> another long question?      2 q2      2 another long question?
      6 Country 3    bank 3 Q.27 <U+FFFD> another long question?      3 q2      2 another long question?
      

1 answer:

Answer 0 (score: 1)

If I understand correctly, you want two copies of df$Question, each with different labels for its levels.

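# Start from two copies of the Question factor
df$qs_inx <- df$Question
df$labels <- df$Question

# Relabel the levels: "q" plus the digits that follow "Q." (e.g. "q1", "q27")
levels(df$qs_inx) <- sub('[ ]*Q\\.([0-9]+).*', 'q\\1', levels(df$Question))
# Relabel the levels: everything after "Q." (the question number is kept)
levels(df$labels) <- sub('[ ]*Q\\.(.*)', '\\1', levels(df$Question))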
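
The answer above fills qs_inx with "q1"/"q27" and keeps the question number inside labels. As a possible variation that is closer to the three columns sketched in the question, the code below builds qs, a numeric qs_inx and a labels factor without the number. It is only a sketch under stated assumptions: the data frame is a reduced standalone copy of the example, q_chr and num are illustrative names, and the separator between the number and the question text ("-" or "<U+FFFD>") is assumed to be a single token surrounded by spaces.

```r
# Reduced standalone copy of the example data from the question.
df <- data.frame(
  Question = factor(c(rep(" Q.1 - some long question?", 3),
                      rep("Q.27 <U+FFFD> another long question?", 3))),
  Answer = c(1, 1, 2, 1, 2, 3)
)

q_chr <- as.character(df$Question)

# Capture the digits that directly follow "Q.".
num <- sub("^ *Q\\.([0-9]+).*$", "\\1", q_chr)

df$qs     <- paste0("q", num)   # "q1", "q27"
df$qs_inx <- as.integer(num)    # 1, 27 (the original question number)

# Assumption: after the number comes one space-delimited separator token.
# Strip "Q.<number>", that token, and the surrounding spaces.
df$labels <- factor(sub("^ *Q\\.[0-9]+ +[^ ]+ +", "", q_chr))

df
```

Working on as.character(df$Question) rather than on the levels keeps the three new columns independent of each other; with 49 questions the same calls apply unchanged, provided every question follows the same prefix and separator layout.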