How do I assign a unique ID based on record identifiers in R?

Time: 2015-02-14 17:59:52

Tags: r data-import

My mission: calculate budget and revenue figures from movie data.

I am reading the data from a text file that basically has the following format:

MV,Movie 1 Name
BT,Budget for Movie 1
GR,Gross Revenue Movie 1

But a movie may or may not have BT or GR records, and sometimes it has several of them, for example:

MV,Movie1
BT,1000000
GR,500000 (week1)
GR,500000 (week2)
GR,500000 (week3)
GR,500000 (week1)
MV,Movie2
BT,10000
GR,50000 (week1)
GR,500000 (week2)
MV,Movie3
MV,Movie4
BT,1000000

The data frame I want to create looks like this:

mID  recType  recData
  1  MV       Movie1
  1  BT       1000000
  1  GR       500000 (week1)
  1  GR       500000 (week2)
  1  GR       500000 (week3)
  1  GR       500000 (week1)
  2  MV       Movie2
  2  BT       10000
  2  GR       50000 (week1)
  2  GR       500000 (week2)
  3  MV       Movie3
  4  MV       Movie4
  4  BT       1000000

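My programmer says to just write a data-cleaning application in Java or .NET that cleans the data before it is imported into R, but I wonder whether the collective wisdom of the internet can help me.

Writing a loop over more than 90K movies is painfully slow to process.

Note: the end goal is to use this data as the primary source for classifying movie profitability, and to cross-reference it with external files, actors, and other data.

(IMDB needs a better data setup)

Thanks!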

1 answer:

Answer 0: (score: 0)

Try

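# Every MV record starts a new movie, so a running count of MV rows gives the ID
df1$mID <- cumsum(df1$recType=='MV')
# Alternative that only works when every title starts with "Movie", as in the sample
#df1$mID <- cumsum(grepl('^Movie', df1$recData))
# Show mID as the first column
df1[,c(3,1:2)]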
#   mID recType        recData
#1    1      MV         Movie1
#2    1      BT        1000000
#3    1      GR 500000 (week1)
#4    1      GR 500000 (week2)
#5    1      GR 500000 (week3)
#6    1      GR 500000 (week1)
#7    2      MV         Movie2
#8    2      BT          10000
#9    2      GR  50000 (week1)
#10   2      GR 500000 (week2)
#11   3      MV         Movie3
#12   4      MV         Movie4
#13   4      BT        1000000
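
The snippet above assumes `df1` already exists with the columns `recType` and `recData`. A minimal sketch of one way to build it from the raw lines follows. The `raw` vector stands in for something like `readLines("movies.txt")` (a hypothetical file name), and the sketch assumes every non-empty line contains at least one comma. It splits at the first comma only, so titles that contain commas themselves stay intact.

```r
# Sample lines standing in for readLines("movies.txt"); the file name is hypothetical
raw <- c("MV,Movie1", "BT,1000000", "GR,500000 (week1)",
         "MV,Movie2", "GR,50000 (week1)", "MV,Movie3")

# Drop empty lines, then find the position of the first comma in each line
raw <- raw[nzchar(raw)]
pos <- regexpr(",", raw, fixed = TRUE)

# Everything before the first comma is the record type, the rest is the data
df1 <- data.frame(
  recType = substr(raw, 1, pos - 1),
  recData = substr(raw, pos + 1, nchar(raw)),
  stringsAsFactors = FALSE
)

# Running count of MV rows gives the movie ID; put it first
df1$mID <- cumsum(df1$recType == "MV")
df1 <- df1[, c("mID", "recType", "recData")]
```

Because `cumsum` and `substr` are vectorised, no explicit loop over the movies is needed.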

Alternatively, the same ID assignment with data.table (which should be faster):

library(data.table)
setDT(df1)[, mID:= cumsum(recType=='MV')][]
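
Once every row carries an `mID`, the budget and revenue figures the question asks for can be summed per movie. The following is only a sketch in base R, assuming each BT and GR value starts with plain digits as in the sample (real IMDB data may carry currency symbols or separators that need extra cleaning). The `amount` column and the `totals` object are names introduced here for illustration.

```r
# Small stand-in for the data frame produced above
df1 <- data.frame(
  mID     = c(1, 1, 1, 1, 2, 2, 3),
  recType = c("MV", "BT", "GR", "GR", "MV", "GR", "MV"),
  recData = c("Movie1", "1000000", "500000 (week1)", "500000 (week2)",
              "Movie2", "50000 (week1)", "Movie3"),
  stringsAsFactors = FALSE
)

# Only BT and GR rows hold amounts; keep their leading digits,
# e.g. "500000 (week1)" becomes 500000
is_amount <- df1$recType %in% c("BT", "GR")
df1$amount <- NA_real_
df1$amount[is_amount] <- as.numeric(
  sub("^([0-9]+).*$", "\\1", df1$recData[is_amount])
)

# Sum per movie and record type; movies without BT or GR rows do not appear
totals <- aggregate(amount ~ mID + recType, data = df1[is_amount, ], FUN = sum)
```

Here `totals` has one row per movie and record type, so a movie such as Movie3 with neither a budget nor a gross record is absent rather than reported as zero.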