R data.table: Reformatting subheadings into a separate column

Time: 2015-11-14 20:33:34

Tags: r data.table data-cleaning

This is a revision of a previous question, updated to include edge cases.

I am trying to clean up a dataset of crime data by giving it better classification labels. A sample of the table looks like this:

d <- as.data.table(read.csv('[filepath]'))
print(d)

Classifications                    ucr_ncic_code
SOVEREIGNTY                        NA
Treason                            101
Treason Misprison                  102
Espionage                          103
Sovereignty                        199
MILITARY (restricted to agencies)  NA
Military Desertion                 201
Military                           299 
IMMIGRATION                        NA
Illegal Entry                      301
False Citizenship                  302
Smuggling Aliens                   303
Immigration                        399
CRIMES AGAINST PERSON              7099
HOMICIDE                           NA
Homicide Family-Gun                901
Homicide Family-Weapon             902
Homicide Nonfam-Gun                903
PROPERTY CRIMES                    7199
<TRUNCATED>

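As you can see, in the original dataset the broader crime classification categories are formatted as all-caps headers, and most of them have an NA code (e.g. SOVEREIGNTY NA). However, some headers contain non-uppercase characters (e.g. MILITARY (restricted to agencies)), and some headers have no subcategories at all and therefore carry a valid code (e.g. CRIMES AGAINST PERSON 7099). What I want to do is reformat the data so that these headers become their own classification column in the table.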

Here is my initial solution. I am almost certain it is not the best approach, but it produces the desired result:

d[,row.num := .I,]
d.categs <- d[toupper(substr(Classifications,1,3))==substr(Classifications,1,3)] 
#the substring is for some edge cases that I don't show here

setnames(d.categs, "Classifications", "Category")
d <- merge(d,d.categs[,row.num,list(Category)],'row.num', all.x=TRUE)
d <- d[order(row.num)]

prev.row <- NA
for (i in seq(1,d[,.N])) {
  current.row <- d$Category[i]  
  if (is.na(current.row) & !(is.na(prev.row))){
    d$Category[i] <- prev.row
  } 
  prev.row <- d$Category[i]
}

#clean up
d <- d[!(is.na(ucr_ncic_code))]
d[,row.num := NULL,]

print(d)

Classifications   ucr_ncic_code   Category
Treason                 101       SOVEREIGNTY
Treason Misprison       102       SOVEREIGNTY
Espionage               103       SOVEREIGNTY
Sovereignty             199       SOVEREIGNTY
Military Desertion      201       MILITARY (restricted to agencies)
Military                299       MILITARY (restricted to agencies)
Illegal Entry           301       IMMIGRATION
False Citizenship       302       IMMIGRATION
Smuggling Aliens        303       IMMIGRATION
Immigration             399       IMMIGRATION
CRIMES AGAINST PERSON   7099      CRIMES AGAINST PERSON
Homicide Family-Gun     901       HOMICIDE
Homicide Family-Weapon  902       HOMICIDE
Homicide Nonfam-Gun     903       HOMICIDE
PROPERTY CRIMES         7199      PROPERTY CRIMES
<TRUNCATED>

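Is there a better way to make this format change using the data.table package? I suspect there is a better way to copy the cells down than the for-loop I devised, but many of the simpler solutions are thwarted by the inconsistent or missing character formatting in the headers and their codes (see the previous question).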

1 answer:

Answer 0: (score: 3)

It should take only one line:

dt[,Category := Classifications[(x=grepl("^[A-Z]{2,}", Classifications))][cumsum(x)]][]
#                       Classifications ucr_ncic_code                          Category
#  1:                       SOVEREIGNTY            NA                       SOVEREIGNTY
#  2:                           Treason           101                       SOVEREIGNTY
#  3:                 Treason Misprison           102                       SOVEREIGNTY
#  4:                         Espionage           103                       SOVEREIGNTY
#  5:                       Sovereignty           199                       SOVEREIGNTY
#  6: MILITARY (restricted to agencies)            NA MILITARY (restricted to agencies)
#  7:                Military Desertion           201 MILITARY (restricted to agencies)
#  8:                          Military           299 MILITARY (restricted to agencies)
#  9:                       IMMIGRATION            NA                       IMMIGRATION
# 10:                     Illegal Entry           301                       IMMIGRATION
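
As a minimal sketch, the same idea can be run end to end on rows copied from the question, including the edge cases that the output above does not reach. The names `is_header` and `result` are only illustrative, and the final filter on `ucr_ncic_code` is an assumption based on the cleanup step in the question. The sketch also assumes the first row is a header; otherwise `cumsum` would start at 0 and the index would not line up.

```r
library(data.table)

# Sample rows copied from the question, including the edge cases.
dt <- data.table(
  Classifications = c(
    "SOVEREIGNTY", "Treason", "Treason Misprison", "Espionage", "Sovereignty",
    "MILITARY (restricted to agencies)", "Military Desertion", "Military",
    "CRIMES AGAINST PERSON", "HOMICIDE", "Homicide Family-Gun"
  ),
  ucr_ncic_code = c(NA, 101, 102, 103, 199, NA, 201, 299, 7099, NA, 901)
)

# Flag header rows: two or more uppercase letters at the start of the label.
is_header <- grepl("^[A-Z]{2,}", dt$Classifications)

# Repeat each header down to the row before the next header.
dt[, Category := Classifications[is_header][cumsum(is_header)]]

# Drop headers that have no code of their own (assumed cleanup step).
result <- dt[!is.na(ucr_ncic_code)]
print(result)
```

Headers with a valid code, such as CRIMES AGAINST PERSON, stay in the table and are labeled with themselves, which matches the expected output in the question.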

Explanation

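The idea is to build an index that marks where the category changes. We need a pattern that identifies each change, such as "^[A-Z]{2,}". This is a simple regular expression that matches two or more uppercase letters at the start of Classifications. Once the header rows are identified, we can take the cumulative sum of that index. It sounds odd at first, but what happens behind the scenes is a logical-to-numeric conversion: each TRUE becomes 1 and each FALSE becomes 0. Added up, the result is an index for subsetting the headers (i.e. 1 1 1 2 2 3 3 3...).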
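
The intermediate vectors can be inspected directly. The snippet below is a small sketch on a shortened list of labels taken from the question; `labels`, `is_header`, and `group_id` are illustrative names, not part of the original answer.

```r
# A shortened run of labels from the question.
labels <- c("SOVEREIGNTY", "Treason", "Espionage",
            "MILITARY", "Military",
            "IMMIGRATION", "Illegal Entry", "Immigration")

# TRUE where a label starts with two or more uppercase letters.
is_header <- grepl("^[A-Z]{2,}", labels)

# Each TRUE adds 1, so every row gets the number of its header.
group_id <- cumsum(is_header)
print(group_id)  # 1 1 1 2 2 3 3 3

# Indexing the headers by group number repeats each header per row.
filled <- labels[is_header][group_id]
print(filled)
```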

I should also mention an R trick used there: I create a new variable and use it on the same line. For example, in R, (x=1+1) + x evaluates to 4.
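
A short sketch of that trick outside data.table, with made-up values (`total`, `v`, `h`, and `filled` are illustrative names): wrapping the assignment in parentheses makes it an expression that both assigns and returns its value, so the variable is available later in the same line.

```r
# Parentheses turn `x = 1 + 1` into an assignment that also returns 2.
total <- (x = 1 + 1) + x
print(total)  # 4

# The same pattern with a logical index: `h` is created inside the first
# subset and reused by cumsum() in the second one.
v <- c("AA", "b", "c", "DD", "e")
filled <- v[(h = grepl("^[A-Z]{2,}", v))][cumsum(h)]
print(filled)
```

This is compact, but assigning the index to a named variable on its own line first is arguably easier to read and debug.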