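Revised from a previous question to include edge cases.
I am trying to clean up a dataset of crime data by giving it better classification labels. A sample of the table looks like this: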
d <- as.data.table(read.csv('[filepath]'))
print(d)
Classifications ucr_ncic_code
SOVEREIGNTY NA
Treason 101
Treason Misprison 102
Espionage 103
Sovereignty 199
MILITARY (restricted to agencies) NA
Military Desertion 201
Military 299
IMMIGRATION NA
Illegal Entry 301
False Citizenship 302
Smuggling Aliens 303
Immigration 399
CRIMES AGAINST PERSON 7099
HOMICIDE NA
Homicide Family-Gun 901
Homicide Family-Weapon 902
Homicide Nonfam-Gun 903
PROPERTY CRIMES 7199
<TRUNCATED>
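As you can see, in the original dataset the broader crime categories are formatted as all-caps headers, and most of them have an NA code (e.g. `SOVEREIGNTY NA`). However, some headers contain non-uppercase characters (e.g. `MILITARY (restricted to agencies)`), and some headers have no subcategories and therefore carry a valid code (e.g. `CRIMES AGAINST PERSON 7099`). What I want to do is reformat the data so that these headers become their own category column in the table.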
Here is my initial solution. I am almost certain it is not the best approach, but it produces the desired result:
d[,row.num := .I,]
d.categs <- d[toupper(substr(Classifications,1,3))==substr(Classifications,1,3)]
#the substring is for some edge cases that I don't show here
setnames(d.categs, "Classifications", "Category")
d <- merge(d,d.categs[,row.num,list(Category)],'row.num', all.x=TRUE)
d <- d[order(row.num)]
# carry the last seen Category forward into rows where it is NA
prev.row <- NA
for (i in seq_len(nrow(d))) {
  current.row <- d$Category[i]
  if (is.na(current.row) && !is.na(prev.row)) {
    d$Category[i] <- prev.row
  }
  prev.row <- d$Category[i]
}
#clean up
d <- d[!(is.na(ucr_ncic_code))]
d[,row.num := NULL,]
print(d)
Classifications ucr_ncic_code Category
Treason 101 SOVEREIGNTY
Treason Misprison 102 SOVEREIGNTY
Espionage 103 SOVEREIGNTY
Sovereignty 199 SOVEREIGNTY
Military Desertion 201 MILITARY (restricted to agencies)
Military 299 MILITARY (restricted to agencies)
Illegal Entry 301 IMMIGRATION
False Citizenship 302 IMMIGRATION
Smuggling Aliens 303 IMMIGRATION
Immigration 399 IMMIGRATION
CRIMES AGAINST PERSON 7099 CRIMES AGAINST PERSON
Homicide Family-Gun 901 HOMICIDE
Homicide Family-Weapon 902 HOMICIDE
Homicide Nonfam-Gun 903 HOMICIDE
PROPERTY CRIMES 7199 PROPERTY CRIMES
<TRUNCATED>
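What is a better way to make this format change using the data.table package? I suspect there is a better way to copy the values down than the for-loop I devised, but many of the simpler solutions are thwarted by the inconsistent or missing character formatting of the headers and their codes (see the previous question).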
Answer 0 (score: 3)
It should take just one line:
dt[,Category := Classifications[(x=grepl("^[A-Z]{2,}", Classifications))][cumsum(x)]][]
# Classifications ucr_ncic_code Category
# 1: SOVEREIGNTY NA SOVEREIGNTY
# 2: Treason 101 SOVEREIGNTY
# 3: Treason Misprison 102 SOVEREIGNTY
# 4: Espionage 103 SOVEREIGNTY
# 5: Sovereignty 199 SOVEREIGNTY
# 6: MILITARY (restricted to agencies) NA MILITARY (restricted to agencies)
# 7: Military Desertion 201 MILITARY (restricted to agencies)
# 8: Military 299 MILITARY (restricted to agencies)
# 9: IMMIGRATION NA IMMIGRATION
# 10: Illegal Entry 301 IMMIGRATION
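The output above still contains the header rows with NA codes, whereas the question's desired result drops them. A minimal sketch of that last step, assuming the data.table package is installed and using a small hand-built table (the rows are copied from the question's sample; the name `res` is only illustrative):

```r
library(data.table)

# Small illustrative table; values copied from the question's sample
dt <- data.table(
  Classifications = c("SOVEREIGNTY", "Treason",
                      "MILITARY (restricted to agencies)", "Military Desertion",
                      "CRIMES AGAINST PERSON", "HOMICIDE", "Homicide Family-Gun"),
  ucr_ncic_code = c(NA, 101L, NA, 201L, 7099L, NA, 901L)
)

# The one-liner from this answer: label every row with the most recent header
dt[, Category := Classifications[(x = grepl("^[A-Z]{2,}", Classifications))][cumsum(x)]]

# Drop the header rows that carry no code; headers with a valid code
# (such as CRIMES AGAINST PERSON) stay and are labeled with themselves
res <- dt[!is.na(ucr_ncic_code)]
print(res)
```

Filtering on `ucr_ncic_code` rather than on the regular expression is what keeps code-bearing headers in the result, which matches the expected output shown in the question.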
Explanation
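The idea is to build an index that marks where the category changes. We need a pattern that identifies each change, such as `"^[A-Z]{2,}"`. This is a simple regular expression that matches two or more uppercase letters at the start of `Classifications`. Once the header rows are identified, we can take the cumulative sum of that logical index. It sounds odd at first, but what happens behind the scenes is a conversion from logical to numeric: each `TRUE` becomes `1`. Added up, the values form a subsetting index (something like `1 1 1 2 2 3 3 3...`). Subsetting the vector of headers, `Classifications[x]`, by this index repeats each header once for every row in its group.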
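The index construction can be traced step by step in base R. This is only a sketch on a toy vector (the variable names `is_header`, `group_id`, `headers` and `category` are made up for illustration and do not appear in the answer):

```r
# Toy vector loosely modeled on the question's sample
classifications <- c("SOVEREIGNTY", "Treason", "Espionage",
                     "MILITARY (restricted to agencies)", "Military Desertion",
                     "CRIMES AGAINST PERSON")

# TRUE where the string starts with two or more uppercase letters
is_header <- grepl("^[A-Z]{2,}", classifications)

# Logical -> numeric: each TRUE adds 1, so the running total numbers the groups
group_id <- cumsum(is_header)

# Keep only the header strings, then pick one header per row via the group index
headers <- classifications[is_header]
category <- headers[group_id]

print(group_id)
print(category)
```

Note that this relies on the first row being a header. If the data started with a non-header row, its group index would be 0, and R silently drops 0 indices, so the result would come out shorter than the input.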
I should also mention an R trick used there: I create a new variable and use it within the same line. For example, `(x=1+1) + x` gives `4` in R.
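A short sketch of that inline-assignment trick, first on its own and then in the same subset-and-reindex shape the one-liner uses (the names `result`, `v`, `keep` and `picked` are illustrative only):

```r
# The parentheses turn `x = 1 + 1` into an assignment that also returns its value;
# the left operand is evaluated first, so x already exists when the right one is read
result <- (x = 1 + 1) + x
print(result)

# Same pattern inside a subset: `keep` is assigned in the first `[`
# and reused in the second one
v <- c("A", "b", "C")
picked <- v[(keep = v == toupper(v))][cumsum(keep)]
print(picked)
```

Without the parentheses, `x = ...` inside a call would be read as a named argument instead of an assignment, so they are essential here. The trick also depends on left-to-right evaluation, which is why some readers may prefer two separate statements for clarity.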