Question


The data looks like this:

     Name                                   Text idx             c_org
1:   John                      Text contains MIT   1               MIT
2: Sussan     some text with Stanford University   2          Stanford
3:   Bill He graduated from Yale, MIT, Stanford.   3 MIT,Yale,Stanford
4:   Bill                              some text   4

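For the column c_org, if a field holds multiple values (as in row 3: MIT,Yale,Stanford), I want to keep the first value, MIT, as the column value. The result should look like this: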

     Name                                   Text idx            NewOrg
1:   John                      Text contains MIT   1               MIT
2: Sussan     some text with Stanford University   2          Stanford
3:   Bill He graduated from Yale, MIT, Stanford.   3               MIT
4:   Bill                              some text   4

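(Note that in the c_org column some fields contain multiple values and some are empty. In the expected output, if there is only one value, keep it; if there is more than one, keep the first; if the field is empty, leave it empty.)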

I tried this (but it failed):

DT[ , str_split(c_org, ",")[[1]][1]]

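I think data with multiple values in a single field is quite common. How can this be done in data.table? (Or in some other way, if that solution is better than data.table.)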

Answer 1

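We can use sub to match a comma followed by zero or more characters (.*) up to the end of the string ($) in the 'c_org' column, and replace the match with ''. The output can be assigned (:=) to create the column 'NewOrg', and 'c_org' can then be assigned NULL to drop it.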

DT[, NewOrg := sub(',.*$', '', c_org)][,c_org:= NULL]
DT
#     Name                                   Text idx   NewOrg
#1:   John                      Text contains MIT   1      MIT
#2: Sussan     some text with Stanford University   2 Stanford
#3:   Bill He graduated from Yale, MIT, Stanford.   3      MIT
#4:   Bill                              some text   4
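
To make the sub approach reproducible, here is a minimal sketch that rebuilds the example table first. The construction of DT is an assumption based on the printed data in the question; in particular, the empty c_org in row 4 is assumed to be an empty string rather than NA.

```r
library(data.table)

# Rebuild the example table; the empty c_org in row 4 is assumed to be "".
DT <- data.table(
  Name  = c("John", "Sussan", "Bill", "Bill"),
  Text  = c("Text contains MIT",
            "some text with Stanford University",
            "He graduated from Yale, MIT, Stanford.",
            "some text"),
  idx   = 1:4,
  c_org = c("MIT", "Stanford", "MIT,Yale,Stanford", "")
)

# Drop the first comma and everything after it, then remove the old column.
DT[, NewOrg := sub(",.*$", "", c_org)][, c_org := NULL]
print(DT)
```

If the empty fields are NA instead of "", sub leaves them as NA, so the same call still applies.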

Another option, in data.table v1.9.6+, is tstrsplit:

DT[, NewOrg := tstrsplit(c_org, ',', fill='')[[1]]][, c_org:= NULL]
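
For context, the attempt in the question returns a single value because [[1]] selects the split pieces of the first row only, and [1] then takes the first of those pieces. The sketch below contrasts that with a per-row extraction and with the tstrsplit form. It uses base strsplit in place of stringr's str_split, and the one-column DT is a hypothetical stand-in for the real table, with "" modelling the empty field.

```r
library(data.table)

# Hypothetical stand-in for the c_org column; "" models the empty field.
DT <- data.table(c_org = c("MIT", "Stanford", "MIT,Yale,Stanford", ""))

# Base strsplit() is used here in place of stringr's str_split().
parts <- strsplit(DT$c_org, ",", fixed = TRUE)

# What the attempt in the question computes: the first piece of row 1 only.
first_row_only <- parts[[1]][1]

# First piece of every row; an empty field splits to character(0),
# so it is mapped back to "" explicitly.
first_each <- vapply(parts,
                     function(p) if (length(p)) p[1] else "",
                     character(1))

# The tstrsplit form, for comparison: fill = "" pads the shorter rows.
DT[, NewOrg := tstrsplit(c_org, ",", fill = "")[[1]]]
```

Here tstrsplit splits every row and transposes the result, so its first element already holds the first piece of each row.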

How do I select the first comma-separated value in a column in data.table?
