Extract specific words from a column in R using regex or substring

Time: 2018-05-02 14:46:32

Tags: r regex text substr

I have the following data:

Opex_Spend_Month    Opex_Spend_YTD  Major_Category      NBS_Region  Sub_Category
92179.84            113542.84       Contingent Labour   EUROPE  TEMP:OTH.CONT.WORKER
297.82              82392.82        Contingent Labour   EUROPE  TEMP:OTH.CONT.WORKER
13974.8             34917.8         Contingent Labour   EUROPE  TEMP:OTH.CONT.WORKER
138.6               63125.6         Contingent Labour   EUROPE  TEMP:OTH.CONT.WORKER
NA                  73097           Contingent Labour   EUROPE  TEMP:MSP NON IT
NA                  96035           Contingent Labour   EUROPE  TEMP:MSP NON IT
1388.65             68934.65        Contingent Labour   EUROPE  TEMP:MSP NON IT
5393.76             18748.76        Contingent Labour   EUROPE  TEMP:MSP IT
528.38              82195.38        Contingent Labour   EUROPE  TEMP:MSP IT
22369               95468           Contingent Labour   EUROPE  TEMP:MSP IT

From the Sub_Category column I want to extract the last part of each value: Cont Worker, Non IT and IT. I am not sure what kind of regex or substring function to use.

Desired output:

Opex_Spend_Month    Opex_Spend_YTD  Major_Category  NBS_Region  Sub_Category            Category
92179.84            113542.84       Contingent Labour   EUROPE  TEMP:OTH.CONT.WORKER    Cont Worker
297.82              82392.82        Contingent Labour   EUROPE  TEMP:OTH.CONT.WORKER    Cont Worker
13974.8             34917.8         Contingent Labour   EUROPE  TEMP:OTH.CONT.WORKER    Cont Worker
138.6               63125.6         Contingent Labour   EUROPE  TEMP:OTH.CONT.WORKER    Cont Worker
NA                  73097           Contingent Labour   EUROPE  TEMP:MSP NON IT         Non IT
NA                  96035           Contingent Labour   EUROPE  TEMP:MSP NON IT         Non IT
1388.65             68934.65        Contingent Labour   EUROPE  TEMP:MSP NON IT         Non IT
5393.76             18748.76        Contingent Labour   EUROPE  TEMP:MSP IT             IT
528.38              82195.38        Contingent Labour   EUROPE  TEMP:MSP IT             IT
22369               95468           Contingent Labour   EUROPE  TEMP:MSP IT             IT

Can someone help me with this?

3 Answers:

Answer 0: (score: 1)

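We can use str_extract from the stringr package:

library(stringr)
str_extract(df1$Sub_Category, "(CONT\\.WORKER|NON IT|IT)$")

Note that this returns the matched text as it appears in the column, so the first group comes back as CONT.WORKER with the dot intact.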
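As a rough sketch of how that pattern behaves, the snippet below applies it to the three distinct Sub_Category values from the question. It uses base R's regexpr() and regmatches() instead of stringr so that it runs without extra packages. The names x, pattern, hits and category are made up for illustration, and the last step that swaps the dot for a space is an addition, not part of the answer.

```r
# Hypothetical vector holding the three distinct values from the question
x <- c("TEMP:OTH.CONT.WORKER", "TEMP:MSP NON IT", "TEMP:MSP IT")

# Same pattern as in the answer: one of three endings, anchored at the end
pattern <- "(CONT\\.WORKER|NON IT|IT)$"

# regexpr() locates the first match in each string; regmatches() extracts it.
# Caveat: strings without a match are dropped here, whereas str_extract()
# would return NA for them.
hits <- regmatches(x, regexpr(pattern, x))

# Added step (not in the answer): replace the literal dot with a space
category <- gsub(".", " ", hits, fixed = TRUE)
print(category)
```

Because regmatches() drops non-matching elements, this form is only safe to assign back to a data frame column when every row is known to match.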

Answer 1: (score: 1)

You can do:

 gsub(".*?(\\.|\\s)(\\w+)","\\2 ",dat$Sub_Category)

Here is an example. Select just the last two columns (5:6) to see the result:

transform(dat,category=gsub(".*?(\\.|\\s)(\\w+)","\\2 ",Sub_Category))[5:6]
           Sub_Category     category
1  TEMP:OTH.CONT.WORKER CONT WORKER 
2  TEMP:OTH.CONT.WORKER CONT WORKER 
3  TEMP:OTH.CONT.WORKER CONT WORKER 
4  TEMP:OTH.CONT.WORKER CONT WORKER 
5       TEMP:MSP NON IT      NON IT 
6       TEMP:MSP NON IT      NON IT 
7       TEMP:MSP NON IT      NON IT 
8           TEMP:MSP IT          IT 
9           TEMP:MSP IT          IT 
10          TEMP:MSP IT          IT 
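
The values printed above each end in a trailing space, because the replacement string is "\\2 ". A minimal sketch, assuming the same three sample values, wraps the call in trimws() to remove it. The names x, raw and category are hypothetical.

```r
# Hypothetical vector holding the three distinct values from the question
x <- c("TEMP:OTH.CONT.WORKER", "TEMP:MSP NON IT", "TEMP:MSP IT")

# The gsub() call from the answer: keep each word that follows a dot or a
# whitespace character, and append a space after it
raw <- gsub(".*?(\\.|\\s)(\\w+)", "\\2 ", x)

# Added step: strip the trailing space left by the replacement
category <- trimws(raw)
print(category)
```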

Answer 2: (score: 0)

In base R:

df$Category = trimws(gsub('([A-Z]+:[A-Z]+|\\.)', ' ', df$Sub_Category))
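
The regex approaches above all return upper-case text (CONT WORKER, NON IT, IT), while the desired output in the question uses mixed case (Cont Worker, Non IT, IT). One possible way to close that gap, sketched here under the assumption that the set of categories is small and known in advance, is a named-vector lookup applied after the extraction. The sample data frame and the labels vector are hypothetical and not taken from any of the answers.

```r
# Hypothetical sample with the three distinct Sub_Category values
df <- data.frame(
  Sub_Category = c("TEMP:OTH.CONT.WORKER", "TEMP:MSP NON IT", "TEMP:MSP IT"),
  stringsAsFactors = FALSE
)

# Step 1: the base R expression from the answer, giving upper-case keys
key <- trimws(gsub('([A-Z]+:[A-Z]+|\\.)', ' ', df$Sub_Category))

# Step 2: hypothetical lookup table from upper-case keys to display labels
labels <- c("CONT WORKER" = "Cont Worker", "NON IT" = "Non IT", "IT" = "IT")

# Keys that are missing from the table become NA rather than raising an error
df$Category <- unname(labels[key])
print(df)
```

A lookup table is used instead of a generic title-case conversion because IT has to stay fully upper case.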