How to remove NA patterns from a string

Asked: 2017-04-17 07:20:45

Tags: r string

I have an R data frame column that contains the following text:

ClientID          Recom
 ABC              1:Teck|Scrip:ABC|Call:Buy||2:NA|Scrip:NA|Call:NA
 DEF              1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy||3:IT|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA||5:NA|Scrip:NA|Call:NA
 WER              1:CDGS|Scrip:WERT|Call:Sell||2:IT|Scrip:QWS|Call:Buy||3:Industrials|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA

I want to remove the NA entries from the pattern above. The desired data frame would be:

ClientID          Recom
 ABC              1:Teck|Scrip:ABC|Call:Buy||
 DEF              1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy||3:IT|Scrip:QAS|Call:Buy||
 WER              1:CDGS|Scrip:WERT|Call:Sell||2:IT|Scrip:QWS|Call:Buy||3:Industrials|Scrip:QAS|Call:Buy||

I tried the following gsub in R, but it does not seem to work.

df$Recom <- gsub("\\s*[|]+\\NA\\s+.*", "", df$Recom)

How can I do this?

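3 answers:

Answer 0: (score: 1)

From the way your strings are set up, it looks like every entry after the first NA entry is also NA. If that is the case, then: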

gsub('[0-9]+:NA.*', '', df$Recom)
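
A minimal sketch of how this pattern behaves on two of the sample strings from the question (the vector name `recom` is made up for illustration). It relies on the assumption stated above that no real entry follows the first NA entry: everything from the first `<digits>:NA` to the end of the string is dropped, so the trailing `||` is kept, as in the desired output.

```r
# Two of the sample strings from the question; `recom` is an illustrative name.
recom <- c(
  "1:Teck|Scrip:ABC|Call:Buy||2:NA|Scrip:NA|Call:NA",
  "1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy||3:IT|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA||5:NA|Scrip:NA|Call:NA"
)

# Remove everything from the first "<digits>:NA" to the end of each string.
cleaned <- gsub("[0-9]+:NA.*", "", recom)

cleaned[1]
# "1:Teck|Scrip:ABC|Call:Buy||"
```

If a real value could start with `NA` (a hypothetical example would be a kind named `NASDAQ`), the pattern would cut the string there as well, so it is worth checking the data first.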

You can also use strsplit together with grepl:

sapply(strsplit(df$Recom, '\\|\\|'), function(i)paste(i[!grepl('NA', i)], collapse = '||'))
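
Two things to keep in mind with the split-and-filter version: it does not keep the trailing `||`, and `grepl('NA', i)` drops any entry that contains the letters `NA` anywhere. The sketch below is a stricter variant, assuming every empty entry looks exactly like `<digits>:NA|Scrip:NA|Call:NA`; the helper name `drop_na_entries` and the `NASDAQ` value are hypothetical and only serve the illustration.

```r
# Hypothetical helper: drop only entries whose three fields are all NA.
drop_na_entries <- function(x) {
  # Split each string into its "||"-separated entries (literal split, no regex).
  parts <- strsplit(x, "||", fixed = TRUE)
  vapply(parts, function(p) {
    # An entry is dropped only if it matches the all-NA layout exactly.
    keep <- !grepl("^[0-9]+:NA\\|Scrip:NA\\|Call:NA$", p)
    paste(p[keep], collapse = "||")
  }, character(1))
}

# "NASDAQ" contains "NA" but is kept, because only all-NA entries are removed.
res <- drop_na_entries("1:Teck|Scrip:NASDAQ|Call:Buy||2:NA|Scrip:NA|Call:NA")
res
# "1:Teck|Scrip:NASDAQ|Call:Buy"
```

Using `vapply` with `character(1)` instead of `sapply` guarantees a character vector of the same length as the input, which makes it safe to assign back to `df$Recom`.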

Answer 1: (score: 1)

df$Recom <- lapply( strsplit( df$Recom, split = '||', fixed = TRUE),
                    grep, 
                    pattern = 'NA',
                    invert = TRUE,
                    value = TRUE )

df
#   ClientID   Recom
# 1      ABC   1:Teck|Scrip:ABC|Call:Buy
# 2      DEF   1:CG|Scrip:WERT|Call:Buy, 2:CDGS|Scrip:QWS|Call:Buy, 3:IT|Scrip:QAS|Call:Buy
# 3      WER   1:CDGS|Scrip:WERT|Call:Sell, 2:IT|Scrip:QWS|Call:Buy, 3:Industrials|Scrip:QAS|Call:Buy
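
Note that after the assignment above, `Recom` is a list column (each row holds a character vector), which is why the printed entries are separated by commas. If a plain character column is preferred, the pieces can be pasted back together. This is a small sketch under that assumption, using a shortened two-row copy of the data and an intermediate variable `kept` that is not part of the original answer:

```r
# Shortened, illustrative copy of the data.
df <- data.frame(
  ClientID = c("ABC", "DEF"),
  Recom = c("1:Teck|Scrip:ABC|Call:Buy||2:NA|Scrip:NA|Call:NA",
            "1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy||3:NA|Scrip:NA|Call:NA"),
  stringsAsFactors = FALSE
)

# Same filtering step as above: keep the entries that do not contain "NA".
kept <- lapply(strsplit(df$Recom, split = "||", fixed = TRUE),
               grep, pattern = "NA", invert = TRUE, value = TRUE)

# Collapse each character vector back into a single "||"-separated string.
df$Recom <- vapply(kept, paste, character(1), collapse = "||")

df$Recom[2]
# "1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy"
```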

Data:

df <- structure(list(ClientID = c("ABC", "DEF", "WER"), 
                     Recom = c("1:Teck|Scrip:ABC|Call:Buy||2:NA|Scrip:NA|Call:NA", 
                               "1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy||3:IT|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA||5:NA|Scrip:NA|Call:NA", 
                               "1:CDGS|Scrip:WERT|Call:Sell||2:IT|Scrip:QWS|Call:Buy||3:Industrials|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA"
                     )), 
                .Names = c("ClientID", "Recom"), 
                row.names = c(NA, -3L), 
                class = "data.frame")

Answer 2: (score: 1)

You seem to have several kinds of information embedded in the Recom column. To clean up the data, you could also do the following:

library(splitstackshape) # will automatically also load the 'data.table' package
dt <- cSplit(
        cSplit(
          cSplit(df, 'Recom', sep = '||', 'long'), 
          'Recom', sep = '|', 'long'
        ),
        'Recom', sep = ':', 'wide'
      )[Recom_2 != 'NA'
        ][, num := cumsum(grepl('\\d+', Recom_1)), ClientID
          ][grepl('\\d+', Recom_1), Recom_1 := 'kind']

dcast(dt, ClientID + num ~ Recom_1, value.var = 'Recom_2')

which gives:

   ClientID num Call Scrip        kind
1:      ABC   1  Buy   ABC        Teck
2:      DEF   1  Buy  WERT          CG
3:      DEF   2  Buy   QWS        CDGS
4:      DEF   3  Buy   QAS          IT
5:      WER   1 Sell  WERT        CDGS
6:      WER   2  Buy   QWS          IT
7:      WER   3  Buy   QAS Industrials
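
For readers who would rather avoid the extra packages, roughly the same long-format table can be sketched in base R. This is not the method from the answer above, only an approximation under two assumptions: every entry has exactly three `|`-separated fields in the order kind, Scrip, Call, and `num` is simply the position of the entry within its string. The helper name `parse_recom` and the shortened data are illustrative.

```r
# Shortened, illustrative copy of the data.
df <- data.frame(
  ClientID = c("ABC", "DEF"),
  Recom = c("1:Teck|Scrip:ABC|Call:Buy||2:NA|Scrip:NA|Call:NA",
            "1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy||3:NA|Scrip:NA|Call:NA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn one Recom string into one row per non-NA entry.
parse_recom <- function(id, s) {
  entries <- strsplit(s, "||", fixed = TRUE)[[1]]
  fields  <- strsplit(entries, "|", fixed = TRUE)
  # Strip the "label:" prefix from each field; assumes exactly 3 fields per entry.
  vals <- t(vapply(fields, function(f) sub("^[^:]*:", "", f), character(3)))
  out <- data.frame(ClientID = id, num = seq_along(entries),
                    kind = vals[, 1], Scrip = vals[, 2], Call = vals[, 3],
                    stringsAsFactors = FALSE)
  out[out$kind != "NA", ]
}

result <- do.call(rbind, Map(parse_recom, df$ClientID, df$Recom))
rownames(result) <- NULL
result
```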