I have an R data frame with a column containing the following text:
ClientID Recom
ABC 1:Teck|Scrip:ABC|Call:Buy||2:NA|Scrip:NA|Call:NA
DEF 1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy||3:IT|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA||5:NA|Scrip:NA|Call:NA
WER 1:CDGS|Scrip:WERT|Call:Sell||2:IT|Scrip:QWS|Call:Buy||3:Industrials|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA
I want to remove the NA entries from the pattern above. The expected data frame would be:
ClientID Recom
ABC 1:Teck|Scrip:ABC|Call:Buy||
DEF 1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy||3:IT|Scrip:QAS|Call:Buy||
WER 1:CDGS|Scrip:WERT|Call:Sell||2:IT|Scrip:QWS|Call:Buy||3:Industrials|Scrip:QAS|Call:Buy||
I tried the following gsub in R, but it doesn't seem to work.
df$Recom <- gsub("\\s*[|]+\\NA\\s+.*", "", df$Recom)
How should I do this?
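Answer 0 (score: 1)
From the way your strings are set up, it looks like every entry after the first NA is also NA. If that is the case, you can drop everything from the first NA entry onwards: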
gsub('[0-9]+:NA.*', '', df$Recom)
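As a quick check of that pattern, here is a minimal sketch on two of the sample strings from the question. The vector name recom is just for illustration. Note that the pattern would also cut at an entry whose kind merely starts with "NA", so it relies on the assumption that only real NA entries look like that.

```r
# Two of the sample strings from the question
recom <- c(
  "1:Teck|Scrip:ABC|Call:Buy||2:NA|Scrip:NA|Call:NA",
  "1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy||3:IT|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA||5:NA|Scrip:NA|Call:NA"
)

# Remove everything from the first "<number>:NA" entry to the end of the string.
# The "||" separator in front of that entry is not part of the match, so it stays.
cleaned <- gsub("[0-9]+:NA.*", "", recom)

cleaned[1]
# "1:Teck|Scrip:ABC|Call:Buy||"
```

The trailing "||" is kept, which matches the expected output in the question.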
You can also use strsplit and grepl:
sapply(strsplit(df$Recom, '\\|\\|'), function(i)paste(i[!grepl('NA', i)], collapse = '||'))
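One caveat with grepl('NA', i) is that it drops any entry containing the letters "NA" anywhere, for example a scrip called "NAV". If that could happen in your data, a stricter variant might look like the sketch below. The helper name drop_na_entries is made up for this example, and it assumes an entry counts as NA only when its kind field is literally NA. It also re-appends the trailing "||" so the result matches the expected output in the question.

```r
recom <- c(
  "1:Teck|Scrip:ABC|Call:Buy||2:NA|Scrip:NA|Call:NA",
  "1:CDGS|Scrip:NAV|Call:Sell||2:NA|Scrip:NA|Call:NA"  # "NAV" is a made-up scrip
)

# Hypothetical helper: keep only entries whose kind is not literally "NA"
drop_na_entries <- function(x) {
  entries <- strsplit(x, "||", fixed = TRUE)
  vapply(entries, function(e) {
    # An NA entry starts with "<number>:NA|"
    keep <- e[!grepl("^[0-9]+:NA\\|", e)]
    if (length(keep) == 0) "" else paste0(paste(keep, collapse = "||"), "||")
  }, character(1))
}

cleaned <- drop_na_entries(recom)
cleaned[2]
# "1:CDGS|Scrip:NAV|Call:Sell||"
```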
Answer 1 (score: 1)
df$Recom <- lapply( strsplit( df$Recom, split = '||', fixed = TRUE),
                    grep,
                    pattern = 'NA',
                    invert = TRUE,
                    value = TRUE )
df
# ClientID Recom
# 1 ABC 1:Teck|Scrip:ABC|Call:Buy
# 2 DEF 1:CG|Scrip:WERT|Call:Buy, 2:CDGS|Scrip:QWS|Call:Buy, 3:IT|Scrip:QAS|Call:Buy
# 3 WER 1:CDGS|Scrip:WERT|Call:Sell, 2:IT|Scrip:QWS|Call:Buy, 3:Industrials|Scrip:QAS|Call:Buy
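Note that Recom is now a list column, which is why the entries print separated by commas. If you would rather have a plain character column again, one way (a sketch, not the only option) is to paste each element back together with the original separator:

```r
df <- data.frame(
  ClientID = c("ABC", "DEF"),
  Recom = c(
    "1:Teck|Scrip:ABC|Call:Buy||2:NA|Scrip:NA|Call:NA",
    "1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy||3:NA|Scrip:NA|Call:NA"
  ),
  stringsAsFactors = FALSE
)

# Same filtering as above, kept in a temporary list
kept <- lapply(strsplit(df$Recom, split = "||", fixed = TRUE),
               grep, pattern = "NA", invert = TRUE, value = TRUE)

# Collapse each list element back into a single "||"-separated string
df$Recom <- vapply(kept, paste, character(1), collapse = "||")

df$Recom[2]
# "1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy"
```

vapply is used instead of sapply so the result is guaranteed to be a character vector with one string per row, even when a row has no entries left.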
Data:
df <- structure(list(ClientID = c("ABC", "DEF", "WER"),
Recom = c("1:Teck|Scrip:ABC|Call:Buy||2:NA|Scrip:NA|Call:NA",
"1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy||3:IT|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA||5:NA|Scrip:NA|Call:NA",
"1:CDGS|Scrip:WERT|Call:Sell||2:IT|Scrip:QWS|Call:Buy||3:Industrials|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA"
)),
.Names = c("ClientID", "Recom"),
row.names = c(NA, -3L),
class = "data.frame")
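Answer 2 (score: 1)
You seem to have several kinds of information embedded in the Recom column. To clean up the data, you could also reshape it into one row per recommendation: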
library(splitstackshape) # will automatically also load the 'data.table' package
dt <- cSplit(
  cSplit(
    cSplit(df, 'Recom', sep = '||', 'long'),
    'Recom', sep = '|', 'long'
  ),
  'Recom', sep = ':', 'wide'
)[Recom_2 != 'NA'
  ][, num := cumsum(grepl('\\d+', Recom_1)), ClientID
    ][grepl('\\d+', Recom_1), Recom_1 := 'kind']
dcast(dt, ClientID + num ~ Recom_1, value.var = 'Recom_2')
which gives:
ClientID num Call Scrip kind
1: ABC 1 Buy ABC Teck
2: DEF 1 Buy WERT CG
3: DEF 2 Buy QWS CDGS
4: DEF 3 Buy QAS IT
5: WER 1 Sell WERT CDGS
6: WER 2 Buy QWS IT
7: WER 3 Buy QAS Industrials
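If you would rather not add a package dependency, roughly the same table can be built in base R. This is only a sketch: it assumes every entry has exactly the three fields "n:kind|Scrip:x|Call:y", and the names long, parts, field and tidy are made up for this example. The column order differs from the dcast output.

```r
df <- data.frame(
  ClientID = c("ABC", "DEF", "WER"),
  Recom = c(
    "1:Teck|Scrip:ABC|Call:Buy||2:NA|Scrip:NA|Call:NA",
    "1:CG|Scrip:WERT|Call:Buy||2:CDGS|Scrip:QWS|Call:Buy||3:IT|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA||5:NA|Scrip:NA|Call:NA",
    "1:CDGS|Scrip:WERT|Call:Sell||2:IT|Scrip:QWS|Call:Buy||3:Industrials|Scrip:QAS|Call:Buy||4:NA|Scrip:NA|Call:NA"
  ),
  stringsAsFactors = FALSE
)

# One row per "||"-separated entry
entries <- strsplit(df$Recom, "||", fixed = TRUE)
long <- data.frame(
  ClientID = rep(df$ClientID, lengths(entries)),
  entry = unlist(entries),
  stringsAsFactors = FALSE
)

# Split each entry into its three "|"-separated fields
parts <- strsplit(long$entry, "|", fixed = TRUE)

# Take the k-th field of every entry and strip the "label:" prefix
field <- function(p, k) sub("^[^:]*:", "", vapply(p, `[`, character(1), k))

long$num   <- as.integer(sub(":.*$", "", vapply(parts, `[`, character(1), 1)))
long$kind  <- field(parts, 1)
long$Scrip <- field(parts, 2)
long$Call  <- field(parts, 3)

# Drop the placeholder entries (the text "NA", not a missing value)
tidy <- long[long$kind != "NA", c("ClientID", "num", "kind", "Scrip", "Call")]
rownames(tidy) <- NULL
tidy
```

With the sample data this leaves seven rows, one per real recommendation.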