I have a string in the following format:
a <- c("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire),
StartDate=2015-05-20, EndDate=2015-05-20, performance=best")
My goal is to end up with the following result in a data frame:
first_name  cust_id  start_date  end_date    performance  cust_notes
James(Mr)   98503    2015-05-20  2015-05-20  best         ZZW_LG,WGE,zonaire
I ran the following code:
a <- c("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire),
StartDate=2015-05-20, EndDate=2015-05-20, performance=best")
split_by_comma <- strsplit(a,",")
split_by_equal <- lapply(split_by_comma,strsplit,"=")
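Because cust_id contains extra commas and parentheses, I don't get the desired result.
Note that the parentheses in the first name are real and need to be kept as they are.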
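To make the problem concrete, here is a small check (a sketch, assuming the string sits on a single line; the name pieces is only illustrative). Splitting on every comma gives seven pieces instead of five, and two of them are fragments of the cust_id notes with no = in them:

```r
a <- paste0("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire), ",
            "StartDate=2015-05-20, EndDate=2015-05-20, performance=best")

# Naive split on every comma, then trim surrounding whitespace
pieces <- trimws(strsplit(a, ",")[[1]])
length(pieces)

# Pieces without "=" are fragments cut out of the parenthesised notes
pieces[!grepl("=", pieces, fixed = TRUE)]
```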
Answer 0: (score: 1)
You need to split on
,(?![^()]*\\))
This uses a negative lookahead, so a , that sits inside () is not treated as a split point. See the demo:
https://regex101.com/r/uF4oY4/82
To get the desired result, use
split_by_comma <- strsplit(a,",(?![^()]*\\))",perl=TRUE)
split_by_equal <- lapply(split_by_comma,strsplit,"=")
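The two lines above stop at a nested list. One way to carry the idea through to the data frame asked for in the question is sketched below in base R. This is only an illustration, not part of the answer: the names pairs, keys, values and df are made up here, and the string is assumed to sit on a single line.

```r
a <- paste0("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire), ",
            "StartDate=2015-05-20, EndDate=2015-05-20, performance=best")

# Split only on commas that are not inside parentheses, then trim whitespace
pairs <- trimws(strsplit(a, ",(?![^()]*\\))", perl = TRUE)[[1]])

# Everything before the first "=" is the key, everything after it the value
keys   <- sub("=.*$", "", pairs)
values <- sub("^[^=]*=", "", pairs)

# One-row data frame with one column per key
df <- as.data.frame(as.list(setNames(values, keys)), stringsAsFactors = FALSE)

# Move the parenthesised part of cust_id into its own column
df$cust_notes <- sub("^[^(]*\\((.*)\\)$", "\\1", df$cust_id)
df$cust_id    <- sub("\\(.*$", "", df$cust_id)

# Rename the date columns to match the layout requested in the question
names(df)[names(df) == "StartDate"] <- "start_date"
names(df)[names(df) == "EndDate"]   <- "end_date"
df
```

Only cust_id is touched by the last two substitutions, so first_name keeps its parentheses as James(Mr).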
Answer 1: (score: 0)
If your strings really do follow this format, this may be a quick solution:
library(httr)
a <- c("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire), StartDate=2015-05-20,
EndDate=2015-05-20, performance=best")
dat <- data.frame(parse_url(sprintf("?%s", gsub(",[[:space:]]+", "&", a)))$query,
stringsAsFactors=FALSE)
library(tidyr)
library(dplyr)
mutate(separate(dat, cust_id, into=c("cust_id", "cust_notes"), sep="\\("),
cust_notes=gsub("\\)", "", cust_notes))
## first_name cust_id cust_notes StartDate EndDate performance
## 1 James(Mr) 98503 ZZW_LG,WGE,zonaire 2015-05-20 2015-05-20 best
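Breaking it down:
gsub(",[[:space:]]+", "&", a) makes the parameters look like the components of a URL query string.
sprintf(…) makes it look like an actual query string.
parse_url (from httr) separates the key/value pairs and puts them in a list (named query) inside the list it returns.
data.frame turns that list into a one-row data frame.
separate splits the cust_id column at the ( into two columns, cust_id and cust_notes.
mutate removes the trailing ) from the new cust_notes column.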
The whole thing can also be written "piped", which matches the explanation above and is (IMO) easier to follow.
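A piped version might look like the sketch below. It is a reconstruction from the un-piped code earlier in this answer, not the answer's original block, and it assumes httr, tidyr and dplyr are installed. Using getElement("query") to pull the query list out of the parse_url result is a choice made for this sketch.

```r
library(httr)
library(tidyr)
library(dplyr)

a <- paste0("first_name=James(Mr), cust_id=98503(ZZW_LG,WGE,zonaire), ",
            "StartDate=2015-05-20, EndDate=2015-05-20, performance=best")

dat <- gsub(",[[:space:]]+", "&", a) %>%    # make it look like query parameters
  sprintf("?%s", .) %>%                     # prepend "?" to form a query string
  parse_url() %>%                           # let httr split the key/value pairs
  getElement("query") %>%                   # keep only the parsed query list
  data.frame(stringsAsFactors = FALSE) %>%  # one-row data frame
  separate(cust_id, into = c("cust_id", "cust_notes"), sep = "\\(") %>%
  mutate(cust_notes = gsub("\\)", "", cust_notes))

dat
```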
Answer 2: (score: 0)
A late reply, but I'm posting it because it is very simple and easy to understand, and needs no additional packages.
rawdf <- read.csv("<your file path>", header = FALSE, sep = ",",
                  strip.white = TRUE, stringsAsFactors = FALSE)
# Get the first row of the data frame and transpose it into a one-column df
first_row <- data.frame(t(rawdf[1, ]))
# Split each cell of that column into its key/value pair (separated by '=')
# and flatten the result into a single vector
key_values <- unlist(strsplit(as.character(first_row$X1), "="))
# Pick the odd-indexed elements of that vector (odd positions hold the
# column names, even positions the values associated with them)
col_names <- key_values[seq(1, length(key_values), 2)]
# Assign the extracted column names to the original data frame
colnames(rawdf) <- col_names
# Strip the leading 'key=' from every field by replacing it with an
# empty string, column by column
for (i in seq_len(ncol(rawdf))) {
  rawdf[, i] <- sub(paste0("^", col_names[i], "="), "", rawdf[, i])
}
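To try these steps without a file, the same idea can be run on in-memory text through the text argument of read.csv. This is a minimal sketch on made-up sample lines (txt, keys and the customer values are hypothetical). The sample deliberately has no commas inside parentheses, because read.csv with sep = "," would treat those as field separators too.

```r
# Hypothetical sample data: one record per line, key=value pairs separated by commas
txt <- c("first_name=James, cust_id=98503, performance=best",
         "first_name=Anna, cust_id=12345, performance=good")

rawdf <- read.csv(text = txt, header = FALSE, sep = ",",
                  strip.white = TRUE, stringsAsFactors = FALSE)

# Keys are the odd-indexed elements after splitting the first row on "="
key_values <- unlist(strsplit(as.character(unlist(rawdf[1, ])), "="))
keys <- key_values[seq(1, length(key_values), 2)]
colnames(rawdf) <- keys

# Remove the leading "key=" from every field
for (i in seq_len(ncol(rawdf))) {
  rawdf[, i] <- sub(paste0("^", keys[i], "="), "", rawdf[, i])
}
rawdf
```

All columns come back as character here; converting cust_id to a number or the dates to Date would be a separate step.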