In R

Time: 2015-07-20 09:57:27

Tags: r rstudio text-mining text-analysis

I have the following line:

    x<-"CUST_Id_8Name:Mr.Praveen KumarDOB:Mother's Name:Contact Num:Email address:Owns Car:Products held with Bank:Company Name:Salary per. month:Background:"

I want to extract "CUST_Id_8", "Mr.Praveen Kumar", and whatever is written after DOB:, Mother's Name:, Contact Num:, and so on, and store the values in variables such as Customer Id, Name, DOB, etc.

Please help.

I used

    strsplit(x, ":")

but the result is a list containing only the text. Where nothing follows a variable name, I need a blank.

Can anyone tell me how to extract the string between two words? For example, I want to extract "Mr.Praveen Kumar", which sits between Name: and DOB.

2 answers:

Answer 0 (score: 3)

You can use regexec and regmatches to pull the individual data items out as substrings. Here is a working example.

Sample data:

    txt <- c("CUST_Id_8Name:Mr.Praveen KumarDOB:Mother's Name:Contact Num:Email address:Owns Car:Products held with Bank:Company Name:Salary per. month:Background:",
             "CUST_Id_15Name:Mr.Joe JohnsonDOB:01/02/1973Mother's Name:BarbaraContact Num:0123 456789Email address:joe@joesville.comOwns Car:YesProducts held with Bank:Savings, CurrentCompany Name:Joes villeSalary per. month:$100000Background:shady")

The pattern to match:

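    # One capture group "(.*)" per field; a group matches "" when the field is empty
    pattern <- "CUST_Id_(.*)Name:(.*)DOB:(.*)Mother's Name:(.*)Contact Num:(.*)Email address:(.*)Owns Car:(.*)Products held with Bank:(.*)Company Name:(.*)Salary per. month:(.*)Background:(.*)"
    # Recover the field names by splitting the pattern on the "_(.*)" / ":(.*)" pieces
    var_names <- strsplit(pattern, "[:_]\\(\\.\\*\\)")[[1]]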

Run the match:

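    # regmatches() returns one character vector per input string; the first element
    # is the full match, so drop that column after binding the rows together
    data <- as.data.frame(do.call("rbind", regmatches(txt, regexec(pattern, txt))))[, -1]
    colnames(data) <- var_names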

Output:

    #  CUST_Id             Name        DOB Mother's Name Contact Num
    #1       8 Mr.Praveen Kumar                                     
    #2      15   Mr.Joe Johnson 01/02/1973       Barbara 0123 456789
    #      Email address Owns Car Products held with Bank Company Name
    #1                                                                
    #2 joe@joesville.com      Yes        Savings, Current   Joes ville
    #  Salary per. month Background
    #1                             
    #2           $100000      shady
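
For the narrower question of pulling out the text between two words, the same two functions can be used on a single pair of delimiters. The following is only a minimal sketch: the shortened string `s` is a hypothetical example, not the full record, and the names `m`, `parts`, `name`, and `dob` are introduced here for illustration.

```r
# Hypothetical shortened record containing just two fields
s <- "Name:Mr.Praveen KumarDOB:"

# One capture group for the text between "Name:" and "DOB:",
# and one for whatever follows "DOB:"
m <- regexec("Name:(.*)DOB:(.*)", s)
parts <- regmatches(s, m)[[1]]

# parts[1] is the full match; the remaining elements are the capture groups
name <- parts[2]
dob  <- parts[3]  # an empty string, because nothing follows "DOB:"
```

Because an empty capture group comes back as an empty string rather than being dropped, fields with no value still get a blank entry.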

Answer 1 (score: 2)

If you know the keys in advance, you can extract the values like this:

    keys <- c("CUST_Id_8Name", "DOB", "Mother's Name", 
      "Contact Num", "Email address", "Owns Car", "Products held with Bank", 
      "Company Name", "Salary per. month", "Background")
    cbind(keys, values = sub("^:", "", strsplit(x, paste0(keys, collapse = "|"))[[1]][-1]))
    #       keys                      values
    #  [1,] "CUST_Id_8Name"           "Mr.Praveen Kumar"
    #  [2,] "DOB"                     ""
    #  [3,] "Mother's Name"           ""
    #  [4,] "Contact Num"             ""
    #  [5,] "Email address"           ""
    #  [6,] "Owns Car"                ""
    #  [7,] "Products held with Bank" ""
    #  [8,] "Company Name"            ""
    #  [9,] "Salary per. month"       ""
    # [10,] "Background"              ""
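
Since the question asks for the values to be stored under names, one possible follow-up is to turn the result into a named list so each field can be looked up by its key. This is a sketch under the same assumption that the keys are known in advance; the names `values` and `record` are introduced here for illustration only.

```r
x <- "CUST_Id_8Name:Mr.Praveen KumarDOB:Mother's Name:Contact Num:Email address:Owns Car:Products held with Bank:Company Name:Salary per. month:Background:"

keys <- c("CUST_Id_8Name", "DOB", "Mother's Name",
          "Contact Num", "Email address", "Owns Car", "Products held with Bank",
          "Company Name", "Salary per. month", "Background")

# Split on any of the keys, drop the empty piece before the first key,
# then strip the leading ":" left at the start of each value
values <- sub("^:", "", strsplit(x, paste0(keys, collapse = "|"))[[1]][-1])

# Pair each value with its key so fields can be accessed by name
record <- setNames(as.list(values), keys)

record[["CUST_Id_8Name"]]
record[["DOB"]]
```

Note that strsplit treats the joined keys as a regular expression, so keys containing regex metacharacters would generally need escaping; here the only one is the "." in "Salary per. month", which simply matches any character.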