Text cleaning in R

Date: 2016-12-13 01:03:32

Tags: r regex

I have a csv file containing many entries like the one below (a sample is provided):

Customer 1 car purchase
08/22/2016 08:10:00 Agent 1 (Agt1)
Customer 1 car purchase and service purchase.\n
Service indicates tires needed\n
possible oil change as well.\n
Tire quote provided.\n
*Name: Service advisor \n
*Phone: 123-456-7890 \n
Customer 1 called back to schedule appt.\n

I'm trying to write R code that produces the following output (for each entry):

Customer 1 car purchase and service purchase.
Service indicates tires needed and possible oil change as well.
Tire quote provided.
Customer 1 called back to schedule appt.

I want to remove the first two lines and any line containing *Name or *Phone.

One thing I tried was assigning each entry to a temporary variable and then running:

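library(stringi)
x <- stri_split_lines(temp)        # list holding one character vector of lines per string
y <- x[[1]][3:length(x[[1]])]      # keep everything from the third line onward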

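This removes the first two lines. However, I don't know how to remove the lines with *Name and *Phone, since they can appear anywhere in the text. I also suspect there is a better way :) Any ideas on how to achieve this? The lines all end with \n, so I was hoping to split on that with a regular expression, but couldn't get it to work. Thanks!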

1 answer:

Answer 0: (score: 0)

You can use readLines or strsplit to read each entry (with lapply if necessary), then use grep to index:

x <- readLines(textConnection('Customer 1 car purchase
                               08/22/2016 08:10:00 Agent 1 (Agt1)
                               Customer 1 car purchase and service purchase.
                               Service indicates tires needed
                               possible oil change as well.
                               Tire quote provided.
                               *Name: Service advisor 
                               *Phone: 123-456-7890 
                               Customer 1 called back to schedule appt.'))

x <- trimws(x)    # clean up extra white space

x[c(-1, -2, -grep('\\*Name|\\*Phone', x))]
## [1] "Customer 1 car purchase and service purchase."
## [2] "Service indicates tires needed"               
## [3] "possible oil change as well."                 
## [4] "Tire quote provided."                         
## [5] "Customer 1 called back to schedule appt." 

If you like, use paste to join the lines back into a single block.
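
The steps above could be wrapped into one function and applied to many entries. The following is only a minimal sketch: `clean_entry` is a hypothetical helper name, the second entry is made-up data for illustration, and it assumes each entry is a single string whose lines are separated by real newline characters. It uses `grepl` with logical negation instead of negative `grep` indices, so an entry with no *Name or *Phone line is left intact.

```r
# Hypothetical helper (not from the original answer): clean one entry
# that is stored as a single string with embedded newlines.
clean_entry <- function(entry) {
  # Split into lines and strip surrounding white space
  lines <- trimws(strsplit(entry, "\n", fixed = TRUE)[[1]])
  # Drop the two header lines (title and timestamp/agent)
  lines <- lines[-(1:2)]
  # Drop any line starting with *Name or *Phone, wherever it occurs
  lines <- lines[!grepl("^\\*(Name|Phone)", lines)]
  # Join the remaining lines back into one block
  paste(lines, collapse = "\n")
}

# Illustrative input; the second entry is invented for this sketch
entries <- c(
  "Customer 1 car purchase\n08/22/2016 08:10:00 Agent 1 (Agt1)\nCustomer 1 car purchase and service purchase.\n*Name: Service advisor \nTire quote provided.",
  "Customer 2 car purchase\n08/23/2016 09:15:00 Agent 2 (Agt2)\nCustomer 2 asked about financing."
)

# vapply guarantees exactly one cleaned string per entry
cleaned <- vapply(entries, clean_entry, character(1), USE.NAMES = FALSE)
cat(cleaned, sep = "\n\n")
```

If the file actually contains a literal backslash followed by n rather than a real newline, the split pattern would need to change accordingly (for example `"\\n"` with `fixed = TRUE`).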