Question

I have a CSV file containing many entries like the following (one example shown):

Customer 1 car purchase
08/22/2016 08:10:00 Agent 1 (Agt1)
Customer 1 car purchase and service purchase.\n
Service indicates tires needed\n
possible oil change as well.\n
Tire quote provided.\n
*Name: Service advisor \n
*Phone: 123-456-7890 \n
Customer 1 called back to schedule appt.\n

I am trying to write R code that produces the following output for each entry:

Customer 1 car purchase and service purchase.
Service indicates tires needed and possible oil change as well.
Tire quote provided.
Customer 1 called back to schedule appt.

I want to remove the first two lines and any lines that contain *Name or *Phone.

One thing I tried was assigning each entry to a temporary variable and then running:

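library(stringi)                  # provides stri_split_lines()
x <- stri_split_lines(temp)       # list with one character vector per entry
y <- x[[1]][3:length(x[[1]])]     # keep everything from the third line onward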

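This strips out the first two lines. However, I don't know how to remove the lines with *Name and *Phone, since they could be anywhere in the text. I also suspect there is a better approach :) Any ideas on how to accomplish this? Every line ends with \n, so I was hoping to split on that with a regular expression, but I couldn't get it to work. Thanks!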

Answer 1

You can read each entry with readLines or strsplit (wrapped in lapply if necessary), then use grep to build the index:

x <- readLines(textConnection('Customer 1 car purchase
                               08/22/2016 08:10:00 Agent 1 (Agt1)
                               Customer 1 car purchase and service purchase.
                               Service indicates tires needed
                               possible oil change as well.
                               Tire quote provided.
                               *Name: Service advisor 
                               *Phone: 123-456-7890 
                               Customer 1 called back to schedule appt.'))

x <- trimws(x)    # clean up extra white space

x[c(-1, -2, -grep('\\*Name|\\*Phone', x))]
## [1] "Customer 1 car purchase and service purchase."
## [2] "Service indicates tires needed"               
## [3] "possible oil change as well."                 
## [4] "Tire quote provided."                         
## [5] "Customer 1 called back to schedule appt."

If you like, use paste with collapse = "\n" to join the result back into a single block.
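To apply the same idea to many entries, the steps can be wrapped in a small function. The following is only a sketch: `clean_entry` and the `entries` vector are hypothetical names, the second sample entry is made up for illustration, and it assumes each entry is a single string containing real newline characters. If your CSV holds a literal backslash followed by n instead, split on "\\n" with fixed = TRUE. It uses grepl rather than negative grep indices, so an entry with no *Name or *Phone lines is handled without special cases.

```r
# Hypothetical helper: clean one entry supplied as a single string.
clean_entry <- function(entry) {
  # Split on newlines and trim surrounding white space from each line
  lines <- trimws(strsplit(entry, "\n", fixed = TRUE)[[1]])
  # Drop the two header lines (title and timestamp/agent)
  lines <- lines[-(1:2)]
  # Drop lines that start with *Name or *Phone
  lines <- lines[!grepl("^\\*(Name|Phone)", lines)]
  # Join the remaining lines back into one block
  paste(lines, collapse = "\n")
}

# Illustrative input; the second entry is invented and has no *Name/*Phone lines
entries <- c(
  paste0("Customer 1 car purchase\n",
         "08/22/2016 08:10:00 Agent 1 (Agt1)\n",
         "Customer 1 car purchase and service purchase.\n",
         "*Name: Service advisor \n",
         "*Phone: 123-456-7890 \n",
         "Customer 1 called back to schedule appt.\n"),
  paste0("Customer 2 inquiry\n",
         "08/23/2016 09:00:00 Agent 2 (Agt2)\n",
         "Customer 2 asked about financing.\n")
)

cleaned <- vapply(entries, clean_entry, character(1), USE.NAMES = FALSE)
cat(cleaned, sep = "\n\n")
```

vapply returns one cleaned string per entry; use lapply with the paste step removed if you would rather keep each entry as a vector of lines.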
