Question

I am trying to use gsub() to clean a text dataset in CSV format. A sample row of my data looks like this:

"5.0\t/gp/customer-reviews/R3M62HO4M6LXE6?ASIN=0439023521\tEngaging. Brutal but engaging!\t\"Wow.  I was barely able to put this book down for a second after the first few pages got me completely hooked.

I want to remove the leading string, which carries no useful content, and remove every \t\ or \t, so that I get the expected result:

"Engaging.  Brutal but engaging!"Wow.  I was barely able to put this book down for a second after the first few pages got me completely hooked.

I tried using

gsub('\\t\\', "", comment, fix=TRUE)

to remove the \t\, but it did not work. The leading string is also complex enough that I am having trouble writing a correct pattern for it.

Answer 1

We can try

SELECT *
FROM (
  SELECT rank() OVER (ORDER BY x) AS dr, x
  FROM (
    SELECT
      trunc(random()*1000) AS x
    FROM generate_series(1,100)
  ) AS t
) AS t
WHERE dr BETWEEN 80-10 AND 80+10;

 dr |  x  
----+-----
 70 | 702
 71 | 706
 72 | 718
 73 | 734
 74 | 751
 75 | 756
 76 | 774
 77 | 778
 78 | 805
 79 | 813
 80 | 829
 81 | 833
 82 | 839
 83 | 852
 84 | 853
 85 | 872
 86 | 884
 86 | 884
 88 | 892
 89 | 897
 90 | 905
(21 rows)

Answer 2

If you want to use the stringr library:

library(stringr)
str_replace(val,".*\\t(?=[:alnum:])","")

Using gsub:

gsub(".*\\t(?=[a-zA-Z0-9])", "", val,perl=T)

or gsub(".*\\t(?=[[:alnum:]])", "", val,perl=T)

Output:

 > str_replace(val,".*\\t(?=[:alnum:])","")
[1] "Engaging. Brutal but engaging!\t\"Wow.  I was barely able to put this book down for a second after the first few pages got me completely hooked."
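
The output above still contains the tab before the quoted sentence, so a second pass is needed to reach the result the question asks for. The sketch below is one possible way to do it, not a definitive solution. It assumes the data holds real tab characters (which is how R reads "\t" inside a string literal) rather than a literal backslash followed by "t". That assumption would also explain why the attempt in the question matched nothing. The sample value is shortened from the question, and the names unchanged, no_prefix and cleaned are made up for illustration.

```r
# Sample value modeled on the question. Inside an R string literal, "\t" is
# a real tab character and "\"" is a real double quote (assumption: the
# actual data was read in the same way).
val <- "5.0\t/gp/customer-reviews/R3M62HO4M6LXE6?ASIN=0439023521\tEngaging. Brutal but engaging!\t\"Wow.  I was barely able to put this book down."

# With fixed = TRUE, the pattern "\\t\\" means a literal backslash, the
# letter t, and another backslash. The data has no backslashes, so nothing
# is replaced.
unchanged <- gsub("\\t\\", "", val, fixed = TRUE)
identical(unchanged, val)  # TRUE

# Step 1: drop everything up to the last tab that is directly followed by
# a letter or digit (the same pattern as in the answer above).
no_prefix <- gsub(".*\\t(?=[[:alnum:]])", "", val, perl = TRUE)

# Step 2: remove the remaining tab characters and keep the quote.
cleaned <- gsub("\t", "", no_prefix, fixed = TRUE)
cleaned
```

Doing this in two steps keeps each pattern simple. The lookahead only decides where the prefix ends, and the fixed-string replacement removes whatever tabs are left without touching the quotation mark.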

How to remove a complex pattern using gsub
