Removing strings with a regular expression produces a special character: â

Time: 2015-11-06 16:42:51

Tags: r

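Short version:

After using regular expressions to remove URLs and whitespace, I have a lot of .txt files that contain unwanted â characters dotted all over the place. I need to remove all of them from all of the files.

These â characters did not exist before I cleaned the files; they are a product of the cleaning.

Long version:

I found a regex that works for my text and removes the URLs. First, my cleaning process (the commented-out lines are other things I have tried):

library(magrittr)  # provides %>%
library(tm)        # provides stripWhitespace()

clean_file <- sapply(curr_file, function(x) {
    gsub("&amp;", "&", x) %>%                            # decode HTML ampersands
        gsub("http\\S+\\s*", "", .) %>%                  # drop URLs and the whitespace after them
        gsub("[^[:alpha:][:space:]&']", "", .) %>%       # keep only letters, whitespace, & and '
        #gsub("[^[:alnum:][:space:]\\'-]", "", .) %>%
        stripWhitespace() %>%                            # collapse runs of whitespace
        gsub("^ ", "", .) %>%                            # trim a leading space
        gsub(" $", "", .)                                # trim a trailing space
        #gsub("â", "", .)
})

Sample input text (each line is a string):

Gluskin’s Rosenberg: Don’t Bet on a Bear Market for Treasurys - Rising Treasury yields?... http://j.mp/UVM31t  #FederalReserve
Jacquiline Chabolla liked Capital Preservation In a Secular Bear Market: Large investment asset losses can be… http://goo.gl/fb/cgzGv
Thank You http://pages.townhall.com/campaign/will-2013-be-a-bull-or-bear-market … via @townhallcom
Calif. GHG cap-and-trade: a bull or a bear market? http://bit.ly/VG9DTr

Unfortunately it does not show up here, but the text above also contains some non-standard characters, namely \302. R can see them:

> x = _ <-- appears as an underscore in my text editor
Error: object '\302' not found

They may come from shift+space, as hinted here, but they are an artefact of my data, so I need to remove them; I cannot prevent them.

Resulting output (as seen in the saved .txt file):

Gluskinâs Rosenberg Donât Bet on a Bear Market for Treasurys - Rising Treasury yields FederalReserve
Jacquiline Chabolla liked Capital Preservation In a Secular Bear Market Large investment asset losses can beâ
Thank You â via townhallcom
Calif GHG cap-and-trade a bull or a bear market

Output as seen in the R console:

> head(clean_file)
      ..text
[1,] "Nice bear market rally for the Lakers NBA"
[2,] "Commented on StockTwits your scenario is entirely possible and as long as SPX doesn't exceed the bear market"
[3,] "Gluskin\342s Rosenberg Don\342t Bet on a Bear Market for Treasurys Rising Treasury yields FederalReserve"
[4,] "Jacquiline Chabolla liked Capital Preservation In a Secular Bear Market Large investment asset losses can be\342"
[5,] "Thank You \342 via townhallcom"
[6,] "Calif GHG capandtrade a bull or a bear market"

Before I came to see this as an encoding problem, simply replacing the â character with an empty string failed:

gsub("â", "", myText)

I also tried some other solutions for changing the encoding of the files (found in the solutions here). I tried forcing the output to ascii when writing the file, rather than the default utf-8 (I believe), but ascii just gave me warnings and truncated many lines, leaving some completely empty. There seemed to be no correlation between the removed lines and where the â characters had previously appeared.

Can I prevent these characters from being created when I write files in future?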

1 answer:

Answer 0 (score: 5)

This keeps only the characters from hex 0 to hex 7f, where Lines is a character vector whose components are the lines of the file:

gsub("[^\\x{00}-\\x{7f}]", "", Lines, perl = TRUE)
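As a rough illustration, the sketch below applies that pattern to shortened versions of the sample lines from the question. The sample strings, the \u escapes and the names quote_bytes and ascii_only are assumptions made for this example, not part of the original post. It also prints the bytes of the curly apostrophe, which suggests one plausible reason for the stray â: assuming the files are UTF-8 and the character-class regex ran byte by byte, only the first byte of each three-byte character (octal \342, which is â in Latin-1) would have survived.

```r
# Illustrative sample lines; \u escapes keep this script itself pure ASCII.
Lines <- c(
  "Gluskin\u2019s Rosenberg: Don\u2019t Bet on a Bear Market",
  "Large investment asset losses can be\u2026",
  "Calif. GHG cap-and-trade: a bull or a bear market?"
)

# U+2019 (the curly apostrophe) is stored as the UTF-8 bytes e2 80 99.
# Byte e2 is octal \342, which reads as "â" when interpreted as Latin-1.
quote_bytes <- charToRaw("\u2019")
print(quote_bytes)

# The pattern from the answer: delete every character outside hex 00-7f.
ascii_only <- gsub("[^\\x{00}-\\x{7f}]", "", Lines, perl = TRUE)
print(ascii_only)
```

Running this step before the other gsub() calls in the cleaning pipeline, rather than after them, would remove the multi-byte characters whole instead of leaving a leading byte behind; that ordering is a suggestion here and is not stated in the answer. Note that it also drops the apostrophe itself, so Don’t becomes Dont.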