R - Unable to read a file containing the control character [SUB]

Time: 2016-03-05 20:14:58

Tags: regex r ascii invisible

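I have run into this problem before, but my earlier solution does not fix it this time.

In my text data, when I turn on "show all characters" in Notepad++, a character displayed as [SUB] appears.

Last time, I removed these with...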

## Read the file in as Binary
r = readBin( curFile, raw(), file.info(curFile)$size)

## Convert the pesky characters
if ((r[1]==as.raw(0x1a)))
{
    ## Find it
    spot = which(r == as.raw(0x1a) )
    r[r == as.raw(0x1a)] = as.raw(0x20)
} 

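However, this does not work. It seems that every time I manage to get rid of one invisible character, another one causes problems within a week. Is there a way to effectively "clean" a file of all invisible control characters, other than the newlines that separate my data entries?

Please let me know. This has become maddening.

Thanks!

I can produce a cut-down CSV file for you. It is the second row, fourth column, that causes the crash.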

http://www.megafileupload.com/6ead/stackOverflow.csv

The full code I use to do this is shown below...

library(stringr)
############# DO THIS FIRST 
folder = "C:\\Twitter_TimeSeries\\Bernie_Practice\\"

## Get the file name of every file in the directory 
file.names = dir(folder, pattern=".csv")

## Figure out how many files there are
numFiles = length(file.names)

## Loop through every file 
for( i in 1:length(file.names))
{
    ## Which file are we on?
    curFile = paste( folder, file.names[i], sep="" )

    ## Read the file in as Binary
    r = readBin( curFile, raw(), file.info(curFile)$size)

    ## Convert the pesky characters
    if ((r[1]==as.raw(0x1a)))
    {
        ## Find it
        spot = which(r == as.raw(0x1a) )
        r[r == as.raw(0x1a)] = as.raw(0x20)
    } 
    if ((r[1]==as.raw(0x0a))) {
        ## Find it
        spot = which(r == as.raw(0x0a) )
        r[r == as.raw(0x1a)] = as.raw(0x20)
    } ## If 
    ## Re-write the file
    writeBin(r, curFile)
} ## For

curFile = "stackOverflow.csv"
rawData = read.csv(curFile, stringsAsFactors=FALSE)
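One detail of the loop above may explain why the replacement never seems to happen: the guard `r[1] == as.raw(0x1a)` inspects only the first byte of the file, so a [SUB] further in (such as the one in the second row, fourth column) is never reached. A minimal sketch on a made-up three-byte vector; the names `first_is_sub` and `has_sub` are illustrative only and are not part of the original code:

```r
## Toy byte vector: "a", SUB (0x1a), "b". The SUB is not the first byte.
r = as.raw(c(0x61, 0x1a, 0x62))

## The guard used in the loop looks at the first byte only
first_is_sub = r[1] == as.raw(0x1a)

## Testing the whole vector is what actually detects the SUB
has_sub = any(r == as.raw(0x1a))

## The replacement needs no guard: it is a no-op when nothing matches
r[r == as.raw(0x1a)] = as.raw(0x20)
```

In the second `if` block of the loop, the test is for 0x0a while the replacement still targets 0x1a, so that branch behaves the same way as the first one whenever it runs.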

1 Answer:

Answer 0 (score: 0)

Try using a regular expression to restrict the data to allowed characters only.

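x = read.csv("foo.csv", colClasses="character")
## Keep just numbers and '.' in each column (calling gsub on the whole data frame would collapse every column into a single string)
x[] = lapply(x, function(col) gsub("[^0-9.]", "", col))
## Assuming your file really represents numeric data
x[] = lapply(x, as.numeric)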
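If the file cannot be read as text in the first place, the same "allow list" idea can be applied to the raw bytes before `read.csv` is called. The following is only a sketch under some assumptions: `clean_control_bytes` is a hypothetical helper (not an R built-in), the file is in a single-byte or UTF-8 encoding, and tab, line feed and carriage return are the only control bytes worth keeping.

```r
## Hypothetical helper: replace every control byte (0x00-0x1f and 0x7f)
## with a space, except the bytes listed in `keep`.
clean_control_bytes = function(r, keep = c(0x09, 0x0a, 0x0d))
{
    b = as.integer(r)
    ctrl = (b < 0x20 | b == 0x7f) & !(b %in% keep)
    r[ctrl] = as.raw(0x20)
    r
}

## In-memory demonstration: "a", SUB, "b", LF, NUL, "c"
demo = as.raw(c(0x61, 0x1a, 0x62, 0x0a, 0x00, 0x63))
cleaned = clean_control_bytes(demo)
rawToChar(cleaned)    # "a b\n c"
```

On a real file this would sit between the `readBin` and `writeBin` calls from the question, e.g. `writeBin(clean_control_bytes(r), curFile)`. Replacing with a space rather than deleting keeps the file the same length, which matches what the question's code does. Bytes at or above 0x80 are left alone, so UTF-8 multi-byte characters survive; a UTF-16 file would not, since it contains NUL bytes as part of ordinary characters.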