Trying to read a file. Notepad++ shows a special character "SUB" in it, and it breaks my read

Time: 2016-02-09 05:56:20

Tags: r file unicode encoding character-encoding

Edit - https://en.wikipedia.org/wiki/Substitute_character

This is the character causing my problem. It appears several times in the file.

The line in my .CSV file is

"Bernie+Sanders","3377900000","3377929757",";5:A59 0940;8=","kajdalin","96","155","",0,0,"Thu Dec 31 13:37:36 +0000 2015","RT @realDonaldTrump: ""@deggow: Just heard a 25 year old man say ""I would rather work for Donald Trump than Bernie Sanders""it's time for me &","6.8256e+17","682556208335695872","<a href=""http://twitter.com"" rel=""nofollow"">Twitter Web Client</a>",NA,NA,NA,NA,NA,"6.8251e+17","682514849579057152","Thu Dec 31 10:53:15 +0000 2015","Donald J. Trump","realDonaldTrump","New York, NY","5496400","51","<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>"

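The read breaks after ;5:A59

Looking at my data, this row comes from a Twitter user whose username is in Arabic. That garbage, ";5:A59 0940;8=", is how the Arabic text ended up when I saved what the API returned to a file.

In Notepad++, the CSV displays the garbage differently. It shows special invisible characters, which I will write in [square brackets].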

"[DLE];5:A59 [SUB]0940;8="

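To confirm where these control characters sit, the file can be scanned as raw bytes. The following is a minimal sketch, not something from the original post: `find_control_bytes` is a hypothetical helper name, and it assumes the characters Notepad++ labels [DLE] and [SUB] are the single bytes 0x10 and 0x1A.

```r
# Hypothetical helper (not from the original post): list where control bytes
# such as DLE (0x10) and SUB (0x1A) occur in a file.
find_control_bytes <- function(path, codes = c(0x10, 0x1A)) {
  # Read the whole file as raw bytes, so no text-mode handling is involved.
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  values <- as.integer(bytes)
  hits <- which(values %in% as.integer(codes))
  # Count the newlines (byte 10) before each hit to get a 1-based line number.
  newline_count <- cumsum(values == 10L)
  data.frame(
    offset = hits,
    line = newline_count[hits] + 1L,
    code = sprintf("0x%02X", values[hits]),
    stringsAsFactors = FALSE
  )
}
```

Calling `find_control_bytes("path.csv")` returns one row per match. The `line` column counts physical lines including the header, so it may differ from the data frame row number by one or more.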
How I read the file:

rawData = read.csv("path.csv", header=TRUE,encoding="UTF-8")

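If it helps, I am using R 3.2.3 on Windows 10.

How can I avoid this?

To clarify, the .csv has about 80,000 rows. This happens at row 4908. Forgive the formatting below; I know things do not line up perfectly.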

queryTarget     userID userID_str    userName userScreen_name userFollowers_count userFriends_count userLocation latitude longitude
Bernie+Sanders 3377900000 3377929757 \020;5:A59                                   NA                NA                    NA        NA
 tweetSendTime tweetText tweetID tweetID_str tweetSource replyStatusID replyStatusID_str replyUserID replyUserID_str replyScreenName
NA          NA                        NA                NA          NA              NA                
 retweetedStatusID retweetedStatusID_str retweetedStatus_createdAt retweetedStatus_userName retweetedStatus_screenName
NA                    NA                                                                              
 retweetedStatusLocation retweetedStatusFollowers retweetedStatusFriends retweetedStatusSource
NA                     NA        

The error I get is:

Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
EOF within quoted string
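
One plausible cause, offered as an assumption rather than a confirmed diagnosis: on Windows, files opened in text mode can treat the SUB byte (0x1A, Ctrl-Z) as end of file, which would cut the read off inside the quoted userName field and produce this warning. A minimal sketch of a workaround is to read the file as raw bytes, drop the SUB bytes, and parse the cleaned copy. `read_csv_without_sub` is a hypothetical helper name, and removing the bytes changes the already garbled userName values.

```r
# Hypothetical helper (not from the original post): strip SUB (0x1A) bytes
# before handing the data to read.csv.
read_csv_without_sub <- function(path, ...) {
  # readBin reads the file in binary mode, so 0x1A is just another byte.
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  cleaned <- bytes[bytes != as.raw(0x1A)]

  # Write the cleaned bytes to a temporary file and parse that instead.
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  writeBin(cleaned, tmp)

  # Extra arguments (header, encoding, ...) are passed through to read.csv.
  read.csv(tmp, ...)
}
```

With the call from the question, this would be used as `rawData <- read_csv_without_sub("path.csv", header = TRUE, encoding = "UTF-8")`. If the row count still falls short of the roughly 80,000 expected, something other than SUB is probably involved.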

0 Answers:

No answers