Edit - https://en.wikipedia.org/wiki/Substitute_character
This is the character causing my problem. It appears several times in the file.
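To check how often these control bytes occur without relying on an editor, something like the following could be used. It is only a sketch: `count_control_bytes` is a hypothetical helper name, and it assumes the file is small enough to read into memory as raw bytes.

```r
# Hypothetical helper: count DLE (0x10) and SUB (0x1A) bytes in a file.
count_control_bytes <- function(path) {
  # readBin() on a file name reads in binary mode, so no byte is
  # interpreted as end-of-file.
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  c(DLE = sum(bytes == as.raw(0x10)),
    SUB = sum(bytes == as.raw(0x1a)))
}
```

Calling `count_control_bytes("path.csv")` would return a named vector with the two counts.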
The line in my .CSV file is
"Bernie+Sanders","3377900000","3377929757",";5:A59 0940;8=","kajdalin","96","155","",0,0,"Thu Dec 31 13:37:36 +0000 2015","RT @realDonaldTrump: ""@deggow: Just heard a 25 year old man say ""I would rather work for Donald Trump than Bernie Sanders""it's time for me &","6.8256e+17","682556208335695872","<a href=""http://twitter.com"" rel=""nofollow"">Twitter Web Client</a>",NA,NA,NA,NA,NA,"6.8251e+17","682514849579057152","Thu Dec 31 10:53:15 +0000 2015","Donald J. Trump","realDonaldTrump","New York, NY","5496400","51","<a href=""http://twitter.com/download/android"" rel=""nofollow"">Twitter for Android</a>"
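Reading breaks right after ;5:A59
Looking at my data, this comes from a Twitter user whose username is in Arabic. The garbage, ";5:A59 0940;8=", is what the API output turned into when I saved the Arabic text to the file.
In my Notepad++, the CSV shows the garbage differently. It displays it with special invisible characters, which I will show in [square brackets].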
"[DLE];5:A59 [SUB]0940;8="
How I read the file:
rawData = read.csv("path.csv", header=TRUE,encoding="UTF-8")
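If it helps, I am using R 3.2.3 on Windows 10.
How can I avoid this?
To clarify, the .csv has about 80,000 lines. This happens at line 4908. Forgive the formatting below, I know things don't line up perfectly.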
queryTarget userID userID_str userName userScreen_name userFollowers_count userFriends_count userLocation latitude longitude
Bernie+Sanders 3377900000 3377929757 \020;5:A59 NA NA NA NA
tweetSendTime tweetText tweetID tweetID_str tweetSource replyStatusID replyStatusID_str replyUserID replyUserID_str replyScreenName
NA NA NA NA NA NA
retweetedStatusID retweetedStatusID_str retweetedStatus_createdAt retweetedStatus_userName retweetedStatus_screenName
NA NA
retweetedStatusLocation retweetedStatusFollowers retweetedStatusFriends retweetedStatusSource
NA NA
The error I get is:
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
EOF within quoted string
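One possible workaround, sketched under the assumption that the truncation is caused by the SUB byte (0x1A) being treated as end-of-file when the file is read in text mode on Windows, is to read the file as raw bytes, drop the SUB bytes, and parse the rest. `read_csv_without_sub` is a hypothetical helper name, not an existing R function, and the approach assumes the file fits in memory and contains no NUL bytes.

```r
# Hypothetical helper: read a CSV after removing SUB (0x1A) bytes.
read_csv_without_sub <- function(path, ...) {
  # Read the whole file in binary mode so 0x1A cannot end the read early.
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  # Drop every SUB byte; all other bytes are left untouched.
  cleaned <- bytes[bytes != as.raw(0x1a)]
  # Parse the cleaned bytes as CSV text; extra arguments go to read.csv().
  read.csv(text = rawToChar(cleaned), ...)
}

# Usage, mirroring the original call:
# rawData <- read_csv_without_sub("path.csv", header = TRUE, encoding = "UTF-8")
```

This only removes the SUB bytes. The user name would still be garbled, since the Arabic text was already damaged when the file was written.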