EOF within quoted string

Time: 2015-01-01 11:58:43

Tags: r twitter

I have a file with about 350,000 lines. I am using

S.longtermtest="COKE32500.csv"
x=read.csv(S.longtermtest, row.names=NULL)

to read it into x.

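The data is not properly specified/formatted. In the example below, the record with id = "1020608401" contains extra quotation marks, so read.csv does not handle it the way I need. Is there a way to read this kind of data and simply skip a record whenever it has more than 7 fields?

The 7 field names are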

"Tweet ID","Date","Username","Text","Location","Followings","Followers"

Here is a sample of the data:

"Tweet ID","Date","Username","Text","Location","Followings","Followers"
"1020580305","09/26/2012 17:01:00","LoudpackMariee","RT @marvelousbby: @LaLaaKors no you just say too much on twitter","Niggas wit coke","0","0"
"1020591756","09/26/2012 17:22:46","Sofia_Torrez","I must give up Coke. @AnahiMarcial http://t.co/O9QyXGjg","San Diego, California","0","0"
"1020594942","09/26/2012 17:24:22","PaterGH13","Con mi paisana y amiga @MireiaCarrillo  de GH 12 tomando una coke en los madriles) un gustazo verte!! Muuuuak http://t.co/V2QFqi08","Barcelona, España","0","0"
"1020595525","09/26/2012 17:24:40","WP2_DotCom","RT @UNADining: Stop by the GUC today and enter the drawing for a chance to win a coke bean bag http://t.co/URCtK1p9","Florence, Alabama","0","0"
"1020600160","09/26/2012 17:42:30","RadicalWizard_","I always block accounts like this, but this has got to be the best spam bot bio I've ever encountered. Str8 2 the point http://t.co/ICOD3zuI","Sniffing lines of coke","0","0"
"1020600792","09/26/2012 17:42:49","LexDLutor","coke boyz _ instru (prod by Lex Lut'Or) by @LexDLutor via #soundcloud http://t.co/5WW5i1uT","Bruxelles Belgique","0","0"
"1020602605","09/26/2012 17:43:45","LoudpackMariee","","RT @PardonMyLips: Forreal. RT @intoxicatedBia: You hip RT “@LoudpackMariee: This bitch tells her entire life story on twitter. ENiggas wit coke","0","0"
"1020608358","09/26/2012 17:46:40","SimplySophie","RT @BreezysLullaby: i asked for a coke not strawberry milkshake @mcdonalds!!!! http://t.co/DwC8l6iZ","Liverpool","0","0"
"1020608401","09/26/2012 17:46:41","danielzol4nski","RT @heymarkey: "@wezhopwood: A Cinema was robbed last night of £754, thieves took a bag of malteasers, a pick n mix and a large coke." h ...","Krim+Azealia's block list (UK)","0","0"
"1020644783","09/26/2012 19:00:54","klierlyshirley","Cheery coke? LOL http://t.co/bGMg2cAV","the hundred acre woods.","0","0"
"1020660546","09/26/2012 19:24:50","midsummerfrenzy","Cherry Coke is the best. http://t.co/7xzlvkOe","Tulsa, OK, USA","0","0"

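2 answers:

Answer 0 (score: 0)

If you want to skip the lines with more than 7 fields, you can keep only the lines that contain exactly six commas: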

library(stringr)
# Read the raw lines, then keep only those with exactly six commas
lines <- readLines('chaotic.txt')
dat1 <- read.table(text=lines[str_count(lines, ',')==6],header=TRUE,sep=",")
head(dat1,2)
#    Tweet.ID                Date       Username
#1 1020580305 09/26/2012 17:01:00 LoudpackMariee
#2 1020600792 09/26/2012 17:42:49      LexDLutor
#                                                                                        Text
#1                          RT @marvelousbby: @LaLaaKors no you just say too much  on twitter
#2 coke boyz _ instru (prod by Lex Lut'Or) by @LexDLutor via #soundcloud http://t.co/5WW5i1uT
#            Location Followings Followers
#1    Niggas wit coke          0         0
#2 Bruxelles Belgique          0         0
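
One caveat with counting raw commas: a well-formed record whose text or location contains a comma has more than six commas, so it is dropped as well. That is why the second row of the output above is LexDLutor rather than Sofia_Torrez. A minimal base-R sketch on three lines taken from the question's sample (the names `sample_lines`, `n_commas` and `keep` are only for illustration):

```r
# Header plus two well-formed records from the sample data in the question.
sample_lines <- c(
  '"Tweet ID","Date","Username","Text","Location","Followings","Followers"',
  '"1020591756","09/26/2012 17:22:46","Sofia_Torrez","I must give up Coke. @AnahiMarcial http://t.co/O9QyXGjg","San Diego, California","0","0"',
  '"1020600792","09/26/2012 17:42:49","LexDLutor","coke boyz _ instru (prod by Lex Lut\'Or) by @LexDLutor via #soundcloud http://t.co/5WW5i1uT","Bruxelles Belgique","0","0"'
)

# Count the commas on each line (a base-R stand-in for str_count(lines, ',')).
n_commas <- lengths(regmatches(sample_lines, gregexpr(",", sample_lines, fixed = TRUE)))

# TRUE where the line has exactly six commas; the Sofia_Torrez record has
# seven because of "San Diego, California", so the filter discards it.
keep <- n_commas == 6
print(data.frame(n_commas, keep))
```

If losing such records matters, the filter needs to look at the field delimiter rather than at every comma.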

Update

Based on the new dataset COKE.csv

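 library(stringi)
 lines <- readLines('COKE.csv')
 lines1 <- lines[stri_count_fixed(lines, ',')==6]
 # quote=NULL turns off quote handling; comment.char='' stops '#' in a tweet from starting a comment
 dat <- read.table(text= lines1, header=TRUE, sep=",", quote=NULL, comment.char='')
 dim(dat)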
 #[1] 229012      7
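
Note that with quote handling switched off, the double quotes are read as part of the column names and values. A minimal sketch of stripping them afterwards, assuming every field is wrapped in double quotes as in the sample (the `strip_quotes` helper and the two-line input are hypothetical, and `quote = ""` is used here as the documented way to disable quoting):

```r
# Header plus one record from the question's sample, standing in for lines1.
lines1 <- c(
  '"Tweet ID","Date","Username","Text","Location","Followings","Followers"',
  '"1020608358","09/26/2012 17:46:40","SimplySophie","RT @BreezysLullaby: i asked for a coke not strawberry milkshake @mcdonalds!!!! http://t.co/DwC8l6iZ","Liverpool","0","0"'
)

# quote = "" disables quoting, so the quote characters are read as data.
# check.names = FALSE keeps the header text unchanged so it can be cleaned below.
dat <- read.table(text = lines1, header = TRUE, sep = ",", quote = "",
                  comment.char = "", check.names = FALSE,
                  stringsAsFactors = FALSE)

# Hypothetical helper: drop one leading and one trailing double quote.
strip_quotes <- function(x) gsub('^"|"$', "", x)

names(dat) <- strip_quotes(names(dat))
dat[] <- lapply(dat, strip_quotes)

# Turn the count columns back into integers.
dat$Followings <- as.integer(dat$Followings)
dat$Followers <- as.integer(dat$Followers)
str(dat)
```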

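Answer 1 (score: 0)

The problem is that some of the tweets are "tweet quotes", and whatever generated the CSV did not handle them properly. This code:

  • reads the file in as a series of lines
  • splits each line wherever it sees ", (a double quote followed by a comma); admittedly, this could go wrong on some tweets, but it works for this example
  • takes each split column and removes the surrounding quotation marks (i.e. unquotes it)
  • puts it all back into a data.frame (well, a data.table, but that is nearly the same)

So it handles the tweet quotes without having to discard them.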

tmp <- readLines("~/Desktop/tweet.csv")

dat <- data.table::rbindlist(lapply(strsplit(tmp[2:length(tmp)], '",'), function(x) {
  lapply(x, function(y) {
    gsub('(^"|"$)', "", y)
  })
}))

data.table::setnames(dat, colnames(dat), unlist(strsplit(tmp[1], ",")))

dplyr::glimpse(dat)

## Variables:
## $ "Tweet ID"   (chr) "1020580305", "1020591756", "1020594942", "1020595525", "1020600160", "...
## $ "Date"       (chr) "09/26/2012 17:01:00", "09/26/2012 17:22:46", "09/26/2012 17:24:22", "0...
## $ "Username"   (chr) "LoudpackMariee", "Sofia_Torrez", "PaterGH13", "WP2_DotCom", "RadicalWi...
## $ "Text"       (chr) "RT @marvelousbby: @LaLaaKors no you just say too much on twitter", "I ...
## $ "Location"   (chr) "Niggas wit coke", "San Diego, California", "Barcelona, Espa\303\261a",...
## $ "Followings" (chr) "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"
## $ "Followers"  (chr) "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"
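
As a variant of the same idea, the split can be done on the full delimiter "," after removing the outer quotes, which is slightly stricter than splitting on ", alone and also leaves the column names unquoted. A minimal base-R sketch, assuming every field is quoted, no tweet contains the sequence "," itself, and the last field is never empty (the `parse_line` helper and the inline sample are illustrative, not part of the code above):

```r
# Header plus the record with the embedded tweet quote, from the question's sample.
tmp <- c(
  '"Tweet ID","Date","Username","Text","Location","Followings","Followers"',
  '"1020608401","09/26/2012 17:46:41","danielzol4nski","RT @heymarkey: "@wezhopwood: A Cinema was robbed last night of £754, thieves took a bag of malteasers, a pick n mix and a large coke." h ...","Krim+Azealia\'s block list (UK)","0","0"'
)

# Hypothetical helper: strip the outer quotes, then split on the "," delimiter.
# strsplit() drops a trailing empty piece, hence the non-empty last field assumption.
parse_line <- function(line) {
  inner <- sub('^"', "", sub('"$', "", line))
  strsplit(inner, '","', fixed = TRUE)[[1]]
}

fields <- lapply(tmp, parse_line)

# Keep only the records that produced as many fields as the header did.
ok <- lengths(fields) == length(fields[[1]])
dat <- as.data.frame(do.call(rbind, fields[ok][-1]), stringsAsFactors = FALSE)
names(dat) <- fields[[1]]
str(dat)
```

Records that do not split into seven pieces are left out by the `ok` filter, so they can be inspected separately instead of silently shifting columns.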