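Preserving the column structure after merging csv files

Time: 2012-06-15 17:28:14

Tags: r twitter csv merge

I am merging nearly 3,000 csv files, removing duplicate rows, and writing a new csv data file. To do this, I use the following code: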

#Grab our list of filenames
filenames <- list.files(path = ".", pattern='^.*\\.csv$')
#Read.csv function
my.read.csv <- function(fnam) { read.csv(fnam, header=FALSE, skip=1, sep=';', col.names=c('ID','tweet','author','local.time','extra'), colClasses=rep('character', 5))}
#Read all the files into one data.frame
my.df <- do.call("rbind", lapply(filenames, my.read.csv))
length(my.df[,1])
#Remove the duplicate tweets
my.new.df <- my.df[!duplicated(paste(my.df$tweet, my.df$author)),]
length(my.new.df[,1])
#Write new dataframe as a .csv file
write.csv(my.new.df, file =paste("Dataset", ".csv"))

Although the code does what it is supposed to do, the output file is messy. The original csv files all have the following structure:

tweet                                                         author    local.time
2012-06-05 00:01:45 @A (A1):  Cruijff z'n (...)#bureausport.  A (A1)    05-06-12 00:01
2012-06-05 00:01:41 @B (B1):  Welterusten #BureauSport        B (B1)    05-06-12 00:01
2012-06-05 00:01:38 @C (C1):  Echt (...) #bureausport         C (C1)    05-06-12 00:01
2012-06-05 00:01:38 @D (D1):  LOL. #bureausport               D (D1)    05-06-12 00:01

But the output file has the following structure:

,"ID","tweet","author","local.time","extra"
1,"2012-06-05 00:01:45 @A (A1):  Cruijff z'n (...)#bureausport.","@A (A1)","05-06-12 00:01"
2,"2012-06-05 00:01:41 @B (B1):  Welterusten #BureauSport","@B (B1)","05-06-12 00:01"
3,"2012-06-05 00:01:38 @C (C1):  Echt (...) #bureausport","Aliceislovely (Alice Luyben)","05-06-12 00:01"
4,"2012-06-05 00:01:38 @D (D1):  LOL. #bureausport","@D (D1)","05-06-12 00:01"

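So it represents the data as quoted strings rather than as columns. I hope you can help me adjust the code (above) so that the output file has the same column structure as the original (input) csv files.

By the way, the following code was used to create the csv files: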

library(XML)   # htmlTreeParse

twitter.search <- "Keyword"

QUERY <- URLencode(twitter.search)

# Set time loop (in seconds)
d_time = 300
number_of_times = 3000

for(i in 1:number_of_times){

  tweets <- NULL
  tweet.count <- 0
  page <- 1
  read.more <- TRUE

  while (read.more)
  {
    # construct Twitter search URL
    URL <- paste('http://search.twitter.com/search.atom?q=',QUERY,'&rpp=100&page=', page, sep='')
    # fetch remote URL and parse
    XML <- htmlTreeParse(URL, useInternal=TRUE, error = function(...){})

    # Extract list of "entry" nodes
    entry     <- getNodeSet(XML, "//entry")

    read.more <- (length(entry) > 0)
    if (read.more)
    {
      for (j in seq_along(entry))
      {
        subdoc     <- xmlDoc(entry[[j]])   # put entry in separate object to manipulate

        published  <- unlist(xpathApply(subdoc, "//published", xmlValue))

        published  <- gsub("Z"," ", gsub("T"," ",published) )

        # Convert from GMT to central time
        time.gmt   <- as.POSIXct(published,"GMT")
        local.time <- format(time.gmt, tz="Europe/Amsterdam")

        title  <- unlist(xpathApply(subdoc, "//title", xmlValue))

        author <- unlist(xpathApply(subdoc, "//author/name",  xmlValue))

        tweet  <-  paste(local.time, " @", author, ":  ", title, sep="")

        entry.frame <- data.frame(tweet, author, local.time, stringsAsFactors=FALSE)
        tweet.count <- tweet.count + 1
        rownames(entry.frame) <- tweet.count
        tweets <- rbind(tweets, entry.frame)
      }
      page <- page + 1
      read.more <- (page <= 15)   # Seems to be 15 page limit
    }
  }

  names(tweets)

  # top 15 tweeters
  #sort(table(tweets$author),decreasing=TRUE)[1:15]

  write.table(tweets, file=paste("Twitts - ", format(Sys.time(), "%a %b %d %H_%M_%S %Y"), ".csv"), sep = ";")

  Sys.sleep(d_time)

} # end for

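1 answer:

Answer 0 (score: 2)

It looks like you want the output to have tab-delimited fields, with no quotes around the values in each column. Here is how you can do that: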

write.table(mtcars, "mtcars.txt", quote=FALSE, sep="\t")

For a quick preview and comparison of the output of write.csv() and of the call above, try the following (with your own data):

write.csv(head(mtcars))
write.table(head(mtcars), quote=FALSE, sep="\t")
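
Applied to data shaped like the question's, the same idea might look like the sketch below. The sample rows, the temporary directory, the helper name read.one and the output name Dataset.txt are all made up for illustration. The sketch assumes the input files were written by write.table(..., sep = ";") with row names, as in the question's collection script; in that case the header line has one field fewer than the data lines and read.table() uses the first column as row names.

```r
# Hypothetical sample data with the same three columns as the question's files.
tweets.a <- data.frame(
  tweet      = c("2012-06-05 00:01:45 @A (A1):  first #tag",
                 "2012-06-05 00:01:41 @B (B1):  second #tag"),
  author     = c("A (A1)", "B (B1)"),
  local.time = c("2012-06-05 00:01:45", "2012-06-05 00:01:41"),
  stringsAsFactors = FALSE
)
tweets.b <- tweets.a[2, ]   # overlaps with the first file, so it is a duplicate

# Write two input files the way the collection script does (sep = ";").
in.dir <- tempfile("tweets")
dir.create(in.dir)
write.table(tweets.a, file = file.path(in.dir, "a.csv"), sep = ";")
write.table(tweets.b, file = file.path(in.dir, "b.csv"), sep = ";")

# Read every file back. comment.char = "" keeps "#" in hashtags from being
# treated as a comment marker; quote = "\"" leaves apostrophes alone.
filenames <- list.files(path = in.dir, pattern = "\\.csv$", full.names = TRUE)
read.one <- function(fnam) {
  read.table(fnam, header = TRUE, sep = ";", quote = "\"",
             comment.char = "", colClasses = "character")
}
merged <- do.call("rbind", lapply(filenames, read.one))

# Drop duplicate tweets, using the same key as the question's code.
deduped <- merged[!duplicated(paste(merged$tweet, merged$author)), ]

# Write tab-delimited output without quotes and without a row-name column.
out.file <- file.path(in.dir, "Dataset.txt")
write.table(deduped, file = out.file, quote = FALSE, sep = "\t",
            row.names = FALSE)
readLines(out.file)
```

Setting row.names = FALSE is what removes the leading 1, 2, 3, ... column seen in the question's output; leave it out if you want to keep that column.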

Edit: If (instead of tab-delimited fields) you need the data in each column to line up exactly, have a look at write.fwf in the gdata package. ("fwf" stands for "fixed width format".)
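
A minimal sketch of fixed-width output follows. The sample data frame and the output path are hypothetical. It calls gdata::write.fwf when the gdata package is installed; otherwise it falls back to a base R approximation (padding each column with format()), which is a stand-in for illustration and not what write.fwf does internally.

```r
# Hypothetical sample data.
df <- data.frame(
  author     = c("A (A1)", "Someone (Longer Name)"),
  local.time = c("05-06-12 00:01", "05-06-12 00:02"),
  stringsAsFactors = FALSE
)
out <- tempfile(fileext = ".txt")

if (requireNamespace("gdata", quietly = TRUE)) {
  # Fixed-width output via gdata, two spaces between columns.
  gdata::write.fwf(df, file = out, sep = "  ")
} else {
  # Base R fallback: pad the header and values of each column to a common
  # width, then paste the columns together row by row.
  cols <- lapply(names(df), function(n) format(c(n, df[[n]])))
  writeLines(do.call(paste, c(cols, sep = "  ")), out)
}

# Print the file so the alignment can be inspected.
cat(readLines(out), sep = "\n")
```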