R - Writing from one CSV to another CSV - weird comma interaction

Time: 2015-08-25 19:23:20

Tags: r csv twitter formatting

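I have a collection of tweets pulled from Twitter's API using R. I save these in .CSV files, one per day. Some days I forget to run the script, so I may end up with something like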

file2015-08-24 file2015-08-22

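I want to be more organized. Since I missed 8-23, the tweets from 8-23 are stored only in the 2015-08-24 file. I want to create a new .CSV, move every tweet whose "created_at" time falls on 08-23 into the 08-23 file, and leave the tweets created on 08-24 in the 08-24 file.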
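The grouping key for this is just the calendar date inside "created_at". A minimal sketch of extracting it, assuming the strings look like "Fri Jul 31 19:23:20 +0000 2015" (the layout that the substr() offsets in the code further down rely on). The name tweet_date is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: pull the calendar date out of a Twitter-style
# created_at string such as "Fri Jul 31 19:23:20 +0000 2015".
# ASSUMPTION: that fixed layout; other layouts yield NA.
tweet_date <- function(created_at) {
  created_at <- as.character(created_at)  # also handles factor columns
  # month.abb is R's built-in vector of English month abbreviations,
  # so the month lookup does not depend on the session locale.
  month <- match(substr(created_at, 5, 7), month.abb)
  day   <- as.integer(substr(created_at, 9, 10))
  year  <- substr(created_at, nchar(created_at) - 3, nchar(created_at))
  as.Date(sprintf("%s-%02d-%02d", year, month, day), format = "%Y-%m-%d")
}

# The date is taken as written in the string (no time zone conversion).
tweet_date(c("Fri Jul 31 19:23:20 +0000 2015",
             "Sun Aug 23 01:02:03 +0000 2015"))
```

Because the result is a Date vector, it can be compared against the date taken from the file name or formatted straight into a file name.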

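Moving the tweets by date works perfectly. However, when I transfer tweets into the new file, my CSV file gets broken. Some weird comma interaction is happening.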

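Here's an example. On July 31st, Hillary Clinton retweeted a tweet. In my .CSV file, the "text" column stores it as

TEXT

RT @TheBriefing2016: Phasing out Medicare, repealing the ACA, and building barriers to voting means only some people get a "right to rise."

h&amp;

Note that the above is all in a single cell. I know the h&amp; appears on a second line with a lot of white space in between, but in the CSV this is ONE cell. However, when I rewrite this row to a new CSV file, I get....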

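TEXT

RT @TheBriefing2016: Phasing out Medicare, repealing the ACA, and building barriers to voting means only some people get a "right to rise."

Then my CSV jumps to the next row and puts this in the first column: h&amp;"

After that the data keeps populating columns normally, but the values are obviously in the "wrong" columns, because we jumped to the next row and started again at column 1.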

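If I open the same two files in Notepad++ to examine the raw CSV: in the original file, the first few columns of the tweet look normal, and then the "text" column spans two lines. The first line is:

"RT @TheBriefing2016: Phasing out Medicare, repealing the ACA, and building barriers to voting means only some people get a ""right to rise.""

and the second line of the "text" field is:

h&amp;",

When I open the file I rewrote it to, the field is also on two lines:

"RT @TheBriefing2016: Phasing out Medicare, repealing the ACA, and building barriers to voting means only some people get a "right to rise."

h&amp;",

I'm not sure what causes it to display correctly in the original file but not in the rewritten one. This isn't the only tweet that does this; some others that contain quotation marks do it too. It feels like the text is breaking out of its quotes.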
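One thing worth checking here, offered as a possible cause and not something confirmed in this thread: the two Notepad++ views differ in how the embedded quotes are written. write.table() defaults to qmethod = "escape", which writes an embedded quote as a backslash followed by the quote, whereas the doubled-quote convention seen in the original file corresponds to qmethod = "double" (which is also what write.csv() uses). The sketch below writes the same made-up row both ways so the raw text can be compared; the data frame and file names are illustrative only.

```r
# One made-up row whose text field holds embedded double quotes and a line break.
tweet <- data.frame(
  id   = 1L,
  text = "only some people get a \"right to rise.\"\nh&amp;",
  stringsAsFactors = FALSE
)

escaped <- tempfile(fileext = ".csv")
doubled <- tempfile(fileext = ".csv")

# Default qmethod = "escape": embedded quotes are written as \"
write.table(tweet, escaped, sep = ",", row.names = FALSE)

# qmethod = "double": embedded quotes are written as ""
write.table(tweet, doubled, sep = ",", row.names = FALSE, qmethod = "double")

# Print the raw contents of each file to compare the quoting.
writeLines(readLines(escaped))
writeLines(readLines(doubled))

# Read the doubled-quote file back and compare with what was written.
back <- read.csv(doubled, stringsAsFactors = FALSE)
identical(back$text, tweet$text)
```

If the rewritten files turn out to contain backslash-escaped quotes, passing qmethod = "double" to the existing write.table() calls would be the smallest change to try.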

Here is the code I use to move tweets from one file to another.

for(curFile in filenames){
## Read in the file
info = read.csv(curFile, header=TRUE, sep=",")

## Updated DF will hold what is in our original file, MINUS the rows that are getting removed.
updatedDF = info
## Get the file date
fileDate = curFile
fileDate = substr(fileDate, 78, 300)
#fileDate=substr(fileDate,85,300)
fileDate = substr(fileDate, 0, 10)

## Get the header from the file
header = names(info)

## Figure out how many rows of data we have
## This is the number of tweets we have in this data file
numTweets = dim(info)[1]

## For every tweet, starting with tweet #1, up to the last tweet (numTweets)
for( x in 1:numTweets) {
    ## Get the tweets date
    ## We want to get this as a VECTOR so we can do character / string manipulations on it
    tweetDateLine = as.vector(info[x, "created_at"])

    ### To get the date from the file, we are going to need to do some editing to the string
    year = substr(tweetDateLine, nchar(tweetDateLine)-3, 300)
    monthDay = substr(tweetDateLine, 5, 10)

    ### Strip the white space from these
    year = gsub(" ", "", year)
    monthDay = gsub(" ", "", monthDay)

    ### Put them together for a cohesive MMMDDYYYY
    tweetDate = paste(monthDay, year, sep="")

    ### Finally convert this to YYYY-MM-DD format like our original date has as extracted from the file name
    tweetDate = as.Date(tweetDate, "%B%d%Y")

    ### Now we can compare
    ### Make a boolean variable. If it is TRUE they are the same
    isTheSame = (fileDate == tweetDate)

    ### If the date of the tweet and the date of the file are the same...
    if(isTheSame){
        ### Skip to the next tweet
        next
    } ## if(isTheSame){

    ### If the date of the tweet and the date of the file are not the same...
    else{
        ### See if a file exists for the date of that tweet. 
        ### First, construct the file name with the path + the date + .csv
        potentialFileName = paste(path, tweetDate, ".csv", sep="")

        ### Next, see if it exists!
        fileExists = file.exists(potentialFileName)

        ### If the file already exists...
        if(fileExists){

            ### Now we need to add the data
            ### To get row "x" of the data...
            entireRow = info[x,]

            ### Now append the row to that file
            cat(sprintf("Writing tweet to file!\n"))
            write.table(rbind(entireRow),file=potentialFileName,row.names=FALSE,col.names=FALSE,sep=",",append=TRUE)                

            ### Delete this line from the original file
            updatedDF = updatedDF[updatedDF$created_at != tweetDateLine, ]
        } ##if(fileExists){

        ### If the file does not already exist
        else{


            ### Create the file
            cat(sprintf("Creating file for date : %s \n", tweetDate))
            file.create(potentialFileName)

            ### Add the header line
            cat(sprintf("Inserting header!\n"))
            write.table(rbind(header), file=potentialFileName, row.names=FALSE, col.names=FALSE, sep=",")

            ### Now we need to add the data
            ### To get row "x" of the data...
            entireRow = info[x,]

            ### Now append the row to that file
            cat(sprintf("Writing tweet to file!\n"))
            write.table(rbind(entireRow),file=potentialFileName,row.names=FALSE,col.names=FALSE,sep=",",append=TRUE)

            ### Delete this line from the original file
            updatedDF = updatedDF[updatedDF$created_at != tweetDateLine,]
        } ## else{

    } ## else{

}##for( x in 1:numTweets) {

# Now we must take the updatedDF, which contains the original CSV minus the deleted lines
# And write it back to the original file
# Start with replacing the header
cat(sprintf("Inserting header!\n"))
write.table(rbind(header), file=curFile, row.names=FALSE, col.names=FALSE, sep=",")

# Now print the dataframe back
cat(sprintf("Inserting dataframe!\n"))
write.table(updatedDF, file=curFile, row.names=FALSE, col.names = FALSE, sep=",", append=TRUE)

} ## for(curFile in filenames){
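For comparison, here is a hedged sketch of a vectorized variant of the loop above. It is not the code from the question: split_by_day, day_of and out_dir are hypothetical names, it assumes "created_at" looks like "Sun Aug 23 23:59:00 +0000 2015", and it writes one file per day into a separate directory instead of editing the source files in place. Every write uses qmethod = "double" so that embedded quotes are doubled.

```r
# Sketch only: split a data frame of tweets into one CSV per calendar day.
split_by_day <- function(info, out_dir) {
  created <- as.character(info$created_at)

  # Same idea as the substr() calls in the loop: month, day and year are
  # cut out of the string. month.abb avoids any dependence on the locale.
  day_of <- as.Date(sprintf("%s-%02d-%02d",
                            substr(created, nchar(created) - 3, nchar(created)),
                            match(substr(created, 5, 7), month.abb),
                            as.integer(substr(created, 9, 10))),
                    format = "%Y-%m-%d")

  # One data frame per day, named "YYYY-MM-DD" (rows with NA dates are dropped).
  pieces <- split(info, format(day_of))

  for (day in names(pieces)) {
    target   <- file.path(out_dir, paste0(day, ".csv"))
    existing <- file.exists(target)
    # Write the header only when creating the file; append otherwise.
    write.table(pieces[[day]], target, sep = ",", row.names = FALSE,
                col.names = !existing, append = existing, qmethod = "double")
  }
  invisible(names(pieces))
}

# Made-up data: two tweets on different days, one with quotes and a line break.
demo <- data.frame(
  created_at = c("Sun Aug 23 23:59:00 +0000 2015",
                 "Mon Aug 24 00:01:00 +0000 2015"),
  text = c("said \"hi\"\nsecond line", "plain"),
  stringsAsFactors = FALSE
)
out_dir <- tempfile("tweets_")
dir.create(out_dir)
split_by_day(demo, out_dir)
list.files(out_dir)
```

Writing a whole day's rows in one call also avoids appending to the same file once per tweet.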

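To help further: http://imgur.com/a/9xwx5 shows how the original looks in Excel / NPP, and then how it looks after the tweets were moved to the new file.

In case it also helps, the tweet causing this (well, one of them; there are several) is a retweet of this tweet: https://twitter.com/TheBriefing2016/status/627212836339453952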

0 answers:

No answers