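I have a set of tweets that I pull from Twitter's API using R. I save them in .CSV files, one per day. Some days I forget to run the script, so I can end up with something like
file2015-08-24 file2015-08-22
I want to be more organized. Because I missed 8-23, the tweets from 8-23 are stored only in the 2015-08-24 file. I want to create a new .CSV, move every tweet whose "created_at" time is 08-23 into the 08-23 file, and leave the tweets created on 08-24 in the 08-24 file.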
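Moving the tweets by date works fine. However, when I transfer tweets to the new file, my CSV files get corrupted. Some strange interaction with commas is going on.
Here is an example. On July 31, Hillary Clinton sent a tweet. In my .CSV file, the "text" column is stored as
H&amp;
Note that the above is all in one cell. I know the H&amp; appears on a second line with a lot of white space in between, but in CSV format this is a single cell. However, when I rewrite this row to a new CSV file, I get....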
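If I open the same two files in Notepad++ to inspect the raw CSV: in the original file, the first few columns of the tweet look normal, and then the "text" column sits on two different lines. The first line is:
"RT @TheBriefing2016: Phasing out Medicare, repealing the ACA, and erecting barriers to voting means only some get the ""right to rise.""
The second line of "text" is:
H&amp;",
When I open the file I rewrote, the text is also on two lines:
"RT @TheBriefing2016: Phasing out Medicare, repealing the ACA, and erecting barriers to voting means only some get the "right to rise."
H&amp;",
I'm not sure what makes it display correctly in the original file but not in the rewritten one. This isn't the only tweet that does this. Some other tweets containing quotation marks do it too. It looks to me like the text is breaking out of its quotes.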
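A minimal sketch of what may be going on, assuming the problem is how embedded quotes are written out. `write.table` defaults to `qmethod = "escape"`, which writes an embedded quote as `\"`, while `read.csv` expects embedded quotes to be doubled (`""`). The data frame and temporary files below are made up for illustration and are not my real data.

```r
# Hypothetical one-row data frame: `text` contains embedded quotes and a newline
df <- data.frame(
  created_at = "Fri Jul 31 12:00:00 +0000 2015",
  text = "only some get the \"right to rise.\"\nH&amp;",
  stringsAsFactors = FALSE
)

tmp_escape <- tempfile(fileext = ".csv")
tmp_double <- tempfile(fileext = ".csv")

# Default qmethod = "escape": embedded quotes are written as \"
write.table(df, tmp_escape, sep = ",", row.names = FALSE)

# qmethod = "double": embedded quotes are written as ""
write.table(df, tmp_double, sep = ",", row.names = FALSE, qmethod = "double")

# Print the raw lines of each file to compare the two quoting styles
writeLines(readLines(tmp_escape))
writeLines(readLines(tmp_double))

# Read the doubled-quote file back and compare it with the original text
back_double <- read.csv(tmp_double, stringsAsFactors = FALSE)
identical(back_double$text, df$text)
```

If this is the cause, passing `qmethod = "double"` to `write.table`, or using `write.csv` (which always doubles quotes), would be the thing to try.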
Below is the code I use to move tweets from one file to another.
for(curFile in filenames){
  ## Read in the file
  info = read.csv(curFile, header=TRUE, sep=",")
  ## Updated DF will hold what is in our original file, MINUS the rows that are getting removed.
  updatedDF = info
  ## Get the file date
  fileDate = curFile
  fileDate = substr(fileDate, 78, 300)
  #fileDate=substr(fileDate,85,300)
  fileDate = substr(fileDate, 0, 10)
  ## Get the header from the file
  header = names(info)
  ## Figure out how many rows of data we have
  ## This is the number of tweets we have in this data file
  numTweets = dim(info)[1]
  ## For every tweet, starting with tweet #1, up to the last tweet (numTweets)
  for( x in 1:numTweets) {
    ## Get the tweet's date
    ## We want to get this as a VECTOR so we can do character / string manipulations on it
    tweetDateLine = as.vector(info[x, "created_at"])
    ### To get the date from the file, we are going to need to do some editing to the string
    year = substr(tweetDateLine, nchar(tweetDateLine)-3, 300)
    monthDay = substr(tweetDateLine, 5, 10)
    ### Strip the white space from these
    year = gsub(" ", "", year)
    monthDay = gsub(" ", "", monthDay)
    ### Put them together for a cohesive MMMDDYYYY
    tweetDate = paste(monthDay, year, sep="")
    ### Finally convert this to YYYY-MM-DD format like our original date has as extracted from the file name
    tweetDate = as.Date(tweetDate, "%B%d%Y")
    ### Now we can compare
    ### Make a boolean variable. If it is TRUE they are the same
    isTheSame = (fileDate == tweetDate)
    ### If the date of the tweet and the date of the file are the same...
    if(isTheSame){
      ### Skip to the next tweet
      next
    } ## if(isTheSame){
    ### If the date of the tweet and the date of the file are not the same...
    else{
      ### See if a file exists for the date of that tweet.
      ### First, construct the file name with the path + the date + .csv
      potentialFileName = paste(path, tweetDate, ".csv", sep="")
      ### Next, see if it exists!
      fileExists = file.exists(potentialFileName)
      ### If the file already exists...
      if(fileExists){
        ### Now we need to add the data
        ### To get row "x" of the data...
        entireRow = info[x,]
        ### Now append the row to that file
        cat(sprintf("Writing tweet to file!\n"))
        write.table(rbind(entireRow),file=potentialFileName,row.names=FALSE,col.names=FALSE,sep=",",append=TRUE)
        ### Delete this line from the original file
        updatedDF = updatedDF[updatedDF$created_at != tweetDateLine, ]
      } ##if(fileExists){
      ### If the file does not already exist
      else{
        ### Create the file
        cat(sprintf("Creating file for date : %s \n", tweetDate))
        file.create(potentialFileName)
        ### Add the header line
        cat(sprintf("Inserting header!\n"))
        write.table(rbind(header), file=potentialFileName, row.names=FALSE, col.names=FALSE, sep=",")
        ### Now we need to add the data
        ### To get row "x" of the data...
        entireRow = info[x,]
        ### Now append the row to that file
        cat(sprintf("Writing tweet to file!\n"))
        write.table(rbind(entireRow),file=potentialFileName,row.names=FALSE,col.names=FALSE,sep=",",append=TRUE)
        ### Delete this line from the original file
        updatedDF = updatedDF[updatedDF$created_at != tweetDateLine,]
      } ## else{
    } ## else{
  }##for( x in 1:numTweets) {
  # Now we must take the updatedDF, which contains the original CSV minus the deleted lines
  # And write it back to the original file
  # Start with replacing the header
  cat(sprintf("Inserting header!\n"))
  write.table(rbind(header), file=curFile, row.names=FALSE, col.names=FALSE, sep=",")
  # Now print the dataframe back
  cat(sprintf("Inserting dataframe!\n"))
  write.table(updatedDF, file=curFile, row.names=FALSE, col.names = FALSE, sep=",", append=TRUE)
} ## for(curFile in filenames){
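For comparison, here is a rough sketch of how the per-row loop above might be condensed by splitting the data frame by day and writing each piece once. It assumes "created_at" looks like "Fri Jul 31 12:00:00 +0000 2015" and that the session uses English month names. The helper `split_tweets_by_day`, the sample tweets, and the output directory are all hypothetical.

```r
# Hypothetical helper: write the rows of `info` into one CSV per calendar day
split_tweets_by_day <- function(info, out_dir) {
  created <- as.character(info$created_at)
  # Build "Jul 31 2015" from characters 5-10 (month and day) plus the last 4 (year)
  stamp <- paste(substr(created, 5, 10),
                 substr(created, nchar(created) - 3, nchar(created)))
  day <- as.Date(stamp, format = "%b %d %Y")  # %b depends on the locale

  pieces <- split(info, format(day, "%Y-%m-%d"))
  for (d in names(pieces)) {
    out_file <- file.path(out_dir, paste0(d, ".csv"))
    new_file <- !file.exists(out_file)
    # Write the header only for a new file; otherwise append rows.
    # qmethod = "double" keeps embedded quotes readable by read.csv.
    write.table(pieces[[d]], out_file, sep = ",", row.names = FALSE,
                col.names = new_file, append = !new_file, qmethod = "double")
  }
  invisible(names(pieces))
}

# Made-up tweets spanning two days, one with embedded quotes and a newline
tweets <- data.frame(
  created_at = c("Sun Aug 23 23:50:00 +0000 2015",
                 "Mon Aug 24 00:10:00 +0000 2015"),
  text = c("said \"hello\"\nH&amp;", "plain tweet"),
  stringsAsFactors = FALSE
)
out_dir <- tempfile("tweets_by_day")
dir.create(out_dir)
written <- split_tweets_by_day(tweets, out_dir)
```

Unlike the loop above, this sketch does not rewrite the source file in place. It only shows the split-and-append step.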
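To help further: http://imgur.com/a/9xwx5 shows my view of the original file in Excel / NPP, followed by the view after moving the tweets to a new file.
In case it also helps, the tweet that triggers this (well, one of them; there are several) is a retweet of this tweet: https://twitter.com/TheBriefing2016/status/627212836339453952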