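I have 2 files: file A contains 1174674 tweets and file B contains 704060 tweets. I want to find the tweets in file A that are not in file B, i.e. 1174674 - 704060 = 470614. Please find the program below. MatchWise-Tweets.zip contains a list of 49 files, and the tweets are stored in those 49 separate files. The intent is to get the file names and pass each name in to build the list of tweets present in each of the 49 files.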
import csv
import zipfile

totTweets = 1174674
matchTweets = 704060
remaining = totTweets - matchTweets
lst = []
store = []
total = 0
opFile = csv.writer(open('leftover.csv', "wb"))
mainFile = csv.reader(open('final_tweet_set.csv', 'rb'), delimiter=',', quotechar='|')
with zipfile.ZipFile('MatchWise-Tweets.zip', 'r') as zfile:
    for name in zfile.namelist():
        lst.append(name)
for getName in lst:
    inFile = csv.reader(open(getName, 'rb'), delimiter=',', quotechar='|')
    for row in inFile:
        store.append(row)
length = len(store)
print length
count=0
for main_row in mainFile:
    flag=0
    main_tweetID = main_row[0]
    for getTweet in store:
        get_tweetID = getTweet[0]
        if main_tweetID == get_tweetID:
            flag = 1
            #print "Flag == 1 condition--",flag
            break
    if flag ==1:
        continue
    elif flag == 0:
        count+=1
        remaining-=1
        #print "Flag == 0 condition--"
        #print flag
        opFile.writerow(main_row)
print remaining
Actual result - 573655
Expected result - 470614
File structure -
566813957629808000,saddest thing about this world cup is that we won't see mo irfan bowling at the waca in perth :( #pakvind #indvspak #cwc15 @waca_cricket,15/02/2015 15:19
566813959076855000,"#pakvsind 50,000 tickets for the game were sold out in 20 minutes #cwc15 #phordey #indvspak",15/02/2015 15:19
566813961505366000,think india will give sohail his first 5 for.. smh.. #indvspak #cwc15,15/02/2015 15:19
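The first column is the tweet-id, the second is the tweet-text, and the third is the tweet-date. I just want to know whether there is a problem in this program, because I am not getting the expected result.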
Answer 0 (score: 0)
import difflib

# Read both files fully; ndiff compares them line by line.
with open("PATH OF FILE 1", "r") as file1, open("PATH OF FILE 2", "r") as file2:
    diff = difflib.ndiff(file1.readlines(), file2.readlines())
    # Lines prefixed with '- ' exist only in file 1; x[2:] strips the prefix.
    delta = ''.join(x[2:] for x in diff if x.startswith('- '))
print delta
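As a rough illustration of what the `'- '` filter keeps, here is a minimal Python 3 sketch on two small in-memory lists (the sample lines are made up for illustration). `difflib.ndiff` marks lines found only in the first sequence with `'- '`. Because it is a sequential, whole-line comparison, the two files would need identical line text and a comparable ordering for this to isolate the missing tweets.

```python
import difflib

# Hypothetical sample lines standing in for the two files.
lines_a = ["1,first tweet\n", "2,second tweet\n", "3,third tweet\n"]
lines_b = ["1,first tweet\n", "3,third tweet\n"]

diff = difflib.ndiff(lines_a, lines_b)
# Lines starting with '- ' occur only in lines_a; x[2:] drops the prefix.
only_in_a = [x[2:] for x in diff if x.startswith("- ")]
print(only_in_a)
```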
Answer 1 (score: 0)
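import difflib
import csv

with open('final_tweet_set.csv', 'rb') as file1, open("matchTweets_combined.csv", "rb") as file2:
    diff = difflib.ndiff(file1.readlines(), file2.readlines())
    # Keep the lines found only in the first file, as a list of lines.
    # (Joining them into one string would make the loop below iterate over characters.)
    delta = [x[2:] for x in diff if x.startswith('- ')]
#print delta
with open("leftover_new.csv", "wb") as out:
    fout = csv.writer(out)
    # Parse the leftover lines as CSV so each tweet is written as one row.
    for eachrow in csv.reader(delta):
        fout.writerow(eachrow)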
Answer 2 (score: 0)
Your code assumes that your files contain no duplicates. That may not be the case, which would explain why your result is off.
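One way to test that assumption is to count how often each tweet ID occurs. The sketch below is Python 3 and uses a hypothetical helper name, `duplicate_ids`, with made-up sample rows; in practice the rows would come from `csv.reader` over `final_tweet_set.csv` or the 49 match files.

```python
from collections import Counter

def duplicate_ids(rows):
    """Return {tweet_id: occurrences} for IDs that appear more than once."""
    counts = Counter(row[0] for row in rows if row)  # skip empty rows
    return {tweet_id: n for tweet_id, n in counts.items() if n > 1}

# Made-up rows in the question's layout: id, text, date.
sample = [
    ["566813957629808000", "tweet a", "15/02/2015 15:19"],
    ["566813959076855000", "tweet b", "15/02/2015 15:19"],
    ["566813957629808000", "tweet a", "15/02/2015 15:19"],
]
print(duplicate_ids(sample))
```

If this returns a non-empty result for either input, a plain subtraction of the two row counts will not match the number of leftover tweets.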
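Using sets rather than lists should make it easier to get the right result, and it will be faster (since it compares only tweet IDs rather than whole tweets and their metadata).
The following uses sets and is more compact and readable. It is incomplete: you will have to add the bits that open the zip file and opFile (and close them).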
tweet_superset = set()  # your store
for getName in lst:
    inFile = csv.reader(open(getName, 'rb'), delimiter=',', quotechar='|')
    tweet_superset.update(entry[0] for entry in inFile)
# using a set means we ignore any duplicate tweets in the 49 source files.
length = len(tweet_superset)
print length
seen_tweets = set()
for entry in mainFile:
    id_ = entry[0]
    # keep only tweets that are absent from the 49 source files
    if id_ not in tweet_superset:
        if id_ in seen_tweets:
            print "Error, this tweet appears more than once in mainFile:", entry
        else:
            opFile.writerow(entry)
            seen_tweets.add(id_)
count = len(seen_tweets)
print count
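Since the snippet above leaves out the file handling, here is one way the whole thing might look end to end. It is a sketch in Python 3, not a drop-in replacement: the function names `load_ids_from_zip` and `write_leftover` are made up, it reads the 49 files directly out of the archive rather than from extracted copies, it assumes UTF-8 text, and it assumes the default double-quote CSV quoting that the sample rows appear to use (the question's code passes `quotechar='|'`).

```python
import csv
import io
import zipfile

def load_ids_from_zip(zip_path):
    """Collect the tweet IDs (first column) from every file in the archive."""
    ids = set()
    with zipfile.ZipFile(zip_path) as zfile:
        for name in zfile.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            with zfile.open(name) as raw:
                # Assumption: the member files are UTF-8 encoded CSV.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                ids.update(row[0] for row in csv.reader(text) if row)
    return ids

def write_leftover(main_path, matched_ids, out_path):
    """Write rows of main_path whose ID is not in matched_ids; return the count."""
    seen = set()
    with open(main_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row or row[0] in matched_ids or row[0] in seen:
                continue  # empty, matched, or a duplicate already written
            writer.writerow(row)
            seen.add(row[0])
    return len(seen)
```

Called as `write_leftover('final_tweet_set.csv', load_ids_from_zip('MatchWise-Tweets.zip'), 'leftover.csv')`, the return value is the number of distinct unmatched tweets, which can be compared against the expected 470614.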