Question

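I have two files: file A contains 1,174,674 tweets and file B contains 704,060 tweets. I want to find the tweets in file A that are not in file B, i.e. A - B, which should come to 1174674 - 704060 = 470614. Please find the program below. MatchWise-Tweets.zip holds a list of 49 files, and the tweets are stored across those 49 separate files. The intent is to get the file names and pass each one in turn to build the list of tweets present in each of the 49 files.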

import csv
import zipfile

totTweets = 1174674
matchTweets = 704060
remaining = totTweets - matchTweets     
lst = []
store = []
total = 0       
opFile = csv.writer(open('leftover.csv', "wb"))  # must be defined: opFile.writerow() is called in the loop below
mainFile = csv.reader(open('final_tweet_set.csv', 'rb'), delimiter=',', quotechar='|') 
with zipfile.ZipFile('MatchWise-Tweets.zip', 'r') as zfile:
    for name in zfile.namelist():
        lst.append(name)

for getName in lst:
    inFile = csv.reader(open(getName, 'rb'), delimiter=',', quotechar='|')
    for row in inFile:
        store.append(row)

length = len(store)
print length

count=0
for main_row in mainFile:
    flag=0
    main_tweetID = main_row[0]
    for getTweet in store:
        get_tweetID = getTweet[0]
        if main_tweetID == get_tweetID:
            flag = 1
            #print "Flag == 1 condition--",flag
            break
    if flag ==1:
        continue
    elif flag == 0:
        count+=1
        remaining-=1
        #print "Flag == 0 condition--"
        #print flag
        opFile.writerow(main_row)
        print remaining

Actual result - 573655

Expected result - 470614

File structure -

566813957629808000,saddest thing about this world cup is that we won't see mo irfan bowling at the waca in perth :( #pakvind #indvspak #cwc15 @waca_cricket,15/02/2015 15:19
566813959076855000,"#pakvsind 50,000 tickets for the game were sold out in 20 minutes #cwc15 #phordey #indvspak",15/02/2015 15:19
566813961505366000,think india will give sohail his first 5 for.. smh.. #indvspak #cwc15,15/02/2015 15:19

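The first column is the tweet ID, the second is the tweet text, and the third is the tweet date. I just want to know whether there is a problem in this program, because I am not getting the expected result.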

Answer 1

import difflib
file1 = "PATH OF FILE 1"
file1 = open(file1, "r")
file2 = "PATH OF FILE 2"
file2 = open(file2, "r")
diff = difflib.ndiff(file1.readlines(), file2.readlines())
file1.close()
file2.close()
delta = ''.join(x[2:] for x in diff if x.startswith('- '))
print delta
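
To make the `'- '` filter above easier to follow, here is a minimal sketch of what `difflib.ndiff` yields on a toy input. The three-line lists are made up for illustration, and the snippet is written for Python 3 (an assumption; the answer itself is Python 2). Each output line carries a two-character prefix, and `'- '` marks lines that occur only in the first sequence.

```python
import difflib

# Toy stand-ins for file1.readlines() and file2.readlines() (hypothetical data).
lines_a = ["1,a\n", "2,b\n", "3,c\n"]
lines_b = ["1,a\n", "3,c\n"]

# ndiff prefixes each line: "  " = in both, "- " = only in a, "+ " = only in b.
diff = list(difflib.ndiff(lines_a, lines_b))

# Strip the two-character prefix from the lines that exist only in lines_a.
only_in_a = [line[2:] for line in diff if line.startswith("- ")]

print(only_in_a)
```

Note that `ndiff` compares the two inputs as ordered sequences of whole lines, so it only helps here if matching tweets are byte-for-byte identical in both files, and it may be slow on inputs of around a million lines.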

Answer 2

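import difflib
file1 = open('final_tweet_set.csv', 'rb')
file2 = open("matchTweets_combined.csv", "rb")
diff = difflib.ndiff(file1.readlines(), file2.readlines())
file1.close()
file2.close()
# Keep the lines found only in file1 as a list. Joining them into one string
# would make the write loop iterate over single characters instead of rows.
delta = [x[2:] for x in diff if x.startswith('- ')]

#print delta
# The lines are already CSV-formatted, so write them out unchanged.
fout = open("leftover_new.csv", "wb")
fout.writelines(delta)
fout.close()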

Answer 3

Your code assumes that your files contain no duplicates. That may not be the case, which would explain why your result is off.
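
As a rough illustration of why duplicates break the subtraction, the sketch below uses made-up toy rows (not the real tweets) and a hypothetical helper, `duplicate_ids`. It is written for Python 3. Subtracting row counts only equals the size of A - B when neither file repeats an ID and every ID in B also occurs in A.

```python
from collections import Counter


def duplicate_ids(rows):
    """Return {tweet_id: occurrences} for every ID seen more than once."""
    counts = Counter(row[0] for row in rows if row)
    return {tweet_id: n for tweet_id, n in counts.items() if n > 1}


# Toy stand-ins for file A and file B (hypothetical data).
file_a = [["1", "a"], ["2", "b"], ["2", "b"], ["3", "c"], ["4", "d"]]
file_b = [["2", "b"], ["2", "b"], ["5", "e"]]

# Naive estimate: subtract the row counts, as the question does.
naive = len(file_a) - len(file_b)

# Actual number of distinct IDs in A that never occur in B.
ids_a = {row[0] for row in file_a}
ids_b = {row[0] for row in file_b}
actual = len(ids_a - ids_b)

print(naive, actual, duplicate_ids(file_a))
```

Running `duplicate_ids` over the main file and over the 49 combined files is one way to check whether duplicates are actually present before trusting either number.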

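Using a set instead of a list should make it easier to get the correct result, and it will also be faster (since it compares only tweet IDs rather than whole tweets and their metadata).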

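The following uses sets and is more compact and readable. It is incomplete: you still have to add the bits that open the zip file and opFile (and close them).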

tweet_superset = set() # your store
for getName in lst:
    inFile = csv.reader(open(getName, 'rb'), delimiter=',', quotechar='|')
    tweet_superset.update(entry[0] for entry in inFile)
    # using a set means we ignore any duplicate tweets in the 49 source files.

length = len(tweet_superset)
print length

seen_tweets = set()
for entry in mainFile:
    id_ = entry[0]
    if id_ not in tweet_superset:  # keep only tweets absent from the 49 files (A - B)
        if id_ in seen_tweets:
            print "Error, this tweet appears more than once in mainFile:", entry
        else:
            opFile.writerow(entry)
            seen_tweets.add(id_)

count = len(seen_tweets)
print count
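
For completeness, here is one possible way to fill in the missing pieces, written as a Python 3 sketch rather than a drop-in replacement. The function names `collect_ids` and `write_leftover` are hypothetical. It also assumes the CSV members can be read straight out of the zip archive, that the files are UTF-8, and that fields are quoted with `"` (as the sample rows in the question suggest) rather than `|`.

```python
import csv
import io
import zipfile


def collect_ids(zip_path):
    """Return the set of tweet IDs (first column) from every CSV in the zip."""
    ids = set()
    with zipfile.ZipFile(zip_path) as zfile:
        for name in zfile.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            with zfile.open(name) as raw:
                # Assumption: members are UTF-8 text with '"' as quote char.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                for row in csv.reader(text):
                    if row:
                        ids.add(row[0])
    return ids


def write_leftover(main_path, matched_ids, out_path):
    """Copy rows whose ID is not in matched_ids; return how many were written."""
    written = set()
    with open(main_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row:
                continue
            tweet_id = row[0]
            # Skip matched tweets and repeats of an ID already written.
            if tweet_id in matched_ids or tweet_id in written:
                continue
            writer.writerow(row)
            written.add(tweet_id)
    return len(written)
```

With the file names from the question, this would be called as `write_leftover('final_tweet_set.csv', collect_ids('MatchWise-Tweets.zip'), 'leftover.csv')`. Both files are closed by the `with` blocks, and the returned count is the number of distinct leftover tweet IDs.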
