Question

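Generally speaking, what I want to do is extract the common elements in the shared "word" column across several csv files (2008.csv, 2009.csv, 2010.csv ... 2015.csv).

All files have the same format: 'word', 'count'

'word' contains all the frequent words in one document from a particular year.

Here is a snapshot of one file: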

file 2008.csv

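Whenever any two of the 8 files have elements in common, I want to know those shared elements and which files they appear in. (This is very much like a tf-idf calculation, by the way.)

Anyway, my goal is to understand some trends in the frequently occurring words across these files. (As far as I know, an element can be in at most five of the files.)

I also want to know which words are making their first appearance, i.e. words that are in file C but in neither file B nor file A.

I know that for + if could probably solve the problem here, but that is quite tedious: I would need to compare 2 out of 8, 3 out of 8, 4 out of 8, ... and find the shared elements in each case.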

Here is the code I have written so far. It is far from what I need, since it only compares the elements in two of the 8 files: code

Can anyone help?

Answer 1

Using set intersection may help:

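import pandas as pd

# year_list is assumed to hold the years as strings, e.g. '2008' ... '2015'
for i in range(len(year_list)):
    datai = set(pd.read_csv('filename_' + year_list[i] + '.csv')['word'])
    tocompare = []
    for j in range(i + 1, len(year_list)):
        dataj = set(pd.read_csv('filename_' + year_list[j] + '.csv')['word'])
        # Words shared by file i and file j
        print("Two files: %d %d" % (i, j))
        print(datai.intersection(dataj))
        tocompare.append(dataj)
    # Words shared by file i and every later file
    print("All compare:")
    print(datai.intersection(*tocompare))
    break  # stops after the first file; remove to repeat for every starting file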
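If comparing 2 of 8, 3 of 8, 4 of 8 and so on is the tedious part, one possible alternative is to invert the problem: record, for each word, the set of files it appears in, then keep the words found in at least two files. The following is only a minimal sketch under assumptions. The function name `shared_words`, the `min_files` parameter and the sample words are made up for illustration, and the input is a plain dict from year to word list rather than the real CSV files.

```python
from collections import defaultdict


def shared_words(words_by_year, min_files=2):
    """Map each word to the sorted list of years that contain it,
    keeping only words found in at least `min_files` years."""
    years_by_word = defaultdict(set)
    for year, words in words_by_year.items():
        for word in set(words):
            years_by_word[word].add(year)
    return {
        word: sorted(years)
        for word, years in years_by_word.items()
        if len(years) >= min_files
    }


# Made-up sample data standing in for the 'word' columns of the CSV files.
sample = {
    "2008": ["war", "bank", "china"],
    "2009": ["bank", "china", "flu"],
    "2010": ["china", "oil"],
}

result = shared_words(sample)
for word in sorted(result):
    print(word, result[word])
```

With the real files, the dict could be filled by reading each 'word' column with pd.read_csv, as in the loop above. Raising min_files to 3, 4, ... then gives the "3 of 8", "4 of 8" cases without writing a separate comparison for each.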

Answer 2

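The first answer generally worked well. But for some reason, the intersection function did not return exactly the results I expected. So, for more accuracy and a better-formatted printout, I modified the provided code.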

import pandas as pd

# year_list is assumed to hold the years as strings, e.g. '2008' ... '2015'
for i in range(0, 8):
    # Collect the words of every year other than year i, starting with the earlier years
    otheryears = []
    if i > 0:
        for y in range(0, i):
            datay = set(pd.read_csv("most_50_common_words_" + year_list[y] + '.csv')["word"])
            for w in datay:
                if w not in otheryears:
                    otheryears.append(w)
    uniquei = []
    datai = set(pd.read_csv("most_50_common_words_" + year_list[i] + '.csv')["word"])
    print("\nCompare year %d with:\n" % int(year_list[i]))
    # Pairwise comparison of year i with each later year
    for j in range(i + 1, 8):
        dataj = set(pd.read_csv("most_50_common_words_" + year_list[j] + '.csv')['word'])
        print("%s :" % year_list[j])
        listj = list(datai.intersection(dataj))
        print(listj)
        print("%d common words with year %d" % (len(listj), int(year_list[j])))
        for w in dataj:
            if w not in otheryears:
                otheryears.append(w)

    # Words of year i that also occur in at least one other year
    common = []
    for x in datai:
        if x in otheryears:
            common.append(x)
    print("\nAll compare:")
    print("%d year has %d words in common with other years. They are as follows:\n%s\n" % (int(year_list[i]), len(common), common))
    # Words of year i that occur in no other year
    for x in datai:
        if x not in otheryears:
            uniquei.append(x)
    print("%d Frequent words unique in year %d:\n%s \n" % (len(uniquei), int(year_list[i]), uniquei))
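For the "first appearance" part of the question (words in file C that are in neither file B nor file A), a running set of already-seen words may be enough. This is a rough sketch, not a tested solution for the real data. The name `first_appearances` and the sample words are hypothetical, and it assumes the years are four-digit strings so that sorting the keys puts them in chronological order.

```python
def first_appearances(words_by_year):
    """For each year, in chronological order, list the words
    that did not occur in any earlier year."""
    seen = set()
    new_words = {}
    for year in sorted(words_by_year):
        current = set(words_by_year[year])
        new_words[year] = sorted(current - seen)
        seen |= current  # remember everything seen up to and including this year
    return new_words


# Made-up sample data standing in for the 'word' columns of the CSV files.
sample = {
    "2008": ["war", "bank", "china"],
    "2009": ["bank", "china", "flu"],
    "2010": ["china", "oil"],
}

for year, words in first_appearances(sample).items():
    print(year, words)
```

Using a set for the seen words also avoids the repeated "not in list" scans, which get slower as the lists grow.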

Extracting common elements from multiple lists
