Compare the first column of two CSV files with Python and print the matches

Asked: 2014-12-01 21:56:24

Tags: python csv match nltk

I have two CSV files, each containing ngrams that look like this:

drinks while strutting,4,1.435486010883783160220299732E-8
and since that,6,4.306458032651349480660899195E-8
the state face,3,2.153229016325674740330449597E-8

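Each line is a three-word phrase, followed by a frequency count, followed by a relative frequency.

I want to write a script that finds the ngrams that appear in both CSV files, divides their relative frequencies, and prints the results to a new CSV file. A match should be found whenever a three-word phrase in one file is the same as a three-word phrase in the other file. The script should then divide the phrase's relative frequency in the first CSV file by the relative frequency of the same phrase in the second CSV file, and write the phrase and that quotient to a new CSV file.

Below is as far as I've gotten. My script compares lines, but it only finds a match when the entire line (including the frequency and relative frequency) is exactly the same. I realize that's because I'm taking the intersection of two sets of complete rows, but I don't know how to do it differently. Please forgive me; I'm new to coding. Any help that gets me a little closer would be greatly appreciated.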

import csv
import io 

alist, blist = [], []

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

first_set = set(map(tuple, alist))
secnd_set = set(map(tuple, blist))

matches = set(first_set).intersection(secnd_set)

c = csv.writer(open("matchedngrams.csv", "a"))
c.writerow(matches)

print matches
print len(matches)

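4 answers:

Answer 0 (score: 1)

I haven't dumped res into a new file (too tedious). The idea is that the first element of each row is the phrase and the other two are the frequencies. Use a dict instead of a set so that matching and mapping happen together.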

import csv
import io 

alist, blist = [], []

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

f_dict = {e[0]:e[1:] for e in alist}
s_dict = {e[0]:e[1:] for e in blist}

res = {}
for k,v in f_dict.items():
    if k in s_dict:
        res[k] = float(v[1])/float(s_dict[k][1])

print(res)
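
Writing res out takes only a few more lines. Here is a minimal sketch, assuming Python 3 (so the file is opened in text mode with newline=""), a small hand-made res dict standing in for the real one, and the output name matchedngrams.csv from the question placed in the system temp directory:

```python
import csv
import os
import tempfile

# Hypothetical sample of what `res` holds after the loop above:
# phrase -> relative frequency in file A divided by that in file B.
res = {
    "drinks while strutting": 2.0,
    "and since that": 0.5,
}

# Assumption: write into the temp directory; use any path you like.
out_path = os.path.join(tempfile.gettempdir(), "matchedngrams.csv")

# Python 3: open in text mode with newline="" so csv controls line endings.
with open(out_path, "w", newline="") as out_file:
    writer = csv.writer(out_file)
    for phrase, ratio in res.items():
        writer.writerow([phrase, ratio])
```

Each output row is the phrase followed by the quotient, which is the layout the question asks for.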

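Answer 1 (score: 1)

"My script is comparing lines, but it only finds a match when the entire line (including the frequency and relative frequency) is exactly the same. I realize that's because I'm finding the intersection of the two complete sets, but I don't know how to do it differently."

That's exactly what dictionaries are for: when you have a separate key and value (or when only part of the value is the key). So: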

a_dict = {row[0]: row for row in alist}
b_dict = {row[0]: row for row in blist}

Now, you can't use set methods directly on a dictionary. Python 3 offers some help here, but you're using 2.7, so you have to write it out explicitly:

matches = {key for key in a_dict if key in b_dict}

Or:

matches = set(a_dict) & set(b_dict)
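
For reference, the Python 3 help mentioned above is that dict.keys() returns a set-like view, so the intersection can be taken without building sets first. A small sketch with made-up rows (it assumes Python 3 and will not work as written on 2.7, where keys() returns a list):

```python
# Made-up sample rows in the question's format: phrase, count, relative frequency.
alist = [["and since that", "6", "4.3e-8"], ["the state face", "3", "2.1e-8"]]
blist = [["and since that", "3", "2.1e-8"], ["some silly ngram", "99", "1.2e-9"]]

a_dict = {row[0]: row for row in alist}
b_dict = {row[0]: row for row in blist}

# In Python 3, keys() is a set-like view, so & yields the common phrases as a set.
matches = a_dict.keys() & b_dict.keys()
print(matches)
```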

But you don't really need the set; all you want to do here is iterate over the matching keys. So:

for key in a_dict:
    if key in b_dict:
        a_values = a_dict[key]
        b_values = b_dict[key]
        do_stuff_with(a_values[2], b_values[2])

As a side note, you don't actually need to build the lists first just to turn them into sets or dicts. Build the sets or dicts directly:

a_set = set()
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_set.add(tuple(row))

a_dict = {}
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_dict[row[0]] = row

Also, if you know about comprehensions, all three versions are crying out to be converted:

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    # Now any one of these (not all three)
    a_list = list(reader)
    a_set = {tuple(row) for row in reader}
    a_dict = {row[0]: row for row in reader}

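Answer 2 (score: 1)

You can store the relative frequencies from the first file in a dictionary, then iterate over the second file and, whenever its first column matches something from the first file, write the result straight to the output file: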

import csv

tmp = {}

# if 1 file is much larger than the other, load the smaller one here
# make sure it will fit into the memory
with open("ngrams.csv", "rb") as fr:
    # using tuple unpacking to extract fixed number of columns from each row
    for txt, abs, rel in csv.reader(fr):
        # converting strings like "1.435486010883783160220299732E-8"
        # to float numbers
        tmp[txt] = float(rel)

with open("matchedngrams.csv", "wb") as fw:
    writer = csv.writer(fw)

    # the 2nd input file will be processed per 1 line to save memory
    # the order of items from this file will be preserved
    with open("ngramstest.csv", "rb") as fr:
        for txt, abs, rel in csv.reader(fr):
            if txt in tmp:
                # not sure what you want to do with absolute, I use 0 here:
                writer.writerow((txt, 0, tmp[txt] / float(rel)))
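
The same approach on Python 3 needs text-mode files, because csv.reader there expects strings rather than bytes. The sketch below is one possible port under a few assumptions of mine: the function name divide_matching is made up, the count column is simply left out of the output, and the demo rows are invented and written to a temporary directory.

```python
import csv
import os
import tempfile


def divide_matching(path_a, path_b, out_path):
    """Write (phrase, rel_a / rel_b) for every phrase found in both files."""
    rel_a = {}
    with open(path_a, newline="") as fr:
        for phrase, _count, rel in csv.reader(fr):
            rel_a[phrase] = float(rel)

    with open(out_path, "w", newline="") as fw, open(path_b, newline="") as fr:
        writer = csv.writer(fw)
        # Stream file B row by row; only file A is held in memory.
        for phrase, _count, rel in csv.reader(fr):
            if phrase in rel_a:
                writer.writerow((phrase, rel_a[phrase] / float(rel)))


# Tiny demo with made-up rows.
workdir = tempfile.mkdtemp()
path_a = os.path.join(workdir, "ngrams.csv")
path_b = os.path.join(workdir, "ngramstest.csv")
out_path = os.path.join(workdir, "matchedngrams.csv")

with open(path_a, "w", newline="") as f:
    f.write("and since that,6,4.0e-8\nthe state face,3,2.0e-8\n")
with open(path_b, "w", newline="") as f:
    f.write("the state face,1,1.0e-8\nsome silly ngram,99,1.0e-9\n")

divide_matching(path_a, path_b, out_path)

with open(out_path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

Only "the state face" appears in both demo files, so the output holds a single row for it.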

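Answer 3 (score: 0)

Avoid storing tiny numbers as they are; they can run into underflow problems (see What are arithmetic underflow and overflow in C?), and dividing one tiny number by another gives you even more underflow problems. So preprocess your relative frequencies like this: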

>>> import math
>>> num = 1.435486010883783160220299732E-8
>>> logged = math.log(num)
>>> logged
-18.0591772685384
>>> math.exp(logged)
1.4354860108837844e-08

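Now read the CSV files. Since you are only manipulating the relative frequencies, the second column doesn't matter, so let's skip it and store the first column (the phrase) as the key and the third column (the relative frequency) as the value: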

import csv, math

# Writes a dummy csv file as example.
textfile = """drinks while strutting, 4, 1.435486010883783160220299732E-8
and since that, 6, 4.306458032651349480660899195E-8
the state face, 3, 2.153229016325674740330449597E-8"""

textfile2 = """and since that, 3, 2.1532290163256747e-08
the state face, 1, 7.1774300544189156e-09
drinks while strutting, 2, 7.1774300544189156e-09
some silly ngram, 99, 1.235492312e-09"""

with open('ngrams-1.csv', 'w') as fout:
    for line in textfile.split('\n'):
        fout.write(line + '\n')

with open('ngrams-2.csv', 'w') as fout:
    for line in textfile2.split('\n'):
        fout.write(line + '\n')


# Read and save the two files into a dict structure

ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'

ngramdict1 = {}
ngramdict2 = {}

with open(ngramfile1, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict1[phrase] = math.log(float(rel))

with open(ngramfile2, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict2[phrase] = math.log(float(rel))

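Now for the tricky part: you want to divide the relative frequency of a phrase in ngramdict2 by the relative frequency of the same phrase in ngramdict1, i.e.: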

if phrase_from_ngramdict1 == phrase_from_ngramdict2:
  relfreq = relfreq_from_ngramdict2 / relfreq_from_ngramdict1

Since we kept the relative frequencies in log units, we don't have to divide; we simply subtract, i.e.:

if phrase_from_ngramdict1 == phrase_from_ngramdict2:
  logrelfreq = logrelfreq_from_ngramdict2 - logrelfreq_from_ngramdict1
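
To see that the subtraction really is the same operation, here is a quick check using the two relative frequencies of "and since that" from the sample data. Taking math.exp of the difference recovers the plain ratio. The variable names are mine, and Python 3 is assumed for print().

```python
import math

rel1 = 4.306458032651349480660899195E-8  # "and since that" in the first file
rel2 = 2.1532290163256747e-08            # "and since that" in the second file

log_ratio = math.log(rel2) - math.log(rel1)  # subtraction in log space
ratio = rel2 / rel1                          # plain division, for comparison

print(log_ratio)            # close to math.log(0.5)
print(math.exp(log_ratio))  # close to 0.5, the same as `ratio`
```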

To get the phrases that occur in both files, you don't need to check them one by one; just convert each dictionary.keys() to a set and then use set1.intersection(set2), see https://docs.python.org/2/tutorial/datastructures.html

phrases1 = set(ngramdict1.keys())
phrases2 = set(ngramdict2.keys())
overlap_phrases = phrases1.intersection(phrases2)
print overlap_phrases

[OUT]:

set(['drinks while strutting', 'the state face', 'and since that'])

Now let's write it out with the relative frequencies:

with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        relfreq1 = ngramdict1[p]
        relfreq2 = ngramdict2[p]
        combined_relfreq = relfreq2 - relfreq1
        fout.write(",".join([p, str(combined_relfreq)])+ '\n')

ngramcombined.csv looks like this:

drinks while strutting,-0.69314718056
the state face,-1.09861228867
and since that,-0.69314718056

Here's the full code:

import csv, math

# Writes a dummy csv file as example.
textfile = """drinks while strutting, 4, 1.435486010883783160220299732E-8
and since that, 6, 4.306458032651349480660899195E-8
the state face, 3, 2.153229016325674740330449597E-8"""

textfile2 = """and since that, 3, 2.1532290163256747e-08
the state face, 1, 7.1774300544189156e-09
drinks while strutting, 2, 7.1774300544189156e-09
some silly ngram, 99, 1.235492312e-09"""

with open('ngrams-1.csv', 'w') as fout:
    for line in textfile.split('\n'):
        fout.write(line + '\n')

with open('ngrams-2.csv', 'w') as fout:
    for line in textfile2.split('\n'):
        fout.write(line + '\n')


# Read and save the two files into a dict structure

ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'

ngramdict1 = {}
ngramdict2 = {}

with open(ngramfile1, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict1[phrase] = math.log(float(rel))

with open(ngramfile2, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict2[phrase] = math.log(float(rel))


# Find the intersecting phrases.
phrases1 = set(ngramdict1.keys())
phrases2 = set(ngramdict2.keys())
overlap_phrases = phrases1.intersection(phrases2)

# Output to new file.
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        relfreq1 = ngramdict1[p]
        relfreq2 = ngramdict2[p]
        combined_relfreq = relfreq2 - relfreq1
        fout.write(",".join([p, str(combined_relfreq)])+ '\n')

If you like SUPER UNREADABLE but short code (in number of lines):

import csv, math
# Read and save the two files into a dict structure
ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'

ngramdict1 = {row[0]:math.log(float(row[2])) for row in csv.reader(open(ngramfile1, 'r'), delimiter=',')}
ngramdict2 = {row[0]:math.log(float(row[2])) for row in csv.reader(open(ngramfile2, 'r'), delimiter=',')}

# Find the intersecting phrases.
overlap_phrases = set(ngramdict1.keys()).intersection(set(ngramdict2.keys()))

# Output to new file.
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        fout.write(",".join([p, str(ngramdict2[p] - ngramdict1[p])])+ '\n')