Question

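I have a list of 12,000 dictionary entries (words only, without their definitions) stored in a .txt file.

I have a complete dictionary with 62,000 entries (words with their definitions) stored in a .csv file.

I need to compare the small list in the .txt file with the larger list in the .csv file and delete the rows whose entries do not appear in the smaller list. In other words, I want to purge this dictionary down to only 12,000 entries.

The .txt file is laid out with one word per line, like this: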

WORD1

WORD2

WORD3

The .csv file is laid out like this:

ID (column 1) WORD (column 2) MEANING (column 3)

How do I accomplish this using Python?

Answer 1

The following won't scale well, but it should work for the number of records indicated.

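import csv

# path_to_file: the full dictionary CSV; path_to_file2: the filtered output;
# path_to_file3: the word list (placeholders, set them to your own paths)
with open(path_to_file3, 'r') as words_file:
    # strip the trailing newline from each word, otherwise nothing matches
    lookup = set(word.strip() for word in words_file)

with open(path_to_file, 'r', newline='') as f_in, \
        open(path_to_file2, 'w', newline='') as f_out:
    csv_in = csv.reader(f_in)
    csv_out = csv.writer(f_out)
    for line in csv_in:
        # the word is in column 2 (index 1); index 0 is the ID
        if line[1] in lookup:
            csv_out.writerow(line)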

Answer 2

Good answers so far. If you want to go minimalist...

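import csv

lookup = set(l.strip().lower() for l in open(path_to_file3))
with open(path_to_file, newline='') as f_in, \
        open(path_to_file2, 'w', newline='') as f_out:
    # writerows consumes the generator; a bare map() is lazy in Python 3
    # and would write nothing
    csv.writer(f_out).writerows(
        row for row in csv.reader(f_in) if row[1].lower() in lookup)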
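
For a quick check of the matching logic without touching any files, here is a minimal sketch that runs the same case-insensitive filter on in-memory data. The helper name `filter_rows` and the sample rows are made up for illustration; only the column layout (ID, WORD, MEANING) comes from the question.

```python
import csv
import io


def filter_rows(rows, words):
    """Yield only the rows whose word (column 2) appears in `words`.

    Matching ignores case and surrounding whitespace. `filter_rows` is a
    hypothetical helper, not part of the csv module.
    """
    lookup = {w.strip().lower() for w in words if w.strip()}
    for row in rows:
        # skip blank or short rows instead of raising IndexError
        if len(row) > 1 and row[1].strip().lower() in lookup:
            yield row


# Sample data invented for the demonstration
sample_csv = io.StringIO(
    "1,apple,a fruit\n"
    "2,Banana,another fruit\n"
    "3,cherry,\"a small, round fruit\"\n"
)
sample_words = ["banana\n", "CHERRY\n", "\n"]

kept = list(filter_rows(csv.reader(sample_csv), sample_words))
for row in kept:
    print(row)
```

Because the rows come from `csv.reader`, a comma inside a quoted definition stays within its own field, which a plain `line.split(',')` would not guarantee.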

Answer 3

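One of the least-known facts about current computers is that when you delete a line from a text file and save it, most of the time the editor does this:

load the file into memory
write a temporary file with the lines you want to keep
close the files and move the temporary file over the original

So you have to load your word list: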

with open('wordlist.txt') as i:
    wordlist = set(word.strip() for word in i)  #  you said the file was small

Then open the input file:

import csv
import os

with open('input.csv', newline='') as i:
    with open('output.csv', 'w', newline='') as o:
        output = csv.writer(o)
        for line in csv.reader(i):  # iterate over the CSV line by line
            if line[1] in wordlist:  # keep the row only if the word (column 2) is in the list
                output.writerow(line)

# move the temporary file over the original
os.replace('output.csv', 'input.csv')

This is untested, so now go do your homework and comment here if you find any bugs... :-)
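
The load, write-to-a-temporary-file, then move-over-the-original sequence described in this answer can also be sketched as a single function. This is only a sketch under assumptions: the name `filter_csv_in_place` and its parameters are hypothetical, and it assumes the word sits in column 2 as in the question. It uses `tempfile.NamedTemporaryFile` in the same directory as the target, then `os.replace`, which overwrites the destination if it already exists.

```python
import csv
import os
import tempfile


def filter_csv_in_place(csv_path, wordlist):
    """Rewrite csv_path, keeping only rows whose column-2 word is in wordlist.

    Hypothetical helper: writes to a temporary file next to the original,
    then moves it over the original.
    """
    directory = os.path.dirname(os.path.abspath(csv_path))
    with open(csv_path, newline='') as src, tempfile.NamedTemporaryFile(
            'w', newline='', dir=directory, suffix='.tmp', delete=False) as tmp:
        writer = csv.writer(tmp)
        for row in csv.reader(src):
            # skip blank or short rows instead of raising IndexError
            if len(row) > 1 and row[1] in wordlist:
                writer.writerow(row)
    # Both files are closed here; replace the original with the filtered copy
    os.replace(tmp.name, csv_path)
```

Calling `filter_csv_in_place('input.csv', wordlist)` with the `wordlist` set built earlier would shrink the file in place. The replacement is destructive, so keep a backup of the original until you have checked the result.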

Answer 4

I would use pandas for this. The data set isn't large, so you can do it in memory without a problem.

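import pandas as pd

# words.txt has one word per line and no header row
words = pd.read_csv('words.txt', header=None, names=['WORD'])
# assumes defs.csv has a header row naming the columns ID, WORD, MEANING
defs = pd.read_csv('defs.csv')
words.set_index('WORD', inplace=True)
defs.set_index('WORD', inplace=True)
# inner join: keep only the words present in both files
new_defs = words.join(defs, how='inner')
new_defs.to_csv('new_defs.csv')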

You may need to manipulate new_defs to make it look the way you want, but that's the gist of it.
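
If the goal is to keep the CSV layout unchanged, a boolean mask with `Series.isin` may be simpler than a join. A minimal sketch, assuming the CSV has a header row with a `WORD` column (the question does not say whether it does) and using invented sample data in place of the real files:

```python
import io

import pandas as pd

# Invented sample data standing in for words.txt and defs.csv
words_txt = io.StringIO("banana\ncherry\n")
defs_csv = io.StringIO(
    "ID,WORD,MEANING\n"
    "1,apple,a fruit\n"
    "2,banana,another fruit\n"
    "3,cherry,a small red fruit\n"
)

# keep_default_na=False stops pandas from reading words such as
# "null" or "NA" as missing values
words = pd.read_csv(words_txt, header=None, names=['WORD'],
                    keep_default_na=False)
defs = pd.read_csv(defs_csv, keep_default_na=False)

# Keep only the rows whose WORD appears in the word list
new_defs = defs[defs['WORD'].isin(words['WORD'])]
print(new_defs.to_csv(index=False))
```

The mask keeps the original column order (ID, WORD, MEANING), so the output needs no reshaping afterwards. With real files, pass the file paths to `pd.read_csv` instead of the `StringIO` objects.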
