Question

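I need help organizing text. I have a list of thousands of vocabulary words in a csv. Each word has a term, a definition, and a sample sentence. The term and definition are separated by a tab, and the sample sentence is set apart by blank lines.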

For example:

exacerbate  worsen

This attack will exacerbate the already tense relations between the two communities

exasperate  irritate, vex

he often exasperates his mother with pranks

execrable   very bad, abominable, utterly detestable

an execrable performance

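I want to reorganize this so that the sample sentence is wrapped in double quotes, has no blank lines before or after it, and the term inside the sentence is replaced with a hyphen. All of this should happen while keeping the tab after the term, a new line at the start of each term, and only a single space between the definition and the example sentence. I need this format to import the list into a flashcards web application.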

Desired result using the example above:

exacerbate  worsen "This attack will – the already tense relations between the two communities"
exasperate  irritate, vex "he often – his mother with pranks"
execrable   very bad, abominable, utterly detestable "an – performance"

I'm using a Mac. I know basic command line (including regex) and python, but not enough to figure this out on my own. If you can help, I'd be very grateful.

Answer 1

Open a terminal in the directory where you have the input file. Save the following code in a .py file:

import sys
import string
import difflib


with open(sys.argv[1]) as fobj:
    # entries and example sentences are separated by blank lines;
    # drop empty chunks left over from leading or trailing blank lines
    lines = [chunk for chunk in fobj.read().split('\n\n') if chunk.strip()]

with open(sys.argv[2], 'w') as out:
    # step through (term/definition, example) pairs
    for i in range(0, len(lines) - 1, 2):
        line1, example = lines[i:i + 2]
        words = [w.strip(string.punctuation).lower()
                 for w in example.split()]

        # if the target word is not in the example sentence,
        # we will find the most similar one
        target = line1.split('\t')[0]
        if target in words:
            most_similar = target
        else:
            # get_close_matches returns an empty list when nothing is similar
            matches = difflib.get_close_matches(target, words, 1)
            most_similar = matches[0] if matches else target
        new_example = example.replace(most_similar, '-')
        out.write('{} "{}"\n'.format(line1.strip(), new_example.strip()))

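The program expects the input file name and the output file name as command-line arguments. That is, run the following command from the terminal: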

$ python program.py input.txt output.txt

where program.py is the program above, input.txt is your input file, and output.txt is the file that will be created in the format you need.

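I ran the program against the example you provided. I added the tabs manually, because the question only contains spaces. This is the output the program produced: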

exacerbate  worsen "This attack will - the already tense relations between the two communities"
exasperate  irritate, vex "he often - his mother with pranks"
execrable   very bad, abominable, utterly detestable "an - performance"

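Note that in the second example the program correctly replaced exasperates with a dash, even though the term is exasperate. Without having your file, I cannot guarantee that this technique will work for every word in it.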
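To illustrate the fallback, here is a minimal sketch of how difflib.get_close_matches picks the inflected form when the exact term is not in the sentence. The helper name find_word_to_blank is hypothetical and exists only for this illustration.

```python
import difflib
import string


def find_word_to_blank(term, sentence):
    """Return the word in `sentence` that most resembles `term`, or None."""
    # Normalise the words the same way as the script above:
    # strip surrounding punctuation and lowercase.
    words = [w.strip(string.punctuation).lower() for w in sentence.split()]
    if term in words:
        return term
    # get_close_matches returns an empty list when no candidate
    # clears its default similarity cutoff of 0.6.
    matches = difflib.get_close_matches(term, words, n=1)
    return matches[0] if matches else None


print(find_word_to_blank("exasperate", "he often exasperates his mother with pranks"))
print(find_word_to_blank("execrable", "an execrable performance"))
print(find_word_to_blank("execrable", "nothing similar here"))
```

Returning None when nothing is similar enough makes it easy to log the entries that need manual attention instead of crashing halfway through a large file.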

Answer 2

Not necessarily bulletproof, but this script will do the job based on your example:

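#include <stdio.h>

int main(int argc, char *argv[]) {
    int offset;
    char line[1000];
    char term_in[1000];
    FILE *fp;

    /* bail out if no file was given or it cannot be opened */
    if (argc < 2 || (fp = fopen(argv[1], "r")) == NULL) {
        return 1;
    }

    while (fgets(line, sizeof(line), fp) != NULL) {
        char *data = line;
        /* scan from data, not line, so each pass consumes the next token */
        while (sscanf(data, " %999s%n", term_in, &offset) == 1) {
            data += offset;
            printf("%s", term_in);
        }
    }
    fclose(fp);
    return 0;
}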

Python script:

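import sys
import re

input_file = sys.argv[1]

# non-blank lines alternate between "term<TAB>definition" and the example sentence
is_definition = True

current_entry = ""
current_definition = ""

with open(input_file, 'r') as infile:
    for line in infile:
        line = line.strip()

        if line != "":
            if is_definition:
                is_definition = False

                # split on the first tab only, in case the definition contains tabs
                current_entry, current_definition = line.split("\t", 1)

            else:
                is_definition = True

                # escape the term so it is matched literally, then blank out
                # the term together with any word characters that follow it
                blanked = re.sub(re.escape(current_entry) + r'\w*', "-", line)

                print(current_entry + "\t" + current_definition + ' "' + blanked + '"')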

The problem with our current approach is that it won't work for irregular verbs, such as "go - went", "bring - brought", or "seek - sought".
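One possible way around the irregular-verb limitation is a small hand-maintained lookup table. The sketch below is only an illustration under that assumption: the IRREGULAR_FORMS table and the blank_term helper are hypothetical names, and the table would have to be filled in for the words that actually occur in the file.

```python
import re

# Hypothetical, hand-maintained table of irregular forms; extend as needed.
IRREGULAR_FORMS = {
    "go": ["went", "gone"],
    "bring": ["brought"],
    "seek": ["sought"],
}


def blank_term(term, sentence):
    """Replace the term, a suffixed form of it, or a listed irregular form with '-'."""
    # The term followed by any word characters covers regular inflections.
    alternatives = [re.escape(term) + r"\w*"]
    # Irregular forms are matched only when they are listed for this term.
    alternatives += [re.escape(form) for form in IRREGULAR_FORMS.get(term, [])]
    pattern = r"\b(?:" + "|".join(alternatives) + r")\b"
    return re.sub(pattern, "-", sentence, flags=re.IGNORECASE)


print(blank_term("go", "she went home"))
print(blank_term("seek", "they sought shelter"))
print(blank_term("exasperate", "he often exasperates his mother with pranks"))
```

Keep in mind that matching the term plus any trailing word characters is deliberately loose: for a short term such as "go" it would also blank an unrelated word like "government", so very short terms may need a stricter suffix list.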

Answer 3

Try:

import sys

suffixList = ["s", "ed", "es", "ing"]  # et cetera

# Assumption: the input and output paths are passed on the command line.
with open(sys.argv[1]) as vocab:
    # str.split returns a new list, so the result has to be kept
    lines = vocab.read().split("\n")

# Each entry spans four lines: term/definition, blank, sentence, blank
vocab_words = lines[0::4]
vocab_defs = ["\"" + sentence + "\"" for sentence in lines[2::4]]

newFileText = ""
for count in range(len(vocab_defs)):
    # the term is the first whitespace-delimited token on its line
    term = vocab_words[count].split()[0]
    vocab_defs[count] = vocab_defs[count].replace(term, "-")
    # collapse a leftover suffix such as "-s" or "-ing" into the hyphen
    for i in suffixList:
        vocab_defs[count] = vocab_defs[count].replace("-%s" % i, "-")
    newFileText += vocab_words[count]
    newFileText += "  "
    newFileText += vocab_defs[count]
    newFileText += "\n"

with open(sys.argv[2], "w") as new_vocab_file:
    new_vocab_file.write(newFileText)

Output:

============== RESTART: /Users/chervjay/Documents/thingy.py ==============
exacerbate  worsen  "This attack will - the already tense relations between the two communities"
exasperate  irritate, vex  "he often - his mother with pranks"
execrable   very bad, abominable, utterly detestable  "an - performance"

>>>

Answer 4

#!/usr/local/bin/python3

import re

with open('yourFile.csv', 'r') as myfile:
    data = myfile.read()    

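# (?:s|ed|es|ing)? matches one optional suffix; square brackets would define a character class instead
print(re.sub(r'(^[A-Za-z]+)\t(.+)\n\n(.+)\1(?:s|ed|es|ing)?(.+)$', r'\1\t\2 "\3-\4"', data, flags=re.MULTILINE))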

Output:

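exacerbate  worsen "This attack will - the already tense relations between the two communities"
exasperate  irritate, vex "he often - his mother with pranks"
execrable   very bad, abominable, utterly detestable "an - performance"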
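A note on matching suffixes with a regex: a character class such as [s|ed|es|ing]* and a group such as (?:s|ed|es|ing)? behave quite differently. The class matches any run of the single characters s, |, e, d, i, n, g, while the group matches exactly one of the listed suffixes. The sketch below compares the two; the helper names are hypothetical, and the word boundaries in the group version are an addition not found in the answer's pattern.

```python
import re


def blank_with_class(term, sentence):
    # Character class: any run of the characters s | e d i n g after the term.
    pattern = re.escape(term) + r"[s|ed|es|ing]*"
    return re.sub(pattern, "-", sentence)


def blank_with_group(term, sentence):
    # Non-capturing group: at most one of the listed suffixes,
    # and the match must cover a whole word.
    pattern = r"\b" + re.escape(term) + r"(?:s|ed|es|ing)?\b"
    return re.sub(pattern, "-", sentence)


sentence = "he sees the seeding"
print(blank_with_class("see", sentence))
print(blank_with_group("see", sentence))
```

With the character class, the unrelated word "seeding" is blanked as well, because d, i, n and g all belong to the class; the group version with word boundaries leaves it alone.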
