Question

I have a program that joins word fragments separated by an asterisk. It removes the asterisk and joins the first part of the word (the one before the asterisk) with its second part (the one after the asterisk). It runs well except for one problem: the second part (after the asterisk) still appears on its own in the output. For example, the program joins ['presi', '*', 'dent'], but 'dent' is still in the output. I can't figure out where the problem in my code is. The code is below:

from collections import defaultdict
import nltk
from nltk.tokenize import word_tokenize
import re
import os
import sys
from pathlib import Path


def main():
    while True:
        try:
            file_to_open =Path(input("\nPlease, insert your file path: "))

            with open(file_to_open) as f:
                words = word_tokenize(f.read().lower())
                break
        except FileNotFoundError:
            print("\nFile not found. Better try again")
        except IsADirectoryError:
            print("\nIncorrect Directory path.Try again")

    word_separator = '*'

    with open ('Fr-dictionary2.txt') as fr:
            dic = word_tokenize(fr.read().lower())

    def join_asterisk(ary):

        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            if w2 == word_separator:
                word = w1 + w3
                yield (word, word in dic)
            elif w1 != word_separator and w1 in dic:
                yield (w1, True)


    correct_words = []
    incorrect_words = []
    correct_words = [w for w, correct in join_asterisk(words) if correct]
    incorrect_words = [w for w, correct in join_asterisk(words) if not correct]
    text=' '.join(correct_words)
    print(correct_words)
    print('\n\n', text)
    user2=input('\nWrite text to a file? Type "Y" for yes or "N" for no:')

    text_name=input("name your file.(Ex. 'my_first_file.txt'): ")
    out_file=open(text_name,"w")

    if user2 =='Y':
        out_file.write(text)
        out_file.close()
    else:
        print('ok')


main()

I wonder if anyone could help me to detect the error here?

Input example:

Les engage * ments du prési * dent de la Républi * que sont aussi ceux des dirigeants de la société » ferroviaire, a-t-il soutenu de vant des élus du Grand-Est réunis à l’Elysée.

Le président de la République, Emmanuel Macron (à droite), aux cô * tés du patron de la SNCF, Guillaume Pepy, à la gare Montparnasse, à Paris, le 1er juillet 2017. GEOFFROY VAN DER HASSELT / AFP

L’irrita tion qui, par fois, s’empare des usa * gers de la SNCF face aux trains suppri * més ou aux dessertes abandonnées semble avoir aussi saisi le président de la République. Devant des élus du Grand-Est, réunis mardi 26 février à l’Elysée dans le cadre du grand débat, Emmanuel Macron a eu des mots très durs contre la SNCF, qui a fermé la ligne Saint-Dié - Epinal le 23 décembre 2018, alors que le chef de l’Etat s’était engagé, durant un dépla * cement dans les Vosges effec * tué en avril 2018, à ce qu’elle reste opération * nelle.

Example of my current output is:

['les', 'engagements', 'du', 'président', 'dent', 'de', 'la', 'république', 'que', 'sont', 'aussi', 'ceux', 'des', 'dirigeants', 'de', 'la', 'société', 'ferroviaire']

Example of my desired output is:

['les', 'engagements', 'du', 'président', 'de', 'la', 'république', 'sont', 'aussi', 'ceux', 'des', 'dirigeants', 'de', 'la', 'société', 'ferroviaire']

Answer 1

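Both of those extra words are (I'm assuming) also in your dictionary, so they get yielded again two iterations of the for loop later, because they match this case once they become w1: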

            elif w1 != word_separator and w1 in dic:
                yield (w1, True)

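Redesigning your join_asterisk function seems like the best way to fix this, since any attempt to patch the existing function so that it skips these words would be very hacky.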
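
To make the diagnosis concrete, here is a minimal sketch that replays the original sliding-window loop on a short token list. The `words` list and the `dic` set are small hypothetical stand-ins for the tokenized input file and for Fr-dictionary2.txt, chosen so that 'dent' happens to be a dictionary word, as assumed above.

```python
# Hypothetical stand-ins for the tokenized text and the dictionary file.
words = ['du', 'prési', '*', 'dent', 'de', 'la', 'république']
dic = {'du', 'président', 'dent', 'de', 'la', 'république'}
word_separator = '*'

yielded = []
# Same three-token window as the original join_asterisk.
for w1, w2, w3 in zip(words, words[1:], words[2:]):
    if w2 == word_separator:
        # Window ('prési', '*', 'dent') produces the joined word.
        yielded.append(w1 + w3)
    elif w1 != word_separator and w1 in dic:
        # Two windows later 'dent' is w1, and it is in dic, so it is
        # produced a second time on its own.
        yielded.append(w1)

print(yielded)
```

With these inputs the loop yields 'président' and then 'dent' again. Note also that zip stops at the shortest of its arguments, so the last two tokens never appear as w1 and are dropped from the result.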

Here is one way to redesign the function so that it skips words that have already been used as the second half of a word split by '*':

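incorrect_words = []
def join_asterisk(array):
    # word_separator and dic come from the enclosing scope
    # Pad so that ary[i+1] and ary[i+2] always exist
    ary = array + ['', '']
    i, size = 0, len(ary)
    while i < size - 2:
        if ary[i+1] == word_separator:
            # Join the two halves around the separator
            if ary[i] + ary[i+2] in dic:
                yield ary[i] + ary[i+2]
            else:
                incorrect_words.append(ary[i] + ary[i+2])
            # Skip the separator and the second half
            i+=2
        elif ary[i] in dic:
            yield ary[i]
        i+=1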

If you want it to stay closer to the original function, it could be modified like this:

def join_asterisk(array):
    # Pad so that ary[i+1] and ary[i+2] always exist
    ary = array + ['', '']
    i, size = 0, len(ary)
    while i < size - 2:
        if ary[i+1] == word_separator:
            concat_word = ary[i] + ary[i+2]
            yield (concat_word, concat_word in dic)
            # Skip the separator and the second half
            i+=2
        else:
            yield (ary[i], ary[i] in dic)
        i+=1
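
As a usage sketch, the tuple-yielding version can be run once and its results split into the two lists the question builds. Passing `word_separator` and `dic` as parameters, and storing the dictionary in a set, are choices made here for a self-contained example rather than part of the answer; the sample tokens and dictionary contents are hypothetical.

```python
def join_asterisk(array, word_separator, dic):
    # Same logic as the tuple-yielding version, with explicit parameters.
    ary = array + ['', '']  # padding so ary[i+1] and ary[i+2] always exist
    i, size = 0, len(ary)
    while i < size - 2:
        if ary[i + 1] == word_separator:
            concat_word = ary[i] + ary[i + 2]
            yield (concat_word, concat_word in dic)
            i += 2  # skip the separator and the second half
        else:
            yield (ary[i], ary[i] in dic)
        i += 1


# Hypothetical sample data; 'xyz' is deliberately not in the dictionary.
words = ['du', 'prési', '*', 'dent', 'de', 'la',
         'républi', '*', 'que', 'sont', 'xyz']
dic = {'du', 'président', 'dent', 'de', 'la', 'république', 'que', 'sont'}

correct_words, incorrect_words = [], []
# One pass over the generator fills both lists.
for w, correct in join_asterisk(words, '*', dic):
    (correct_words if correct else incorrect_words).append(w)

print(correct_words)
print(incorrect_words)
```

Splitting in a single loop avoids running the generator twice, which is what the two list comprehensions in the question do.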

Answer 2

I think this alternative implementation of join_asterisk does what you intend:

def join_asterisk(words, word_separator):
    if not words:
        return
    # Whether the previous word was a separator
    prev_sep = (words[0] == word_separator)
    # Next word to yield
    current = words[0] if not prev_sep else ''
    # Iterate words
    for word in words[1:]:
        # Skip separator
        if word == word_separator:
            prev_sep = True
        else:
            # If neither this or the previous were separators
            if not prev_sep:
                # Yield current word and clear
                yield current
                current = ''
            # Add word to current
            current += word
            prev_sep = False
    # Yield last word if list did not finish with a separator
    if not prev_sep:
        yield current

words = ['les', 'engagements', 'du', 'prési', '*', 'dent', 'de', 'la', 'républi', '*', 'que', 'sont', 'aussi', 'ceux', 'des', 'dirigeants', 'de', 'la', 'société', 'ferroviaire']
word_separator = '*'
print(list(join_asterisk(words, word_separator)))
# ['les', 'engagements', 'du', 'président', 'de', 'la', 'république', 'sont', 'aussi', 'ceux', 'des', 'dirigeants', 'de', 'la', 'société', 'ferroviaire']
