Question

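I want to write a Python program that reads a txt file as user input. Then I want the program to split the text into groups of words, as in the example below:

At the time of his accession, the Swedish Riksdag held more power than the monarchy but was bitterly divided between rival parties.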

At the time
the time of
time of his
of his accession

I want the program to save these in another file. Any ideas?

Answer 1

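You didn't specify the format in which the text should be saved in the other file. Assuming you want one group of three words per line, this will do: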

def only_letters(word):
    # Keep ASCII letters only, which strips punctuation such as "," and "."
    return ''.join(c for c in word if 'a' <= c <= 'z' or 'A' <= c <= 'Z')

with open('input.txt') as f, open('output.txt', 'w') as w:
    s = f.read()
    words = [only_letters(word) for word in s.split()]
    # Slide a window of three words over the text, one word at a time
    triplets = [words[i:i + 3] for i in range(len(words) - 2)]
    for triplet in triplets:
        w.write(' '.join(triplet) + '\n')
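
The same sliding-window idea can be written for any group size. This is only a minimal sketch: `ngrams` and `tokenize` are hypothetical helper names, and the tokenizer assumes a word is any run of ASCII letters, so it splits "don't" into two words where `only_letters` would produce "dont".

```python
import re

def tokenize(text):
    # Assumption: a "word" is a run of ASCII letters; anything else separates words.
    return re.findall(r"[A-Za-z]+", text)

def ngrams(words, n=3):
    """Return every run of n consecutive words, in their original order."""
    # If there are fewer than n words, the range is empty and so is the result.
    return [words[i:i + n] for i in range(len(words) - n + 1)]

sample = "At the time of his accession, the Swedish Riksdag held more power"
for gram in ngrams(tokenize(sample))[:4]:
    print(" ".join(gram))
# At the time
# the time of
# time of his
# of his accession
```

Writing each joined group to a file instead of printing it gives the output the question asks for.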

Answer 2

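You can try this. Note that it needs at least 3 words of input: with fewer than two words it raises StopIteration, and with exactly two it writes nothing.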

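def get_words():
    # Yield the words of file.txt one at a time, without commas or periods
    with open("file.txt", "r") as f:
        for line in f:
            # split() with no argument also discards the trailing newline
            for word in line.split():
                yield word.replace(",", "").replace(".", "")

with open("output.txt", "w") as f:
    it = get_words()
    # Prime the window with a placeholder plus the first two words;
    # the placeholder is dropped as soon as the third word arrives
    current = [""] + [next(it) for _ in range(2)]
    for word in it:
        current = current[1:] + [word]
        f.write(" ".join(current) + "\n")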
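
If short input should simply produce no output rather than an error, one possible variation keeps the window in a `collections.deque`. This is a sketch, not part of the answer above; `sliding_windows` is a hypothetical name.

```python
from collections import deque

def sliding_windows(words, n=3):
    """Yield each run of n consecutive words from any iterable of words."""
    window = deque(maxlen=n)  # the oldest word drops out automatically
    for word in words:
        window.append(word)
        if len(window) == n:
            yield tuple(window)

print(list(sliding_windows("At the time of his".split())))
# [('At', 'the', 'time'), ('the', 'time', 'of'), ('time', 'of', 'his')]
```

Because it accepts any iterable, it can be fed directly from a generator that reads the file word by word, and with fewer than `n` words it yields nothing.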

Answer 3

My understanding is that you want to generate n-grams, which is a common step in text vectorization before doing any NLP. Here is a simple implementation:

from sklearn.feature_extraction.text import CountVectorizer

string = ["At the time of his accession, the Swedish Riksdag held more power than the monarchy but was bitterly divided between rival parties."]
# you can change the ngram_range to get any combination of words
# (no stop_words filter is set, so every word is kept in the n-grams)
vectorizer = CountVectorizer(encoding='utf-8', ngram_range=(3, 3))

X = vectorizer.fit_transform(string)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())

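This gives you a list of n-grams of length 3, but the original word order is lost, because the feature names are returned in sorted order.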

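['accession the swedish', 'at the time', 'between rival parties', 'bitterly divided between', 'but was bitterly', 'divided between rival', 'held more power', 'his accession the', 'monarchy but was', 'more power than', 'of his accession', 'power than the', 'riksdag held more', 'swedish riksdag held', 'than the monarchy', 'the monarchy but', 'the swedish riksdag', 'the time of', 'time of his', 'was bitterly divided']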
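
If the order matters, as it does in the question, the trigrams can be built directly from the token list. The sketch below uses only the standard library; `ordered_ngrams` is a hypothetical name, and the tokenization (lowercase, tokens of two or more word characters) is an assumption meant to approximate CountVectorizer's defaults, not its actual implementation.

```python
import re

def ordered_ngrams(text, n=3):
    # Assumption: lowercase the text and keep tokens of two or more word
    # characters, roughly what CountVectorizer does by default.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ("At the time of his accession, the Swedish Riksdag held more power "
            "than the monarchy but was bitterly divided between rival parties.")
grams = ordered_ngrams(sentence)
print(grams[:3])   # ['at the time', 'the time of', 'time of his']
print(len(grams))  # 20
```

The result keeps the reading order of the sentence, so it can be written to the output file line by line.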

Reading from a txt file and splitting the words
