Question

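I am using a Python library called nlpnet. It is a part-of-speech tagger for Brazilian Portuguese, and after many attempts I got this result in the terminal: (image: Output of tagged data in terminal)

As the terminal image shows, it classifies each word individually with the abbreviation of its grammatical class. The challenge is to have the algorithm go through the whole analyzed document and rewrite only the sentences that contain more than 5 words of certain grammatical categories that I choose.

Example: analyze a txt document with several sentences and write to another file only the sentences that have more than 5 verbs or adjectives.

Code used. The class that prepares the tagger: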

#!/usr/bin/python
# -*- coding: utf8 -*-
import nlpnet


def get_tags(content):
    # Directory with the tagging models
    data_dir = 'pos-pt'
    # Set the model directory and the language to be used
    tagger = nlpnet.POSTagger(data_dir, language='pt')

    for sentence in content:
        # Tag the sentence
        tagged_sentence = tagger.tag(sentence)
        print(tagged_sentence)

    return content

The file-handling class:

#!/usr/bin/python
# -*- coding: utf8 -*-
import codecs
import teste


def loadContent():
    # Loading data set
    positiveData = codecs.open('opiniaoaborto.txt', 'r', encoding='utf8').readlines()

    data_set = [0 for i in range(2000)]
    label_set = [0 for i in range(2000)]

    data_set[:1000] = positiveData

    for i in range(2000):
        if i < 1000:
            label_set[i] = "p"
        else:
            label_set[i] = "n"

        # returning X feature set, y
    return data_set, label_set


content, label = loadContent()

content = teste.get_tags(content)

Answer 1

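If you just want to POS-tag the sentences of a document and dump to a file the sentences that contain at least N occurrences of a particular POS, you don't need the second script you posted.

Here is an extremely simplified example: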

import nlpnet

TAGGER = nlpnet.POSTagger('pos-pt', language='pt')


# You could have a function that tags a sentence and checks whether
# it meets the criteria for storage.

def is_worth_saving(text, pos, pos_count):
    # tagged sentences are lists of tagged words, which in
    # nlpnet are (word, pos) tuples. Tagged texts may contain
    # several sentences.
    pos_words = [word for sentence in TAGGER.tag(text)
                 for word in sentence
                 if word[1] == pos]
    return len(pos_words) >= pos_count


# Then you just need to open your original file, read a sentence, tag
# it, decide whether it's worth saving, and save it or not, until you
# have consumed the entire original file. This avoids loading the whole
# dataset into memory and keeps the memory footprint small.

with open('opiniaoaborto.txt', encoding='utf8') as original_file:
    with open('oracaos_interessantes.txt', 'w', encoding='utf8') as output_file:
        for text in original_file:
            # For example, only save sentences with at least 5 verbs
            # (pass 6 to require strictly more than 5)
            if is_worth_saving(text, 'V', 5):
                # Text mode translates '\n' itself, so os.linesep is not needed
                output_file.write(text + '\n')
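The check above uses `>=`, so a threshold of 5 means "at least 5", while the question asks for "more than 5". A minimal sketch of the difference, assuming the tagger output has the shape described above (a list of sentences, each a list of (word, tag) tuples). The helper name `count_pos` is hypothetical, and the tagged sample is written by hand for illustration rather than produced by nlpnet:

```python
def count_pos(tagged_sentences, pos):
    """Count the words tagged `pos` in nlpnet-shaped output."""
    return sum(1 for sentence in tagged_sentences
               for _word, tag in sentence
               if tag == pos)


# Hand-written stand-in for TAGGER.tag(...); the tags are illustrative
sample = [[('Comendo', 'V'), (',', 'PU'), ('dançando', 'V'), (',', 'PU'),
           ('procurando', 'V'), (',', 'PU'), ('olhando', 'V'),
           ('e', 'KC'), ('falando', 'V')]]

verbs = count_pos(sample, 'V')
print(verbs >= 5)  # True: at least 5 verbs
print(verbs > 5)   # False: not strictly more than 5
```

Keeping the counting separate from the tagging also lets you test the criterion without loading the model.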

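Answering your follow-up. You want to check whether a sentence contains 5 words tagged with any POS from a given list. I can think of two scenarios:

A) The 5 words must all belong to the same POS. For example, the sentence has 5 verbs ('Comendo, dançando, procurando, olhando e falando') or 5 nouns ('O gato, o sapo, o cão, o loro e o rato foram as compras'), but not 5 verbs and nouns combined ('O gato esta querendo comer o ratão' [2 nouns, 3 verbs]).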

import nlpnet
from collections import Counter

TAGGER = nlpnet.POSTagger('pos-pt', language='pt')

# the POS argument needs to be a list now
def is_worth_saving(text, pos_list, pos_count):
    # Count the occurrences of each interesting POS separately
    interesting_words = Counter()
    for sentence in TAGGER.tag(text):
        for word, pos in sentence:
            if pos in pos_list:
                interesting_words[pos] += 1

    return any(particular_pos_count >= pos_count
               for particular_pos_count
               in interesting_words.values())

# Since the argument which receives our desired POS categories takes
# lists of POS categories now, we also have to change the way we
# invoke `is_worth_saving`

with open('opiniaoaborto.txt', encoding='utf8') as original_file:
    with open('oracaos_interessantes.txt', 'w', encoding='utf8') as output_file:
        for text in original_file:
            # For example, only save sentences with at least 5 verbs or at least 5 nouns
            if is_worth_saving(text, ['V', 'N'], 5):  # Notice the POS argument takes lists now
                # Text mode translates '\n' itself, so os.linesep is not needed
                output_file.write(text + '\n')

B) The sentence contains 5 words whose POS is any of those in the list, counted together. For example: 'O gato esta querendo comer o ratão' (2 nouns + 3 verbs)

import nlpnet

TAGGER = nlpnet.POSTagger('pos-pt', language='pt')

# Again, one of the arguments has to take a list of valid POS
def is_worth_saving(text, pos_list, pos_count):
    pos_words = [word for sentence in TAGGER.tag(text)
                 for word in sentence
                 if word[1] in pos_list]
    return len(pos_words) >= pos_count

with open('opiniaoaborto.txt', encoding='utf8') as original_file:
    with open('oracaos_interessantes.txt', 'w', encoding='utf8') as output_file:
        for text in original_file:
            # For example, only save sentences whose combined count of verbs and nouns is at least 5
            if is_worth_saving(text, ['V', 'N'], 5):
                # Text mode translates '\n' itself, so os.linesep is not needed
                output_file.write(text + '\n')
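To see how scenarios A and B differ on the same input, here is a small sketch that applies both criteria to already-tagged data, so it runs without the nlpnet model. The function names (`pos_counts`, `meets_same_pos`, `meets_combined`) are hypothetical, and the tags for the example sentence are written by hand to match the "2 nouns, 3 verbs" description; the real tagger may tag it differently:

```python
from collections import Counter


def pos_counts(tagged_sentences, pos_list):
    """Count occurrences of each POS in pos_list across all sentences."""
    counts = Counter()
    for sentence in tagged_sentences:
        for _word, tag in sentence:
            if tag in pos_list:
                counts[tag] += 1
    return counts


def meets_same_pos(tagged_sentences, pos_list, n):
    # Scenario A: a single POS must reach n on its own
    return any(c >= n for c in pos_counts(tagged_sentences, pos_list).values())


def meets_combined(tagged_sentences, pos_list, n):
    # Scenario B: all listed POS are added together
    return sum(pos_counts(tagged_sentences, pos_list).values()) >= n


# Hand-tagged stand-in for TAGGER.tag('O gato esta querendo comer o ratão')
tagged = [[('O', 'ART'), ('gato', 'N'), ('esta', 'V'), ('querendo', 'V'),
           ('comer', 'V'), ('o', 'ART'), ('ratão', 'N')]]

print(meets_same_pos(tagged, ['V', 'N'], 5))  # False: 3 verbs, 2 nouns
print(meets_combined(tagged, ['V', 'N'], 5))  # True: 3 + 2 = 5
```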

Answer 2

When running the program from the command line, type $ python python_filename.py > savingfilename.txt. This redirects everything the program prints to the screen into a text file.
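If you would rather do the same thing from inside the script, the standard library's `contextlib.redirect_stdout` can send `print` output to a file. This is only a sketch: `run_and_save` and `demo` are hypothetical names, and `demo` stands in for the tagging loop by printing a hand-written tagged sentence.

```python
import contextlib


def run_and_save(func, path):
    """Call func() and write everything it prints to the file at `path`."""
    with open(path, 'w', encoding='utf8') as out:
        with contextlib.redirect_stdout(out):
            func()


def demo():
    # Stand-in for the loop that prints tagger output
    print([('O', 'ART'), ('gato', 'N')])


if __name__ == '__main__':
    run_and_save(demo, 'savingfilename.txt')
```

Unlike shell redirection, this only captures what Python code writes to `sys.stdout` while the `with` block is active.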

How to rewrite the terminal output to a txt file
