Question

Google语音文本的字符数限制为5000个，而我的文本字符数为5万个字符。我需要根据给定的限制对字符串进行分块，而不会切断单词。

“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”

如何在不切断单词的情况下将上述字符串分成不超过20个字符的字符串列表？

我看了NLTK库分块部分，没有看到任何内容。

Answer 1

一个基本的python方法将在前面查找20个字符，找到可能的最后一位空格，然后在该处剪切行。这不是一个令人难以置信的优雅实现，但它应该可以做到：

orig_string = “Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”
list_of_lines = []
max_length = 20
while len(orig_string) > max_length:
    line_length = orig_string[:max_length].rfind(' ')
    list_of_lines.append(orig_string[:line_length])
    orig_string = orig_string[line_length + 1:]
list_of_lines.append(orig_string)

Answer 2

这与Green Cloak Guy类似，但是使用生成器而不是创建列表。对于大文本，这应该对内存更友好一些，并允许您懒惰地遍历这些块。您可以使用list()将其转换为列表，也可以在需要迭代器的地方使用它：

s = "Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news."

def get_chunks(s, maxlength):
    start = 0
    end = 0
    while start + maxlength  < len(s) and end != -1:
        end = s.rfind(" ", start, start + maxlength + 1)
        yield s[start:end]
        start = end +1
    yield s[start:]

chunks = get_chunks(s, 25)

#Make list with line lengths:
[(n, len(n)) for n in chunks]

结果

[('Well, Prince, so Genoa', 22),
 ('and Lucca are now just', 22),
 ('family estates of the', 21),
 ('Buonapartes. But I warn', 23),
 ('you, if you don’t tell me', 25),
 ('that this means war, if', 23),
 ('you still try to defend', 23),
 ('the infamies and horrors', 24),
 ('perpetrated by that', 19),
 ('Antichrist—I really', 19),
 ('believe he is', 13),
 ('Antichrist—I will have', 22),
 ('nothing more to do with', 23),
 ('you and you are no longer', 25),
 ('my friend, no longer my', 23),
 ('‘faithful slave,’ as you', 24),
 ('call yourself! But how do', 25),
 ('you do? I see I have', 20),
 ('frightened you—sit down', 23),
 ('and tell me all the news.', 25)]

Answer 3

您可以按以下方式使用nltk.tokenize方法：

import nltk

corpus = '''
Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.” 
'''

tokens = nltk.tokenize.word_tokenize(corpus)

或

sent_tokens = nltk.tokenize.sent_tokenize(corpus)

Python：在给定字符数限制的情况下，将长文本拆分为字符串块

3 个答案: