Question

I want to split a string every 8 words. If the 8th word does not end with . or !, move on to the next word, until one does.

I can already split the string into words:

with open("file.txt") as c:
    for line in c:
        text = line.split()
        n = 8
        listword = [' '.join(text[i:i+n]) for i in range(0,len(text),n)]
        for lsb in listword:
            print(lsb)

The expected output should be:

I'm going to the mall for breakfast, Please meet me there for lunch. 
The duration of the next. He figured I was only joking!
I brought back the time.

This is what I am getting:

I'm going to the mall for breakfast, Please
meet me there for lunch. The duration of 
the next. He figured I was only joking!
I brought back the time.

Answer 1

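You are adding line breaks to a sequence of words. The main condition for a line break is that the last word ends with . or !. In addition, there is a secondary condition on minimum length (8 words or more). The following code collects words in a buffer until the conditions for printing a line are met.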

with open("file.txt") as c:
    out = []
    for line in c:
        for word in line.split():
            out.append(word)
            if word.endswith(('.', '!')) and len(out) >= 8:
                print(' '.join(out))
                out.clear()
    # don't forget to flush the buffer
    if out:
        print(' '.join(out))
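
To try the buffering approach without a file, the same logic can be wrapped in a function that takes a string and returns the lines. This is only a minimal sketch: the function name `split_on_sentence_end`, the `min_words` parameter, and the `sample` string (copied from the question's expected output) are made up for illustration.

```python
def split_on_sentence_end(text, min_words=8):
    """Group words into lines of at least min_words words, each ending in '.' or '!'."""
    lines = []
    buffer = []
    for word in text.split():
        buffer.append(word)
        # break only when the line is long enough AND the word ends a sentence
        if len(buffer) >= min_words and word.endswith(('.', '!')):
            lines.append(' '.join(buffer))
            buffer = []
    # flush whatever is left after the last break
    if buffer:
        lines.append(' '.join(buffer))
    return lines


# hypothetical sample input, taken from the question's expected output
sample = (
    "I'm going to the mall for breakfast, Please meet me there for lunch. "
    "The duration of the next. He figured I was only joking! "
    "I brought back the time."
)
for line in split_on_sentence_end(sample):
    print(line)
```

Returning a list instead of printing makes it easier to check the result or write it to another file.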

Answer 2

It looks like your code never looks for . or !; it just splits the text into chunks of 8 words. Here is one solution:

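buffer = []
output = []

with open("file.txt") as c:
    # read the whole file as one string and split it on whitespace
    for word in c.read().split():
        buffer.append(word)
        # 'and' binds tighter than 'or', so the punctuation test needs parentheses
        if ('!' in word or '.' in word) and len(buffer) > 7:
            output.append(' '.join(buffer))
            buffer = []

# keep any words left over after the last break
if buffer:
    output.append(' '.join(buffer))

print(output)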

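This takes a list of words, split on spaces. It appends each word to buffer until your conditions are met (the word contains punctuation and the buffer holds more than 7 words). It then appends buffer to output and clears buffer.

I don't know how your file is structured, so I tested with c as one long string of sentences. You may need to adjust the input so that it matches what the code expects.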
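
The condition described above mixes `or` and `and`. In Python, `and` binds more tightly than `or`, so the grouping has to be made explicit with parentheses to mean "punctuation, and more than 7 words". The two helper functions below are hypothetical and exist only to show the difference.

```python
def should_break_unparenthesized(word, count):
    # Parsed as: ('!' in word) or (('.' in word) and count > 7)
    return '!' in word or '.' in word and count > 7


def should_break(word, count):
    # The minimum length now applies to both punctuation marks
    return ('!' in word or '.' in word) and count > 7


print(should_break_unparenthesized("joking!", 3))  # True: '!' alone triggers a break
print(should_break("joking!", 3))                  # False: only 3 words so far
```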

Answer 3

I'm not sure how to do this with a list comprehension, but you can do it with a regular for loop.

n = 8
listword = []  # collected across all lines of the file

with open("file.txt") as c:
    for line in c:
        temp = []
        for val in line.split():
            temp.append(val)
            # break once at least n words are collected and the word ends a sentence
            if len(temp) >= n and val.endswith(('.', '!')):
                listword.append(' '.join(temp))
                temp = []
        if temp:  # leftover words at the end of the line form a final, shorter entry
            listword.append(' '.join(temp))

for lsb in listword:
    print(lsb)

Answer 4

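As you probably already know, you haven't written any code that checks for punctuation. The best approach is probably to use two indexes to track the start and end of the section to print. A section must contain at least 8 words, but must be longer if no punctuation is found on the 8th word.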

n = 8
with open('file.txt') as c:
    for line in c:
        words = line.split()

        # Use two indexes to keep track of which section to print
        start = 0
        while start < len(words):
            # end is the index of the last word in the section (at least n words)
            end = start + n - 1
            # If punctuation is not found there, advance end until it is found
            # or the line runs out of words
            while end < len(words) - 1 and '.' not in words[end] and '!' not in words[end]:
                end += 1
            print(' '.join(words[start:end + 1]))  # print from start to end, inclusive
            start = end + 1  # advance start to one after the last printed word

Result:

I'm going to the mall for breakfast, Please meet me there for lunch.
The duration of the next. He figured I was only joking!
I brought back the time.
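
The two-index idea can also be checked without a file by moving it into a function that works on a list of words. This is a sketch under assumptions: the names `has_punctuation` and `split_sections` are invented here, and the sample text is the single-line input implied by the question.

```python
def has_punctuation(word):
    # same test as in the answer: '.' or '!' anywhere in the word
    return '.' in word or '!' in word


def split_sections(words, n=8):
    """Return sections of at least n words, each extended to the next punctuated word."""
    sections = []
    start = 0
    while start < len(words):
        end = start + n - 1  # index of the last word of the section
        # extend the section until punctuation is found or the words run out
        while end < len(words) - 1 and not has_punctuation(words[end]):
            end += 1
        sections.append(' '.join(words[start:end + 1]))
        start = end + 1
    return sections


text = (
    "I'm going to the mall for breakfast, Please meet me there for lunch. "
    "The duration of the next. He figured I was only joking! "
    "I brought back the time."
)
for section in split_sections(text.split()):
    print(section)
```

Recomputing `end` from `start` on every pass keeps each section at least `n` words long even after a section was extended past its 8th word.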

Python: split a string at the next punctuation mark
