我如何使用python计算文本文件的段落?

时间:2017-03-22 12:35:11

标签: python text paragraph

我正在尝试编写一本书密码解码器,以下是我到目前为止所做的。

code = open("code.txt", "r").read() 
my_book = open("book.txt", "r").read() 
book = my_book.txt 
code_line = 0 
while code_line < 6 :
      sl = code.split('\n')[code_line]+'\n'
      paragraph_num = sl.split(' ')[0]
      line_num =  sl.split(' ')[1]
      word_num = sl.split(' ')[2]
      x = x+1

循环改变了段落,行,单词变量,并且每件事都运行得很好。

但我现在需要的是如何指定段落然后是行然后单词 ,while循环中的for循环可以很好地工作。

所以我想从段号“paragraph_num”和行号“line_num”获得单词编号“word_num”

这是我的代码文件,我正在尝试将其转换为单词

  

“段号”,“行号”,“字号”

70 1 3
50 2 2
21 2 9
28 1 6
71 2 2
27 1 4

然后我希望我的输出看起来像这样

word 
word  
word 
word 
word 
word

我的书“我需要获取文字的文件”看起来像这样

  

单词单词单词单词单词单词单词单词单词   词词词词词词词词词词词词   词词词词词词词词词词词词   单词单词单词单词单词单词单词单词

     

单词单词单词单词单词单词单词单词单词   词词词词词词词词词词词词   词词词词词词词词词词词词   词词词词词词词词词词词词   单词单词单词

     

单词单词单词单词单词单词单词单词单词   词词词词词词词词词词词词   词词词词词词词词词词词词   单词单词单词

2 个答案:

答案 0 :(得分:2)

理论

如果您想从文字中删除段落,可以split "\n\n"

>>> "word\n\nword\nword\n\nword".split("\n\n")
['word', 'word\nword', 'word']

您现在有一个段落列表。对于每个段落,您可以按"\n"拆分并获取行列表。

对于每一行,您可以split不带参数并获取单词列表。

嵌套循环

text = """word word word word word word word word word
word word word word word word word
word word word word word word word word word word word word word word word word word word word word word
word word word word word word word word word word word word word word word word word word

word word word word boat word word word word word
word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word word

word word word word word word word word word word word
word word word word word word word word word word word word word word
word word word word word word word word word word word word word word word word word word word word word word"""

for i, paragraph in enumerate(text.split("\n\n")):
    for j, line in enumerate(paragraph.split("\n")):
        for k, word in enumerate(line.split()):
            print("%d, %d, %d : %s" % (i,j,k,word))

输出:

0, 0, 0 : word
0, 0, 1 : word
0, 0, 2 : word
0, 0, 3 : word
0, 0, 4 : word
0, 0, 5 : word
0, 0, 6 : word
0, 0, 7 : word
0, 0, 8 : word
0, 1, 0 : word
0, 1, 1 : word
0, 1, 2 : word
0, 1, 3 : word
0, 1, 4 : word
0, 1, 5 : word
0, 1, 6 : word
0, 2, 0 : word
0, 2, 1 : word
0, 2, 2 : word
0, 2, 3 : word
0, 2, 4 : word
0, 2, 5 : word
0, 2, 6 : word
0, 2, 7 : word
0, 2, 8 : word
0, 2, 9 : word
0, 2, 10 : word
0, 2, 11 : word
0, 2, 12 : word
0, 2, 13 : word
0, 2, 14 : word
0, 2, 15 : word
0, 2, 16 : word
0, 2, 17 : word
0, 2, 18 : word
0, 2, 19 : word
0, 2, 20 : word
0, 3, 0 : word
0, 3, 1 : word
0, 3, 2 : word
0, 3, 3 : word
0, 3, 4 : word
0, 3, 5 : word
0, 3, 6 : word
0, 3, 7 : word
0, 3, 8 : word
0, 3, 9 : word
0, 3, 10 : word
0, 3, 11 : word
0, 3, 12 : word
0, 3, 13 : word
0, 3, 14 : word
0, 3, 15 : word
0, 3, 16 : word
0, 3, 17 : word
1, 0, 0 : word
1, 0, 1 : word
1, 0, 2 : word
1, 0, 3 : word
1, 0, 4 : boat
1, 0, 5 : word
1, 0, 6 : word

循环对于查看所需的索引是有用的。

嵌套列表推导

如果要快速查找,可以使用嵌套列表推导来创建“3D列表”:

table = [[[word for word in line.split()] for line in paragraph.split("\n")] for paragraph in text.split("\n\n")]

输出:

[[['word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word']], [['word', 'word', 'word', 'word', 'boat', 'word', 'word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word']], [['word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word'], ['word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word', 'word']]]

您可以通过这种方式获得所需的字词:

table[1][0][4]
# "boat"

如果你有一个元组列表:

codes = [
        (1, 0, 4),
        (2, 1, 3)
        ]

for i,j,k in codes:
    print(table[i][j][k])

答案 1 :(得分:0)

如果有人想要另一个与众不同的代码

因为我坚信这与arnold/book cipher with python *

这里的“书密码”有关

我通过该链接在此处发布代码;如果这种理解是错误的,请告诉我。

# Replace "document1.txt" with whatever your book / document's name is.

BOOK="document1.txt" # This contains your "Word Word Word Word ...." I believed from the very start that you meant, they are not the same - (obviously)

# Read book into "boktxt"
def GetBookContent(BOOK):
    ReadBook = open(BOOK, "r")
    txtContent_splitted = ReadBook.read();
    ReadBook.close()
    Words=txtContent_splitted

    return(txtContent_splitted.split())


boktxt = GetBookContent(BOOK)

words=input("input text: ").split()
print("\nyou entered these words:\n",words)

i=0
words_len=len(words)
for word in boktxt:
    while i < words_len:
        print(boktxt.index(words[i]))
        i=i+1

x=0
klist=input("input key-sequence sep. With spaces: ").split()
for keys in klist:
        print(boktxt[int(klist[x])])
        x=x+1