I'm trying to get my program to print out the index of every word and punctuation mark in a text file. I've done that part. The problem comes when I try to use those index positions to recreate the original text, punctuation included. Here is my code:
with open('newfiles.txt') as f:
    s = f.read()

import re

# Splitting string into a list using regex and a capturing group:
matches = [x.strip() for x in re.split("([a-zA-Z]+)", s) if x not in ['', ' ']]
print(matches)

d = {}
i = 1
list_with_positions = []
# Build the dictionary entries:
for match in matches:
    if match not in d.keys():
        d[match] = i
        i += 1
    list_with_positions.append(d[match])
print(list_with_positions)

file = open("newfiletwo.txt", "w")
file.write(''.join(str(e) for e in list_with_positions))
file.close()

file = open("newfilethree.txt", "w")
file.write(''.join(matches))
file.close()

word_base = None
with open('newfilethree.txt', 'rt') as f_base:
    word_base = [None] + [z.strip() for z in f_base.read().split()]

sentence_seq = None
with open('newfiletwo.txt', 'rt') as f_select:
    sentence_seq = [word_base[int(i)] for i in f_select.read().split()]

print(' '.join(sentence_seq))
As I said, the first part works fine, but then I get this error:
Traceback (most recent call last):
  File "E:\Python\Indexes.py", line 33, in <module>
    sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
  File "E:\Python\Indexes.py", line 33, in <listcomp>
    sentence_seq = [word_base[int(i)] for i in f_select.read().split()]
IndexError: cannot fit 'int' into an index-sized integer
This error occurs when the program reaches the 'sentence_seq' line at the bottom of the code.

newfiles.txt is the original text file - a random article made up of several sentences with punctuation.

list_with_positions is the list holding the actual position at which each word occurs in the original text.

matches is the list of split-out distinct tokens - if words in the file repeat (and they do), matches should contain only the distinct words.
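For example (with a made-up input, not my actual file), the first part produces:

# Hypothetical short input, just to illustrate the two structures:
s = "Hello, world. Hello!"
matches = [x.strip() for x in re.split("([a-zA-Z]+)", s) if x not in ['', ' ']]
# matches             -> ['Hello', ',', 'world', '.', 'Hello', '!']
# list_with_positions -> [1, 2, 3, 4, 1, 5]  (the second 'Hello' reuses index 1)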
Does anyone know why I am getting this error? I can't figure out what the problem is. Thanks :-)
Answer 0 (score: 1)
The problem with your approach is the use of ''.join(), because it joins everything together with no spaces. So the immediate problem is that you are trying to split() what is actually one long, unbroken run of digits; what you get back is a single value 100+ digits long. That huge number then cannot fit into an index-sized integer when you try to use it as an index. A further problem is that the positions can reach two digits and beyond - with the numbers joined together without spaces, how did you expect split() to separate them?
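A minimal demonstration of that failure mode, using made-up positions rather than your actual file:

# Suppose the positions were [1, 2, 3, 4, 1, 5]:
joined = ''.join(str(e) for e in [1, 2, 3, 4, 1, 5])
print(joined)          # '123415' - one token, not six
print(joined.split())  # ['123415'] - there is no whitespace to split on
# word_base[int('123415')] then asks for index 123415, hence the IndexError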
On top of that, you are not treating punctuation correctly. ' '.join() is equally wrong when you try to rebuild the sentence, because you end up with commas, full stops and so on with a space on both sides.
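For instance, again with made-up tokens:

tokens = ['Hello', ',', 'world', '.']
print(' '.join(tokens))  # 'Hello , world .' - stray spaces before the punctuation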
I have done my best to stick with your current code/approach (I don't see much value in changing the whole method while you are still trying to understand where the problem comes from), but it still feels clumsy to me. I dropped the regex; maybe it is needed. I am not immediately aware of a library for this kind of thing, but there almost certainly must be a better way:
import string

punctuation_list = set(string.punctuation)  # Punctuation has to be treated differently

word_base = []
index_dict = {}

# Record each whitespace-separated token and the (last) position it was seen at
with open('newfiles.txt', 'r') as infile:
    raw_data = infile.read().split()
    for index, item in enumerate(raw_data):
        index_dict[item] = index
        word_base.append(item)

# Write the tokens and their indices out, space-separated this time
with open('newfiletwo.txt', 'w') as outfile1, open('newfilethree.txt', 'w') as outfile2:
    for item in word_base:
        outfile1.write(str(item) + ' ')
        outfile2.write(str(index_dict[item]) + ' ')

reconstructed = ''
with open('newfiletwo.txt', 'r') as infile1, open('newfilethree.txt', 'r') as infile2:
    words = infile1.read().split()
    indices = infile2.read().split()
    # Rebuild with no space before punctuation tokens
    reconstructed = ''.join([item + ' ' if item in punctuation_list else ' ' + item + ' '
                             for item in word_base])
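If you would rather keep your original index-based files, here is a minimal sketch of the two fixes applied to your own code - it assumes the d and list_with_positions built by your original loop, and it only handles single-character punctuation tokens:

import string

punct = set(string.punctuation)

# Write the positions space-separated so that split() can recover them,
# and write only the distinct tokens, in index order, as the lookup table
with open("newfiletwo.txt", "w") as f:
    f.write(' '.join(str(e) for e in list_with_positions))
with open("newfilethree.txt", "w") as f:
    f.write(' '.join(sorted(d, key=d.get)))

with open("newfilethree.txt") as f_base:
    word_base = [None] + f_base.read().split()
with open("newfiletwo.txt") as f_select:
    tokens = [word_base[int(i)] for i in f_select.read().split()]

# Join punctuation-aware: no space before a single-character punctuation token
reconstructed = ''.join(t if t in punct else ' ' + t for t in tokens).strip()
print(reconstructed)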