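I need to build a program for my class that reads messy text from a file and prints it in book form, using a line width taken from input. Given this text: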
This is programing story , for programmers . One day a variable
called
v comes to a bar and ordred some whiskey, when suddenly
a new variable was declared .
a new variable asked : " What did you ordered? "
it should produce this output:
This is programing story,
for programmers. One day
a variable called v comes
to a bar and ordred some
whiskey, when suddenly a
new variable was
declared. A new variable
asked: "what did you
ordered?"
I'm a beginner at programming, and here is my code:
def vypis(t):
    cely_text = ''
    for riadok in t:
        cely_text += riadok.strip()
    a = 0
    for i in range(0,80):
        if cely_text[0+a] == " " and cely_text[a+1] == " ":
            cely_text = cely_text.replace (" ", " ")
        a+=1
    d=0
    for c in range(0,80):
        if cely_text[0+d] == " " and (cely_text[a+1] == "," or cely_text[a+1] == "." or cely_text[a+1] == "!" or cely_text[a+1] == "?"):
            cely_text = cely_text.replace (" ", "")
        d+=1
def vymen(riadok):
    for ch in riadok:
        if ch in '.,":':
            riadok = riadok[ch-1].replace(" ", "")
x = int(input("Zadaj x"))
t = open("text.txt", "r")
v = open("prazdny.txt", "w")
print(vypis(t))
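This code removes some of the spaces, and I tried to remove the spaces before characters such as ".,_?", but that doesn't work. Why? Thanks for your help :)
Answer 0 (score: 3)
You want to do quite a few things, so let's take them in order:
Let's get the text in a nice form (a list of strings):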
>>> with open('text.txt', 'r') as f:
...     lines = f.readlines()
>>> lines
['This is programing story , for programmers . One day a variable',
'called', 'v comes to a bar and ordred some whiskey, when suddenly ',
' a new variable was declared .',
'a new variable asked : " What did you ordered? "']
There are newlines lying around. Let's replace them with spaces and join everything into one big string:
>>> text = ' '.join(line.replace('\n', ' ') for line in lines)
>>> text
'This is programing story , for programmers . One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared . a new variable asked : " What did you ordered? "'
Now we want to get rid of any runs of multiple spaces. We split on whitespace (spaces, tabs, etc.) and keep only the non-empty words:
>>> words = [word for word in text.split() if word]
>>> words
['This', 'is', 'programing', 'story', ',', 'for', 'programmers', '.', 'One', 'day', 'a', 'variable', 'called', 'v', 'comes', 'to', 'a', 'bar', 'and', 'ordred', 'some', 'whiskey,', 'when', 'suddenly', 'a', 'new', 'variable', 'was', 'declared', '.', 'a', 'new', 'variable', 'asked', ':', '"', 'What', 'did', 'you', 'ordered?', '"']
Let's join our words back together with a space... (only one this time)
>>> text = ' '.join(words)
>>> text
'This is programing story , for programmers . One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared . a new variable asked : " What did you ordered? "'
We now want to remove every <SPACE>., <SPACE>, and so on:
>>> for char in (',', '.', ':', '"', '?', '!'):
...     text = text.replace(' ' + char, char)
>>> text
'This is programing story, for programmers. One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared. a new variable asked:" What did you ordered?"'
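OK, the job isn't finished: the " is still messed up, the capitalization hasn't been set, and so on... You can still keep updating your text step by step. For capitalization, consider for example: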
>>> sentences = text.split('.')
>>> sentences
['This is programing story, for programmers', ' One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared', ' a new variable asked:" What did you ordered?"']
See how to fix it? The trick is to use only string-to-string transformations. That way you can compose them and improve your text step by step.
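As one possible sketch of such a transformation, the pieces returned by split('.') can be capitalized and joined back. The function name capitalize_sentences is made up for illustration, and the sketch assumes that sentences end only with a period:

```python
def capitalize_sentences(text):
    """Upper-case the first letter of every '.'-separated sentence.

    Hypothetical helper: it assumes '.' is the only sentence terminator.
    """
    fixed = []
    for sentence in text.split('.'):
        stripped = sentence.lstrip()
        if stripped:
            # Keep the leading whitespace, upper-case only the first character
            lead = sentence[:len(sentence) - len(stripped)]
            sentence = lead + stripped[0].upper() + stripped[1:]
        fixed.append(sentence)
    return '.'.join(fixed)


print(capitalize_sentences('this is a story. one day a variable was declared.'))
```

Because it takes a string and returns a string, it can be chained with the earlier space-removal steps in any order that makes sense.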
Once you have nicely formatted text, like this:
>>> text
'This is programing story, for programmers. One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared. A new variable asked: "what did you ordered?"'
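you have to define similar syntactic rules for printing it in book format. Consider, for example, the function:
>>> def prettyprint(text):
...     return '\n'.join(text[i:i+50] for i in range(0, len(text), 50))
It prints lines that are exactly 50 characters long (except possibly the last one):
>>> print(prettyprint(text))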
This is programing story, for programmers. One day
a variable called v comes to a bar and ordred som
e whiskey, when suddenly a new variable was declar
ed. A new variable asked: "what did you ordered?"
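Not bad, but it could be better. Just as we worked with text, lines, sentences and words to match the syntactic rules of the English language, we want to do exactly the same to match the syntactic rules of a printed book.
In this case, the English language and printed books work with the same units: words, arranged into sentences. This suggests that we may want to work on these directly. A simple way to do that is to define your own objects: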
>>> class Sentence(object):
...     def __init__(self, content, punctuation):
...         self.content = content
...         self.endby = punctuation
...     def pretty(self):
...         nice = []
...         content = self.content.pretty()
...         # A sentence starts with a capital letter
...         nice.append(content[0].upper())
...         # The rest has already been prettified by the content
...         nice.extend(content[1:])
...         # Do not forget the punctuation sign
...         nice.append(self.endby)
...         return ''.join(nice)
>>> class Paragraph(object):
...     def __init__(self, sentences):
...         self.sentences = sentences
...     def pretty(self):
...         # Separating our sentences by a single space
...         return ' '.join(sentence.pretty() for sentence in self.sentences)
And so on... This way, you can represent your text as:
>>> Paragraph(
...     Sentence(
...         Propositions([Proposition(['this',
...                                    'is',
...                                    'programming',
...                                    'story']),
...                       Proposition(['for',
...                                    'programmers'])],
...                      ','),
...         '.'),
...     Sentence(...
and so on...
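To make the idea concrete, here is a minimal runnable sketch of such a tree. The Proposition and Propositions classes are not spelled out in this answer, so their constructors and pretty methods below are assumptions made for illustration, and Paragraph is given a list of sentences:

```python
class Proposition(object):
    """A run of words with no punctuation inside (assumed shape)."""
    def __init__(self, words):
        self.words = words

    def pretty(self):
        return ' '.join(self.words)


class Propositions(object):
    """Several propositions separated by one punctuation sign (assumed shape)."""
    def __init__(self, propositions, punctuation):
        self.propositions = propositions
        self.punctuation = punctuation

    def pretty(self):
        # The sign sticks to the previous word and is followed by one space
        separator = self.punctuation + ' '
        return separator.join(p.pretty() for p in self.propositions)


class Sentence(object):
    def __init__(self, content, punctuation):
        self.content = content
        self.endby = punctuation

    def pretty(self):
        content = self.content.pretty()
        # Capital letter first, end punctuation last
        return content[0].upper() + content[1:] + self.endby


class Paragraph(object):
    def __init__(self, sentences):
        self.sentences = sentences

    def pretty(self):
        return ' '.join(s.pretty() for s in self.sentences)


paragraph = Paragraph([
    Sentence(Propositions([Proposition(['this', 'is', 'programing', 'story']),
                           Proposition(['for', 'programmers'])], ','), '.'),
    Sentence(Propositions([Proposition(['one', 'day'])], ','), '.'),
])
print(paragraph.pretty())
```

Each level only knows its own rule (spacing, comma placement, capitalization), which is what keeps the individual pretty methods short.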
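Converting a string (even a messy one) into such a tree is relatively straightforward, because you only break it down into the smallest possible elements. When you want to print it in book format, you can define your own book method on every element of the tree, for example like this, passing the current line, the output lines, the current offset on the current line, and the maximum line length: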
class Proposition(object):
    ...
    def book(self, line, lines, offset, line_length):
        for word in self.words:
            # +1 accounts for the space that joins the word to the line
            if line and offset + 1 + len(word) > line_length:
                lines.append(' '.join(line))
                line = []
                offset = 0
            offset += len(word) + (1 if line else 0)
            line.append(word)
        return line, lines, offset
    ...
class Propositions(object):
    ...
    def book(self, line, lines, offset, line_length):
        line, lines, offset = self.Proposition1.book(line, lines, offset, line_length)
        if offset + len(self.punctuation) > line_length:
            # The punctuation sign does not fit: move it, together
            # with the last word, to a new line
            word = line.pop()
            if line:
                lines.append(' '.join(line))
            line = [word + self.punctuation]
            offset = len(word + self.punctuation)
        else:
            # The sign fits: attach it to the last word
            line[-1] += self.punctuation
            offset += len(self.punctuation)
        line, lines, offset = self.Proposition2.book(line, lines, offset, line_length)
        return line, lines, offset
Continue in the same way with Sentence, Paragraph, Chapter...
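This is a very simplistic implementation (of what is actually a non-trivial problem) that takes into account neither hyphenation nor justification (which you would probably want), but this is the way to go.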
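If you do not need the tree at all, the same greedy idea can be sketched on a flat list of words. The helper name wrap_words is hypothetical; a width of 25 is an assumption that happens to reproduce the expected output from the question, and the standard library's textwrap.fill gives the same layout for this text:

```python
import textwrap


def wrap_words(text, line_length):
    """Greedily pack whole words into lines of at most line_length characters."""
    lines, line, offset = [], [], 0
    for word in text.split():
        # +1 for the space that would join the word to a non-empty line
        needed = len(word) + (1 if line else 0)
        if line and offset + needed > line_length:
            # The word does not fit: close the current line, start a new one
            lines.append(' '.join(line))
            line, offset, needed = [], 0, len(word)
        line.append(word)
        offset += needed
    if line:
        lines.append(' '.join(line))
    return lines


text = ('This is programing story, for programmers. One day a variable '
        'called v comes to a bar and ordred some whiskey, when suddenly '
        'a new variable was declared. A new variable asked: '
        '"what did you ordered?"')
book = wrap_words(text, 25)
print('\n'.join(book))
# Standard-library equivalent for this input
print(textwrap.fill(text, width=25))
```

A word longer than line_length is simply placed on a line of its own here; textwrap.fill would break it instead.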
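Note that I did not mention the string module, string formatting or regular expressions, which are the tools to reach for once you can define your syntactic rules or transformations. These are extremely powerful tools, but the most important thing here is to know exactly which algorithm turns an invalid string into a valid one. Once you have some working pseudocode, regexps and format strings can help you implement it with less pain than plain character iteration. (In my earlier word-tree example, for instance, regexps can greatly simplify the construction of the tree, and Python's powerful string formatting can make the book or pretty methods much easier to write.)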
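As a rough sketch of how a regexp could help build the tree, the function below cuts a string into sentences and comma-separated propositions. The name parse_sentences and the plain list/tuple output are assumptions made for illustration; trailing text without end punctuation is ignored in this simplified version:

```python
import re


def parse_sentences(text):
    """Return a list of (propositions, end_punctuation) pairs.

    Each proposition is a list of words; propositions are split on commas.
    """
    tree = []
    # One match per sentence: everything up to the next '.', '?' or '!'
    for content, end in re.findall(r'([^.?!]+)([.?!])', text):
        propositions = [part.split() for part in content.split(',')
                        if part.strip()]
        tree.append((propositions, end))
    return tree


print(parse_sentences('This is a story , for programmers . Is it ?'))
```

Since the words are obtained with split(), stray spaces around commas and periods disappear as a side effect of parsing.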
Answer 1 (score: 1)
To strip multiple spaces, you can use a simple regular-expression substitution.
import re
cely_text = re.sub(' +',' ', cely_text)
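Then, for the punctuation, you can run a similar substitution (note the raw string, so that \g is not treated as a string escape):
cely_text = re.sub(' +([,.:])', r'\g<1>', cely_text)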
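Put together, the two substitutions could look like the sketch below. The function name tidy is hypothetical, and the character class is extended with ? and ! on the assumption that those should be handled too, since the question mentions them:

```python
import re


def tidy(text):
    """Collapse whitespace runs and drop spaces before punctuation."""
    # Any run of whitespace (spaces, tabs, newlines) becomes one space
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove the spaces that precede a punctuation sign
    text = re.sub(r' +([,.:?!])', r'\g<1>', text)
    return text


print(tidy('One day  a variable ,\ncalled v .'))
```

Using \s+ instead of ' +' in the first step also takes care of the newlines, so the lines of the file can be joined before calling it.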