Parsing Shakespeare text in Python

Time: 2012-01-25 08:02:51

Tags: python regex string

This is the source text I want to parse:

                       1
      From fairest creatures we desire increase,
      That thereby beauty's rose might never die,
      But as the riper should by time decease,
      His tender heir might bear his memory:
      But thou contracted to thine own bright eyes,
      Feed'st thy light's flame with self-substantial fuel,
      Making a famine where abundance lies,
      Thy self thy foe, to thy sweet self too cruel:
      Thou that art now the world's fresh ornament,
      And only herald to the gaudy spring,
      Within thine own bud buriest thy content,
      And tender churl mak'st waste in niggarding:
        Pity the world, or else this glutton be,
        To eat the world's due, by the grave and thee.


                         2
      When forty winters shall besiege thy brow,
      And dig deep trenches in thy beauty's field,
      Thy youth's proud livery so gazed on now,
      Will be a tattered weed of small worth held:  
      Then being asked, where all thy beauty lies,
      Where all the treasure of thy lusty days;
      To say within thine own deep sunken eyes,
      Were an all-eating shame, and thriftless praise.
      How much more praise deserved thy beauty's use,
      If thou couldst answer 'This fair child of mine
      Shall sum my count, and make my old excuse'
      Proving his beauty by succession thine.
        This were to be new made when thou art old,
        And see thy blood warm when thou feel'st it cold.


                         3
      Look in thy glass and tell the face thou viewest,
      Now is the time that face should form another,
      Whose fresh repair if now thou not renewest,
      Thou dost beguile the world, unbless some mother.
      For where is she so fair whose uneared womb
      Disdains the tillage of thy husbandry?
      Or who is he so fond will be the tomb,
      Of his self-love to stop posterity?  
      Thou art thy mother's glass and she in thee
      Calls back the lovely April of her prime,
      So thou through windows of thine age shalt see,
      Despite of wrinkles this thy golden time.
        But if thou live remembered not to be,
        Die single and thine image dies with thee.

I want to parse it into blocks like this:

The first block should be:

  

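      From fairest creatures we desire increase,
      That thereby beauty's rose might never die,
      But as the riper should by time decease,
      His tender heir might bear his memory:
      But thou contracted to thine own bright eyes,
      Feed'st thy light's flame with self-substantial fuel,
      Making a famine where abundance lies,
      Thy self thy foe, to thy sweet self too cruel:
      Thou that art now the world's fresh ornament,
      And only herald to the gaudy spring,
      Within thine own bud buriest thy content,
      And tender churl mak'st waste in niggarding:
        Pity the world, or else this glutton be,
        To eat the world's due, by the grave and thee.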

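The second:

      When forty winters shall besiege thy brow,
      And dig deep trenches in thy beauty's field,
      Thy youth's proud livery so gazed on now,
      Will be a tattered weed of small worth held:
      Then being asked, where all thy beauty lies,
      Where all the treasure of thy lusty days;
      To say within thine own deep sunken eyes,
      Were an all-eating shame, and thriftless praise.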

The third:

  

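      How much more praise deserved thy beauty's use,
      If thou couldst answer 'This fair child of mine
      Shall sum my count, and make my old excuse'
      Proving his beauty by succession thine.
        This were to be new made when thou art old,
        And see thy blood warm when thou feel'st it cold.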

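...and so on. Whenever a sentence ends with a period (.), I want that part to become a new block.

How can I parse this? I'd like a clean and efficient way to do it. I don't want to go through it character by character and run a bunch of checks...

Thanks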

2 Answers:

Answer 0: (score: 1)

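If you don't want to check character by character, and the source looks exactly like what you posted, you can go through it line by line and look for the empty lines.

Depending on the implementation, I'm not sure that will be any more efficient. It might even be the opposite.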
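A minimal sketch of that line-by-line idea, assuming blank lines are the only separators. The helper name split_on_blank_lines and the shortened sample are made up for illustration. Note that this gives one block per sonnet, with the number line still attached, not one block per sentence:

```python
def split_on_blank_lines(text):
    """Group consecutive non-empty lines into blocks; blank lines end a block."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the block collected so far.
            blocks.append("\n".join(current))
            current = []
    if current:
        # Flush the last block when the text does not end with a blank line.
        blocks.append("\n".join(current))
    return blocks


# Shortened sample (hypothetical excerpt of the source text).
sample = (
    "                       1\n"
    "      From fairest creatures we desire increase,\n"
    "      That thereby beauty's rose might never die,\n"
    "\n"
    "\n"
    "                         2\n"
    "      When forty winters shall besiege thy brow,\n"
)
blocks = split_on_blank_lines(sample)
```

Each line is looked at once, so this is linear in the size of the text. Getting from here to per-sentence blocks would still need a second pass over each block.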

Answer 1: (score: 1)

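You can split it on the sonnet-number lines with the call below. The multiline flag is passed as flags=re.MULTILINE because Python 3.11 and later reject an inline (?m) in the middle of a pattern, and the whitespace after the number is optional:

import re
re.split(r"(?:^|(?:[^\S\n]*\n){2}^)[ \t]+\d+[ \t]*[\r\n]+", text, flags=re.MULTILINE)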
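Splitting on the number lines separates the sonnets, but the question asks for a new block after every line that ends with a period. A rough sketch of one way to do that with two regex passes follows. The function name split_at_periods and the shortened sample are made up for illustration, and it assumes a period only ends a block when it is the last non-blank character on its line:

```python
import re


def split_at_periods(text):
    """Return blocks that each end at a line finishing with '.'."""
    # Blank out the centered sonnet-number lines (whitespace, digits, whitespace).
    body = re.sub(r"^[ \t]*\d+[ \t]*$", "", text, flags=re.MULTILINE)
    # Split right after a line-ending period, swallowing the newline(s)
    # and any indentation that follow it.
    parts = re.split(r"(?<=\.)[ \t]*\n\s*", body)
    return [p.strip() for p in parts if p.strip()]


# Shortened sample (hypothetical excerpt of the source text).
sample = (
    "                         2\n"
    "      Will be a tattered weed of small worth held:  \n"
    "      Were an all-eating shame, and thriftless praise.\n"
    "      How much more praise deserved thy beauty's use,\n"
    "        And see thy blood warm when thou feel'st it cold.\n"
    "\n"
    "\n"
    "                         3\n"
    "      Look in thy glass and tell the face thou viewest,\n"
)
blocks = split_at_periods(sample)
```

Lines ending in other punctuation (? ; :) do not start a new block here, which follows the rule as stated in the question. Indentation inside a block is kept as it was; only the outer whitespace of each block is stripped.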