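Using regular expressions in Python

Time: 2018-11-01 22:36:06

Tags: python regex

I have read in a file named abc.txt.

Now I want to use regular expressions to split the text of the file into words, handling these four cases: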

  1. "...n't" => "... not"
  2. Abbreviations such as Mme.
  3. Merging stutters such as k-k-kick
  4. Splitting words at hyphens.

The text of abc.txt looks like this:

**THE WIND IN THE WILLOWS
BY KENNETH GRAHAME
CONTENTS

CHAPTER
I.    THE RIVER BANK
II.   THE OPEN ROAD
III.  THE WILD WOOD
IV.   MR. BADGER
V.    DULCE DOMUM
VI.   MR. TOAD
VII.  THE PIPER AT THE GATES OF DAWN
VIII. TOAD'S ADVENTURES
IX.   WAYFARERS ALL
X.    THE FURTHER ADVENTURES OF TOAD
XI.   "LIKE SUMMER TEMPESTS CAME HIS TEARS"
XII.  THE RETURN OF ULYSSES
     

I. THE RIVER BANK

     

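The Mole had been working very hard all the morning, spring-cleaning his little home. First with brooms, then with dusters; then on ladders and steps and chairs, with a brush and a pail of whitewash; till he had dust in his throat and eyes, and splashes of whitewash all over his black fur, and an aching back and weary arms. Spring was moving in the air above and in the earth below and around him, penetrating even his dark and lowly little house with its spirit of divine discontent and longing. It was small wonder, then, that he suddenly flung down his brush on the floor, said 'Bother!' and 'O blow!' and also 'Hang spring-cleaning!' and bolted out of the house without even waiting to put on his coat.**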

What I have tried:

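import re
# (pattern, replacement) pairs; raw strings keep \b and \1 as regex syntax
RE = ((r"([a-z])n’t\b", r"\1not"), (r"\bma’a?m\b", "madam"), (r"W([a-z])-([a-z])", r"\1\2"), (r"-+", " "))
# Read the whole file into a string
with open("abc.txt", "r") as f:
    W = f.read()
W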
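
Patterns like these depend on how Python handles backslashes in string literals. In an ordinary literal, "\b" is a backspace character and "\1" is the character with code 1, so the regex engine never sees a word boundary or a backreference. A minimal sketch of the difference, assuming a made-up sample sentence that is not taken from abc.txt:

```python
import re

sample = "I don't know"  # hypothetical sample text, not from abc.txt

# Ordinary literals: "\b" is a backspace and "\1" is chr(1),
# so this pattern cannot match normal text
plain = re.sub("([a-z])n't\b", "\1 not", sample)

# Raw literals: \b is a word boundary and \1 is a backreference
raw = re.sub(r"([a-z])n't\b", r"\1 not", sample)

print(repr(plain))  # the text comes back unchanged
print(repr(raw))    # the contraction is expanded
```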

Now I get the following output:

[image: current output]

What I expect is:

[image: expected output]

1 answer:

Answer 0: (score: 0)

Try using the re.split method:

# Import regular expression operations
import re

# Text from the file
text = """** THE WIND IN THE WILLOWS
    BY KENNETH GRAHAME
    CONTENTS

    CHAPTER
    I.THE RIVER BANK
    II.THE OPEN ROAD
    III.THE WILD WOOD
    IV.MR.BADGER
    V.DULCE DOMUM
    VI.MR.TOAD
    VII.THE PIPER AT THE GATES OF DAWN
    VIII.TOAD'S ADVENTURES
    IX.WAYFARERS ALL
    X.THE FURTHER ADVENTURES OF TOAD
    XI."LIKE SUMMER TEMPESTS CAME HIS TEARS"
    XII.THE RETURN OF ULYSSES

    I.THE RIVER BANK"""

# Split text wherever one-or-more non-word characters occur
words = re.split(r'\W+', text)

This gives:

In [1]: words
Out[1]: ['',  'THE',  'WIND',  'IN',  'THE',  'WILLOWS',  'BY',  'KENNETH',  'GRAHAME',  'CONTENTS',  'CHAPTER',  'I',  'THE',  'RIVER',  'BANK',  'II',  'THE',  'OPEN',  'ROAD',  'III',  'THE',  'WILD',  'WOOD',  'IV',  'MR',  'BADGER',  'V',  'DULCE',  'DOMUM',  'VI',  'MR',  'TOAD',  'VII',  'THE',  'PIPER',  'AT',  'THE',  'GATES',  'OF',  'DAWN',  'VIII',  'TOAD',  'S',  'ADVENTURES',  'IX',  'WAYFARERS',  'ALL',  'X',  'THE',  'FURTHER',  'ADVENTURES',  'OF',  'TOAD',  'XI',  'LIKE',  'SUMMER',  'TEMPESTS',  'CAME',  'HIS',  'TEARS',  'XII',  'THE',  'RETURN',  'OF',  'ULYSSES',  'I',  'THE',  'RIVER',  'BANK']
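
Splitting on \W+ tokenizes the text, but it does not cover the four cases from the question: it breaks TOAD'S into TOAD and S, drops the period from abbreviations, and leaves an empty string at the start. One possible approach is to normalize the text with re.sub first and then extract tokens with re.findall. The following is only a sketch under assumptions: the function name normalize_and_split, the exact patterns, the hard-coded abbreviation list, and the sample sentence are all illustrative and not from the original post.

```python
import re

# (pattern, replacement) pairs, applied in order. These are illustrative
# guesses at the rules in the question, not a complete tokenizer.
RULES = (
    (r"([a-z])n['’]t\b", r"\1 not"),  # "didn't" -> "did not"
    (r"\b([a-z])-(?=\1)", ""),        # stutter: "k-k-kick" -> "kick"
    (r"-+", " "),                     # split remaining words at hyphens
)

# Hypothetical list of abbreviations that should keep their period
ABBREVIATIONS = ("Mme", "Mr", "Mrs")

# A token is a listed abbreviation with its period, or a run of word
# characters that may contain internal apostrophes (e.g. TOAD'S)
TOKEN = re.compile(
    r"\b(?:%s)\.|\w+(?:['’]\w+)*" % "|".join(ABBREVIATIONS)
)

def normalize_and_split(text):
    """Apply each rewrite rule in order, then return the list of tokens."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return TOKEN.findall(text)

sample = "Mme. Toad didn't k-k-kick the spring-cleaning brush"
print(normalize_and_split(sample))
# ['Mme.', 'Toad', 'did', 'not', 'kick', 'the', 'spring', 'cleaning', 'brush']
```

The contraction rule is deliberately naive: it turns "can't" into "ca not", so irregular forms would need their own entries before it in RULES.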