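I have read in a file named abc.txt.
Now I want to use regular expressions to split the file's text into words in these four categories.
The text of abc.txt looks like this: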
**THE WIND IN THE WILLOWS BY KENNETH GRAHAME CONTENTS CHAPTER I. THE RIVER BANK II. THE OPEN ROAD III. THE WILD WOOD IV. MR. BADGER V. DULCE DOMUM VI. MR. TOAD VII. THE PIPER AT THE GATES OF DAWN VIII. TOAD'S ADVENTURES IX. WAYFARERS ALL X. THE FURTHER ADVENTURES OF TOAD XI. "LIKE SUMMER TEMPESTS CAME HIS TEARS" XII. THE RETURN OF ULYSSES
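I. THE RIVER BANK
The Mole had been working very hard all the morning, spring-cleaning his little home. First with brooms, then with dusters; then on ladders and steps and chairs, with a brush and a pail of whitewash; till he had dust in his throat and eyes, and splashes of whitewash all over his black fur, and an aching back and weary arms. Spring was moving in the air above and in the earth below and around him, penetrating even his dark and lowly little house with its spirit of divine discontent and longing. It was small wonder, then, that he suddenly flung down his brush on the floor, said 'Bother!' and 'O blow!' and also 'Hang spring-cleaning!' and bolted out of the house without even waiting to put on his coat.**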
What I have tried:
import re
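# (pattern, replacement) pairs; raw strings keep \b and \1 from being read as escape characters
RE = ((r"([a-z])n’t\b", r"\1not"), (r"\bma’a?m\b", "madam"), (r"W([a-z])-([a-z])", r"\1\2"), (r"-+", " "))
# Read the whole file and close it afterwards
with open("abc.txt", "r") as f:
    W = f.read()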
W
Now I get the following output:
What I expect is:
Answer 0 (score: 0)
Try using re.split
Approach:
# Import regular expression operations
import re
# Text from the file
text = """** THE WIND IN THE WILLOWS
BY KENNETH GRAHAME
CONTENTS
CHAPTER
I.THE RIVER BANK
II.THE OPEN ROAD
III.THE WILD WOOD
IV.MR.BADGER
V.DULCE DOMUM
VI.MR.TOAD
VII.THE PIPER AT THE GATES OF DAWN
VIII.TOAD'S ADVENTURES
IX.WAYFARERS ALL
X.THE FURTHER ADVENTURES OF TOAD
XI."LIKE SUMMER TEMPESTS CAME HIS TEARS"
XII.THE RETURN OF ULYSSES
I.THE RIVER BANK"""
# Split text wherever one-or-more non-word characters occur
words = re.split(r'\W+', text)
This gives:
In [1]: words
Out[1]: ['', 'THE', 'WIND', 'IN', 'THE', 'WILLOWS', 'BY', 'KENNETH', 'GRAHAME', 'CONTENTS', 'CHAPTER', 'I', 'THE', 'RIVER', 'BANK', 'II', 'THE', 'OPEN', 'ROAD', 'III', 'THE', 'WILD', 'WOOD', 'IV', 'MR', 'BADGER', 'V', 'DULCE', 'DOMUM', 'VI', 'MR', 'TOAD', 'VII', 'THE', 'PIPER', 'AT', 'THE', 'GATES', 'OF', 'DAWN', 'VIII', 'TOAD', 'S', 'ADVENTURES', 'IX', 'WAYFARERS', 'ALL', 'X', 'THE', 'FURTHER', 'ADVENTURES', 'OF', 'TOAD', 'XI', 'LIKE', 'SUMMER', 'TEMPESTS', 'CAME', 'HIS', 'TEARS', 'XII', 'THE', 'RETURN', 'OF', 'ULYSSES', 'I', 'THE', 'RIVER', 'BANK']
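Note that the result above starts with an empty string (the text begins with non-word characters) and that TOAD'S is split into 'TOAD' and 'S'. If that is not wanted, one possible alternative is to match the words directly with re.findall instead of splitting on the separators. The following is only a sketch: the pattern and the name WORD are my own assumptions, not something from the question, and the sample text is shortened.

```python
import re

# Shortened sample; the full text would come from abc.txt
text = """** THE WIND IN THE WILLOWS
VIII.TOAD'S ADVENTURES
XI."LIKE SUMMER TEMPESTS CAME HIS TEARS"
"""

# Assumed pattern: runs of ASCII letters, optionally joined by an
# internal straight apostrophe, so that TOAD'S stays one token
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

# findall returns only the matches, so no empty strings appear
words = WORD.findall(text)
print(words)
```

Because findall collects matches rather than the pieces between separators, no filtering of empty strings is needed. The pattern only covers ASCII letters and the straight apostrophe; text with curly apostrophes or accented letters would need a wider character class.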