An alternative to the regular expression function

Asked: 2017-01-28 18:32:14

Tags: python regex list indexing append

Here is my code:

import re
with open('newfiles.txt') as f:
   k = f.read()
p = re.compile(r'\w+|[^\w\-\s]')
originaltext = p.findall(k)
uniquelist = []
for word in originaltext:
   if word not in uniquelist:
       uniquelist.append(word)
indexes = ' '.join(str(uniquelist.index(word)+1) for word in originaltext)
print('Here are the index positions of the text file : ' + indexes)

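It takes a text file (a few random sentences with punctuation) and prints, for every word and punctuation mark in order, its position in the list of unique items. If an item occurs more than once, the position of its first occurrence is shown. Punctuation marks are treated as individual words in this program.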
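To make that behaviour concrete, here is a minimal sketch on a short sentence. The sample text is made up for illustration and is not taken from newfiles.txt; the pattern is the same one as in the code above.

```python
import re

# Hypothetical sample text, chosen only because it contains repeats.
sample = "the cat, the dog, the cat."

# Same pattern as above: runs of word characters, or single punctuation marks.
tokens = re.findall(r'\w+|[^\w\-\s]', sample)

# Remember each token the first time it is seen.
unique = []
for token in tokens:
    if token not in unique:
        unique.append(token)

# 1-based position of each token's first occurrence in the unique list.
positions = [unique.index(token) + 1 for token in tokens]

print(tokens)     # ['the', 'cat', ',', 'the', 'dog', ',', 'the', 'cat', '.']
print(positions)  # [1, 2, 3, 1, 4, 3, 1, 2, 5]
```

The repeated "the", "cat" and "," reuse the numbers they were given the first time, which is why the real output contains so many repeated 2s (the comma).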

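I wrote this code and then tried to simplify it. With the regular expression functions I only need two lines to find and separate the words and punctuation, so it is very efficient. However, does anyone know a simpler, less complicated way to do this without regular expressions? Please note that if you answer, do not change the other parts of the code; I only want another way to do the same job (splitting the text so the word indexes can be displayed) without regex. It will obviously be longer, so length does not matter.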

newfiles.txt

Parkour, also known as freerunning, is a relatively new sport founded by Sebastian Foucan, who showed off his skills in the James Bond movie "Casino Royale", which was released in 2006. Parkour is running, jumping over obstacles, or climbing over buildings and walls.
It is daring, breathtaking and at times terrifying, and now it is also an official sport in the UK, making the UK the first country in the world to recognise it. This means that people can teach parkour in schools.
Some people are worried about the sport being too dangerous, but the founder says that it is as safe as any sport, comparing to rugby, wrestling, surfing or climbing, but, - if you do not do it in the right way, you can get hurt.

Output

Here are the index positions of the text file : 1 2 3 4 5 6 2 7 8 9 10 11 12 13 14 15 2 16 17 18 19 20 21 22 23 24 25 26 27 28 26 2 29 30 31 21 32 33 1 7 34 2 35 36 37 2 38 39 36 40 41 42 33 43 7 44 2 45 41 46 47 48 2 41 49 50 7 3 51 52 11 21 22 53 2 54 22 53 22 55 56 21 22 57 58 59 50 33 60 61 62 63 64 65 66 21 67 33 68 63 69 70 71 22 11 72 73 74 2 75 22 76 77 62 50 7 5 78 5 79 11 2 80 58 81 2 82 2 83 38 39 2 75 2 84 85 86 87 86 50 21 22 88 89 2 85 64 90 91 33

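Thanks

1 answer:

Answer 0 (score: 0)

Writing "readable code" is really hard. I still don't understand why you would want to do this, but it is a nice challenge :) I couldn't help myself and also changed the way you build the unique collection (using OrderedDict):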

import re
from collections import OrderedDict
import string
from numpy.testing import assert_array_equal  # public path; numpy.testing.utils is a deprecated alias

k = '''Parkour, also known as freerunning, is a relatively new sport founded by Sebastian Foucan, who showed off his skills in the James Bond movie "Casino Royale", which was released in 2006. Parkour is running, jumping over obstacles, or climbing over buildings and walls.
It is daring, breathtaking and at times terrifying, and now it is also an official sport in the UK, making the UK the first country in the world to recognise it. This means that people can teach parkour in schools.
Some people are worried about the sport being too dangerous, but the founder says that it is as safe as any sport, comparing to rugby, wrestling, surfing or climbing, but, - if you do not do it in the right way, you can get hurt.'''

# the one you know
p = re.compile(r'\w+|[^\w\-\s]')
originaltext = p.findall(k)
uniquelist = []
for word in originaltext:
   if word not in uniquelist:
       uniquelist.append(word)
indexes = ' '.join(str(uniquelist.index(word)+1) for word in originaltext)


# the 'readable one'
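# characters that count as part of a word; "_" is included because \w matches it too
# (ASCII only: unlike \w in Python 3, this does not cover accented or non-Latin letters)
w = string.ascii_letters + string.digits + "_"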
originaltext2 = []
word = ""
for char in k:
    if char in " -\t\n\r\f\v":
        if word != "":
            originaltext2.append(word)
        word = ""
    elif char not in w:
        if word != "":
            originaltext2.append(word)
        originaltext2.append(char)
        word = ""
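    else:
        word += char
# flush the last word in case the text does not end with punctuation or whitespace
if word != "":
    originaltext2.append(word)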

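# OrderedDict.fromkeys drops duplicates but keeps first-seen order;
# list() is needed on Python 3, where a keys view has no .index() method
uniquelist2 = list(OrderedDict.fromkeys(originaltext2))
indexes2 = ' '.join(str(uniquelist2.index(word)+1) for word in originaltext2)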

# same output
assert_array_equal(indexes, indexes2)
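
If changing the other parts is acceptable after all, the same idea can be packaged as two small functions. This is only a sketch under some assumptions: the names tokenize and first_seen_positions are made up here, only ASCII letters, digits and "_" count as word characters, and str.isspace() stands in for the regex \s. The lookup uses a plain dict instead of list.index, so each token is found without rescanning the unique list.

```python
import string


def tokenize(text):
    """Split text into words and single punctuation marks without regex.

    Whitespace and "-" are dropped, mirroring the pattern \\w+|[^\\w\\-\\s].
    """
    word_chars = set(string.ascii_letters + string.digits + "_")
    tokens = []
    word = ""
    for char in text:
        if char in word_chars:
            word += char
            continue
        # Any other character ends the current word, if there is one.
        if word:
            tokens.append(word)
            word = ""
        # Keep punctuation as its own token; skip whitespace and hyphens.
        if not char.isspace() and char != "-":
            tokens.append(char)
    # Flush a word that runs up to the end of the text.
    if word:
        tokens.append(word)
    return tokens


def first_seen_positions(tokens):
    """Return, for each token, the 1-based rank of its first occurrence."""
    first_seen = {}
    positions = []
    for token in tokens:
        # setdefault stores the new number only if the token is not a key yet.
        positions.append(first_seen.setdefault(token, len(first_seen) + 1))
    return positions


sample = "Parkour, also known as freerunning, is"
print(first_seen_positions(tokenize(sample)))  # [1, 2, 3, 4, 5, 6, 2, 7]
```

The printed list matches the start of the output in the question (1 2 3 4 5 6 2 7), and joining the numbers with ' '.join(str(n) for n in ...) gives the same string format.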