Extracting information from a string

Date: 2014-04-01 13:03:35

Tags: python string

The following information is given in a string:

  

[:T102684-1 coord="107,20,885,18":]27.[:/T102684-1:] [:T102684-2 coord="140,16,885,18":]A.[:/T102684-2:] [:T102684-3 coord="162,57,885,18":]Francke[:/T102684-3:][:T102684-4 coord="228,5,885,18":]:[:/T102684-4:] [:T102684-5 coord="240,27,885,18":]Die[:/T102684-5:] [:T102684-6 coord="274,42,885,18":]alpine[:/T102684-6:] [:T102684-7 coord="325,64,885,18":]Literatur[:/T102684-7:] [:T102684-8 coord="398,25,885,18":]des[:/T102684-8:] [:T102684-9 coord="427,46,885,18":]Jahres[:/T102684-9:] [:T102684-10 coord="480,33,885,18":]1888[:/T102684-10:] [:T102684-11 coord="527,29,885,18":]475[:/T102684-11:]

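How can I extract the Tab-ID (here: T102684), the Token-ID (the number after the "-"), the coordinates (107,20,885,18), and the token itself ("27.")? I tried the simple find methods, but that does not work...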

for tok in ele.text.split():
    print tok.find("[:T")
    print tok.rfind(":]")
    print tok[(tok.find("[:T")+2):tok.rfind("-")]

Thanks for your help!

1 answer:

Answer 0 (score: 1)

You can use a regular expression:

>>> import re
>>> s = '[:T102684-1 coord="107,20,885,18":]27.[:/T102684-1:] [:T102684-2 coord="140,16,885,18":]A.[:/T102684-2:] [:T102684-3 coord="162,57,885,18":]Francke[:/T102684-3:][:T102684-4 coord="228,5,885,18":]:[:/T102684-4:] [:T102684-5 coord="240,27,885,18":]Die[:/T102684-5:] [:T102684-6 coord="274,42,885,18":]alpine[:/T102684-6:] [:T102684-7 coord="325,64,885,18":]Literatur[:/T102684-7:] [:T102684-8 coord="398,25,885,18":]des[:/T102684-8:] [:T102684-9 coord="427,46,885,18":]Jahres[:/T102684-9:] [:T102684-10 coord="480,33,885,18":]1888[:/T102684-10:] [:T102684-11 coord="527,29,885,18":]475[:/T102684-11:]'
>>> r = re.compile(r'''\[:/?T(?P<token_id>\d+)-(?P<id>\d+)\s+coord="
                    (?P<coord>(\d+,\d+,\d+,\d+))":\](?P<token>\w+)''', flags=re.VERBOSE)
>>> for m in r.finditer(s):
        print m.groupdict()


{'token_id': '102684', 'token': '27', 'id': '1', 'coord': '107,20,885,18'}
{'token_id': '102684', 'token': 'A', 'id': '2', 'coord': '140,16,885,18'}
{'token_id': '102684', 'token': 'Francke', 'id': '3', 'coord': '162,57,885,18'}
{'token_id': '102684', 'token': 'Die', 'id': '5', 'coord': '240,27,885,18'}
{'token_id': '102684', 'token': 'alpine', 'id': '6', 'coord': '274,42,885,18'}
{'token_id': '102684', 'token': 'Literatur', 'id': '7', 'coord': '325,64,885,18'}
{'token_id': '102684', 'token': 'des', 'id': '8', 'coord': '398,25,885,18'}
{'token_id': '102684', 'token': 'Jahres', 'id': '9', 'coord': '427,46,885,18'}
{'token_id': '102684', 'token': '1888', 'id': '10', 'coord': '480,33,885,18'}
{'token_id': '102684', 'token': '475', 'id': '11', 'coord': '527,29,885,18'}
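
Note that the pattern above matches the token with \w+, so the output drops the trailing period from "27." and "A." and skips token 4 (":") entirely. Its token_id group also holds what the question calls the Tab-ID. The following is a minimal sketch of one possible variant, not the answerer's code. It assumes Python 3 and assumes that every token is closed by a matching [:/T...:] tag. The group names tab_id and token_id, the helper parse_tokens, and the shortened sample string are illustrative choices.

```python
import re

# Sample input in the format from the question, shortened to four tokens.
s = ('[:T102684-1 coord="107,20,885,18":]27.[:/T102684-1:] '
     '[:T102684-3 coord="162,57,885,18":]Francke[:/T102684-3:]'
     '[:T102684-4 coord="228,5,885,18":]:[:/T102684-4:] '
     '[:T102684-5 coord="240,27,885,18":]Die[:/T102684-5:]')

# Group names are illustrative; they follow the question's terminology.
TOKEN_RE = re.compile(r'''
    \[:T(?P<tab_id>\d+)-(?P<token_id>\d+)     # opening tag, e.g. [:T102684-1
    \s+coord="(?P<coord>\d+(?:,\d+){3})":\]   # four comma-separated integers
    (?P<token>.*?)                            # token text, matched non-greedily
    \[:/T(?P=tab_id)-(?P=token_id):\]         # closing tag with the same IDs
''', flags=re.VERBOSE)


def parse_tokens(text):
    """Return one dict per token, with coord converted to a tuple of ints."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        tokens.append({
            'tab_id': 'T' + m.group('tab_id'),
            'token_id': int(m.group('token_id')),
            'coord': tuple(int(n) for n in m.group('coord').split(',')),
            'token': m.group('token'),
        })
    return tokens


for tok in parse_tokens(s):
    print(tok)
```

Matching the token up to its own closing tag, instead of with \w+, keeps punctuation such as "27." and ":" in the result. The backreferences (?P=tab_id) and (?P=token_id) make sure the closing tag belongs to the same token.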