Unable to parse certain content using regular expressions

Time: 2018-06-18 11:29:04

Tags: python regex python-3.x

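I've tried to write a script in Python using the re module to parse the address, phone, and email out of a long string that has line breaks in it. The string contains two sets of containers. When I run my script, it gives me the result from the first container only, and even that includes parts I don't want. I'm not sure whether the approach I tried below is a valid one at all. Any help will be highly appreciated.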

What I've tried:

import re

rstr = """
    Address The Westshore Grand,
    A Tribute Portfolio Hotel, Tampa

    Telephone 52 70 90 00
    E-mail info.suchona@gmail.com


    Address hotels near 1255 north palm ave 
    sarasota florida

    Telephone 62 40 80 00
    E-mail info.niit@hotmail.com
"""
address = re.findall(r'(Address.+)',rstr)[0].strip()
phone = re.findall(r'(Telephone.+)',rstr)[0].strip()
email = re.findall(r'(E-mail.+)',rstr)[0].strip()
print(f'{address}\n{phone}\n{email}')

The result I'm getting:

Address The Westshore Grand,
Telephone 52 70 90 00
E-mail info.suchona@gmail.com

The output I'd like to have:

The Westshore Grand, A Tribute Portfolio Hotel, Tampa
52 70 90 00
info.suchona@gmail.com

hotels near 1255 north palm ave sarasota florida
62 40 80 00
info.niit@hotmail.com

Although I know this can be done with string manipulation, I'd like to stick to the regex approach. Thanks.

3 answers:

Answer 0: (score: 1)

Try this regex to get your address.

address = re.findall(r'(?<=Address).*?(?=Telephone)',rstr, flags=re.DOTALL)

Demo:

address = re.findall(r'(?<=Address).*?(?=Telephone)',rstr, flags=re.DOTALL)
phone = re.findall(r'(Telephone.+)',rstr)
email = re.findall(r'(E-mail.+)',rstr)
for i in zip(address, phone, email):
    print('{address}\n{phone}\n{email}'.format(address=i[0].strip(), phone=i[1].strip(), email=i[2].strip()))
    print( "-----" )

Output:

The Westshore Grand,
    A Tribute Portfolio Hotel, Tampa
Telephone 52 70 90 00
E-mail info.suchona@gmail.com
-----
hotels near 1255 north palm ave 
    sarasota florida
Telephone 62 40 80 00
E-mail info.niit@hotmail.com
-----
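
The output above still carries the Telephone and E-mail labels and the line break inside each address. As a minimal sketch (not the answerer's code), the same lookbehind idea can be applied to all three fields, with re.sub collapsing the whitespace. The sample string is a shortened copy of the question's data, and the variable names are only illustrative.

```python
import re

# Shortened copy of the sample text from the question (an assumption for this sketch).
rstr = """
    Address The Westshore Grand,
    A Tribute Portfolio Hotel, Tampa

    Telephone 52 70 90 00
    E-mail info.suchona@gmail.com
"""

# Lookbehind/lookahead keep the labels out of the match; DOTALL lets "." cross line breaks.
raw = re.findall(r'(?<=Address).*?(?=Telephone)', rstr, flags=re.DOTALL)

# Collapse every run of whitespace (newline plus indentation) into one space.
addresses = [re.sub(r'\s+', ' ', a).strip() for a in raw]

# The same lookbehind removes the labels from the single-line fields.
phones = [p.strip() for p in re.findall(r'(?<=Telephone).+', rstr)]
emails = [e.strip() for e in re.findall(r'(?<=E-mail).+', rstr)]

print(addresses, phones, emails, sep='\n')
```

Python's re module requires lookbehinds to be fixed-width, which is why the labels are written out literally here.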

Answer 1: (score: 0)

Make your regex capture group surround only the part you want. re.findall() returns every occurrence of the pattern, so you can simply loop through the results like so (assuming all three pieces of information are always present):

address = re.findall(r'Address(.+?)\n\n', rstr, flags=re.S)
phone = re.findall(r'Telephone(.+)', rstr)
email = re.findall(r'E-mail(.+)', rstr)

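for i in range(len(address)):
    # Collapse the line break and indentation inside each address into one space.
    print('\n'.join([
        re.sub(r'\s{2,}', ' ', address[i].strip()),  # raw string avoids an invalid-escape warning
        phone[i].strip(),
        email[i].strip()
    ]), end='\n\n')  # blank line between records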

Output:

The Westshore Grand, A Tribute Portfolio Hotel, Tampa
52 70 90 00
info.suchona@gmail.com

hotels near 1255 north palm ave sarasota florida
62 40 80 00
info.niit@hotmail.com
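
A possible variation, sketched here rather than taken from the answer, is to describe one whole record with a single pattern and iterate with re.finditer. That keeps the three fields of a record together in one match instead of relying on three parallel lists having the same length. The group names and the RECORD constant are hypothetical, and the sketch still assumes every record contains all three fields in this order.

```python
import re

rstr = """
    Address The Westshore Grand,
    A Tribute Portfolio Hotel, Tampa

    Telephone 52 70 90 00
    E-mail info.suchona@gmail.com


    Address hotels near 1255 north palm ave
    sarasota florida

    Telephone 62 40 80 00
    E-mail info.niit@hotmail.com
"""

# One pattern for a whole record; the named groups are illustrative choices.
RECORD = re.compile(
    r'Address\s+(?P<address>.+?)\s+'      # lazy, may span lines thanks to DOTALL
    r'Telephone\s+(?P<phone>[^\n]+?)\s+'  # stays on a single line
    r'E-mail\s+(?P<email>\S+)',           # runs up to the next whitespace
    flags=re.DOTALL,
)

records = [
    {
        'address': re.sub(r'\s+', ' ', m.group('address')),
        'phone': m.group('phone'),
        'email': m.group('email'),
    }
    for m in RECORD.finditer(rstr)
]

for rec in records:
    print(rec['address'], rec['phone'], rec['email'], sep='\n', end='\n\n')
```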

Answer 2: (score: 0)

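  • You want . to match line breaks as well: use re.DOTALL

  • You also want to grab everything between Address and Telephone, but non-greedily: .+?

  • In addition, you want to store that part as a group, so wrap it in ()

  • Replace every run of whitespace with a single space: re.sub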
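
To see what each of these points changes, here is a small sketch on a made-up two-record string (the text and variable names are assumptions for illustration only). It runs the same pattern without re.DOTALL, with a greedy quantifier, and with the non-greedy one.

```python
import re

# Made-up sample: two records, the first address spans two lines.
text = "Address line one,\n  line two\nTelephone 1\nAddress other\nTelephone 2\n"

# Without DOTALL, "." stops at a newline, so no address can reach "Telephone".
no_dotall = re.findall(r'Address (.+?)Telephone', text)

# Greedy ".+" with DOTALL runs to the LAST "Telephone", swallowing both records.
greedy = re.findall(r'Address (.+)Telephone', text, re.DOTALL)

# Non-greedy ".+?" with DOTALL stops at the FIRST "Telephone" after each "Address".
lazy = re.findall(r'Address (.+?)Telephone', text, re.DOTALL)

# re.sub turns each run of whitespace into one space; strip() drops the trailing one.
clean = [re.sub(r'\s+', ' ', a).strip() for a in lazy]

print(no_dotall, len(greedy), clean, sep='\n')
```

Because the pattern contains exactly one group, re.findall returns the group's text rather than the whole match, which is what keeps the Address and Telephone labels out of the results.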

Result

# strip() removes the trailing space left by the whitespace before "Telephone"
addresses = [re.sub(r'\s+', r' ', addr).strip()
             for addr in re.findall(r'Address (.+?)Telephone', rstr, re.DOTALL)]

Output

['The Westshore Grand, A Tribute Portfolio Hotel, Tampa',
 'hotels near 1255 north palm ave sarasota florida']

Do the same for the phone numbers and emails:

phones = re.findall(r'Telephone\s*(.+)\s*', rstr)
emails = re.findall(r'E-mail\s*(.+)\s*', rstr)

Then you can loop over them:

for addr, phone, email in zip(addresses, phones, emails):
    print(addr, phone, email, sep='\n', end='\n\n')

Output

The Westshore Grand, A Tribute Portfolio Hotel, Tampa 
52 70 90 00
info.suchona@gmail.com

hotels near 1255 north palm ave sarasota florida 
62 40 80 00
info.niit@hotmail.com
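
One caveat with looping over three separate lists: zip() stops at the shortest list, so a record with a missing field would silently drop data. A hedged sketch of one way to make that visible uses itertools.zip_longest with a placeholder. The lists below are hard-coded stand-ins for the findall results, with one email removed on purpose.

```python
from itertools import zip_longest

# Stand-ins for the lists built above; one e-mail is deliberately missing.
addresses = ['The Westshore Grand, A Tribute Portfolio Hotel, Tampa',
             'hotels near 1255 north palm ave sarasota florida']
phones = ['52 70 90 00', '62 40 80 00']
emails = ['info.suchona@gmail.com']

# zip() would yield only one row here; zip_longest pads the short list instead.
rows = list(zip_longest(addresses, phones, emails, fillvalue='(missing)'))

for addr, phone, email in rows:
    print(addr, phone, email, sep='\n', end='\n\n')
```

The padding only shows that something is missing. It cannot tell which record the gap belongs to, so it is a diagnostic aid rather than a fix.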