Splitting text based on whitespace

Time: 2018-04-24 05:01:49

Tags: python html regex parsing split

            No time. Not today.
                (slides in last bullets)
            Ten, eleven, twelve... or bust.
                (chambers a shell into each
                 gun, looks up)
            Right here!

The cab SCREECHES to a stop on the shoulder of the highest
FREEWAY in a massive INTERCHANGE of freeways. Dopinder halts
the meter and hands Deadpool his CARD.

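My goal is to parse the text above so that the dialogue is separated from the description. My file contains multiple instances like this one. The output should be two separate strings, x and y, where x = "No time. Not today... Right here!" and y = "The cab SCREECHES ... his CARD".

How can I achieve this with regular-expression matching? Or is there a better way to solve this problem? I am using Python.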

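2 answers:

Answer 0 (score: 0)

Use BeautifulSoup to parse the content of the web page. Extracting content by the tags you need is easier that way, and parsing HTML with regular expressions is not a good idea.

Demo: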

from bs4 import BeautifulSoup
s = """<b>                          DEADPOOL (CONT'D) </b>                Little help?

    The cabbie grabs Deadpool's hand and pulls him through to the
    front. Deadpool's head rests upside down on the bench seat
    as he maneuvers his legs through. The cabbie turns the
    helping hand into a HANDSHAKE, then turns down the Juice.

<b>                            CABBIE </b>"""

soup = BeautifulSoup(s, "html.parser")
print(soup.text)

Output:

Answer 1 (score: 0)

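You seem to have mistaken the string "Little help?" for "A little help?". The x and y you want to extract are strings in the same block, separated by a blank line (two newlines, `\n\n`).

You can try this: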

ss="""<b>                          DEADPOOL (CONT'D) </b>                Little help?

The cabbie grabs Deadpool's hand and pulls him through to the
front. Deadpool's head rests upside down on the bench seat
as he maneuvers his legs through. The cabbie turns the
helping hand into a HANDSHAKE, then turns down the Juice.

<b>                            CABBIE </b>"""
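import re

# Grab every run of text that sits between a closing '>' and the next '<'.
regx = re.compile(r'(?s)(?<=\>)[^<>]*(?=\<)')
lst = [m.strip() for m in regx.findall(ss)]
# lst[1] is the text after the first </b>; a blank line separates x from y.
xy = [m.strip() for m in re.split(r'\n{2}', lst[1])]
for i in xy:
    print(i + "\n")     # x = xy[0], y = xy[1]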

The output is:

Little help?

The cabbie grabs Deadpool's hand and pulls him through to the
front. Deadpool's head rests upside down on the bench seat
as he maneuvers his legs through. The cabbie turns the
helping hand into a HANDSHAKE, then turns down the Juice.
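
Since the question mentions that the file contains several such instances, the same idea can be wrapped in a small function that walks over every chunk of text between tags. This is only a sketch under a few assumptions: the name `split_chunks` is made up for illustration, the sample markup is abridged and partly placeholder text, and a chunk counts as a dialogue/description pair only if it contains a blank line.

```python
import re

def split_chunks(markup):
    """Return (dialogue, description) pairs for the text found between tags."""
    pairs = []
    # Text between a closing '>' and the next '<', as in the regex above.
    # Caveat: the last chunk is only seen if another tag follows it.
    for chunk in re.findall(r'(?s)(?<=>)[^<>]*(?=<)', markup):
        # Split on blank lines and collapse whitespace inside each part.
        parts = [" ".join(p.split())
                 for p in re.split(r'\n\s*\n', chunk) if p.strip()]
        # Chunks without a blank line (e.g. speaker names) are skipped.
        if len(parts) >= 2:
            pairs.append((parts[0], " ".join(parts[1:])))
    return pairs

# Abridged sample; the second instance is placeholder text.
sample = """<b>   DEADPOOL (CONT'D) </b>      Little help?

The cabbie turns down the Juice.

<b>   CABBIE </b>      Placeholder line.

Placeholder description
spanning two lines.

<b>   DEADPOOL </b>"""

pairs = split_chunks(sample)
for x, y in pairs:
    print("x =", x)
    print("y =", y)
```

A dialogue chunk that is followed directly by the next speaker, with no description in between, is dropped by this sketch; adjust the `len(parts) >= 2` check if you need to keep those.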

Edited to address your second, additional question.

import re
ss = """copy&paste_Your_Input_string_Here"""
xy = [m.strip() for m in re.split(r'\n{2}', ss)]  # split on blank lines
for i in xy:
    print(i + "\n")     # x = xy[0], y = xy[1]
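
For the plain-text excerpt in the question, here is a rough sketch of the same blank-line split, with the indentation and line breaks collapsed so that x and y come out as single-line strings. The function name `split_dialogue_description` and the `drop_parentheticals` flag are hypothetical. The question does not say whether the parenthesised stage directions belong in x, so dropping them is an assumption you can switch off.

```python
import re

sample = """            No time. Not today.
                (slides in last bullets)
            Ten, eleven, twelve... or bust.
                (chambers a shell into each
                 gun, looks up)
            Right here!

The cab SCREECHES to a stop on the shoulder of the highest
FREEWAY in a massive INTERCHANGE of freeways. Dopinder halts
the meter and hands Deadpool his CARD.
"""

def split_dialogue_description(text, drop_parentheticals=False):
    """Split a block on its first blank line and collapse whitespace runs."""
    parts = re.split(r'\n\s*\n', text.strip(), maxsplit=1)
    dialogue = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    if drop_parentheticals:
        # Assumption: stage directions are the "(...)" spans in the dialogue.
        dialogue = re.sub(r'\([^)]*\)', ' ', dialogue)
    # str.split() with no argument splits on any run of whitespace.
    x = " ".join(dialogue.split())
    y = " ".join(description.split())
    return x, y

x, y = split_dialogue_description(sample, drop_parentheticals=True)
print(x)
print(y)
```

Using `\n\s*\n` instead of `\n{2}` also tolerates "blank" lines that contain stray spaces, which is common in text copied from a web page.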