Extracting sentences with a Python regular expression

Date: 2017-03-02 09:36:19

Tags: python regex

I have the following Markdown file:

#2016-12-24
| 单词 | 解释 | 例句 |
| --------- | -------- | --------- |
|**accelerator;**| - | - |
|**compass**| - | - |
|**wheels**| - | - |
|**fabulous**| - | - |
|**sweeping**| - | - |
|**prospect**| - | - |
|**pumpkin**| - | - |
|**trolley**| - | - |
|snapped,**| - | - |
|tip| - | - |
|lap| - | - |
|tether.| - | - |
|damp| - | - |
|triumphant| - | - |
|sarcastic| - | - |
|missed out| - | - |
|sidekick| - | - |
|considerable| - | - |
|Willow.| - | - |
|eagle.| - | - |
|considerably.| - | - |
|flat.| - | - |
|feast| - | - |
|scramble| - | - |
|turned up| - | - |
|rounded off| - | - |
|rat| - | - |
|resembled| - | - |
|By the time she had clambered back into the car,| - | - |
|By the time she had clambered back into the car, they were running very late,| - | - |
|wheeled his trolley| - | - |
|barrier,| - | - |
|bounced| - | - |
|in blazes| - | - |
|clutching| - | - |
|sealed| - | - |
|stunned.| - | - |
|‘We’re stuck,| - | - |
|marched off| - | - |
|accelerator| - | - |
|and the prospect of seeing Fred and George’s jealous faces| - | - |
|protest.| - | - |
|in protest.| - | - |
|horizon,| - | - |
|knuckles| - | - |
|metal| - | - |
|thick| - | - |
|reached the end of its tether.| - | - |
|Artefacts| - | - |
|blurted out.| - | - |
|gaped| - | - |
|I will be writing to both your families tonight.| - | - |
|‘Can you believe our luck, though?’| - | - |
|‘Skip the lecture,’| - | - |
|people’ll be talking about that one for years!’| - | - |
|nudged| - | - |
|‘I know I shouldn’t’ve enjoyed that or anything, but –’| - | - |
|dashed| - | - |

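I want to extract sentences like these:

  1. By the time she had clambered back into the car,
  2. By the time she had clambered back into the car, they were running very late,
  3. wheeled his trolley
  4. ‘We’re stuck,
  5. and the prospect of seeing Fred and George’s jealous faces
  6. reached the end of its tether.
  7. I will be writing to both your families tonight.
  8. ‘Can you believe our luck, though?’
  9. ‘Skip the lecture,’
  10. people’ll be talking about that one for years!’
  11. ‘I know I shouldn’t’ve enjoyed that or anything, but –’

I tried to do this on regex101.com, but every pattern I wrote matched all of the rows.

Can anyone help me?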

2 answers:

Answer 0 (score: 1)

Try this:

^\|[^\w\|]*(\w+\s+(?=\w+)[^\|]*)

Explanation

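  1. ^\| matches only if the line starts with a pipe (|)
  2. [^\w\|]* grabs anything that is neither a word character (letters, digits, underscore) nor a pipe, such as leading * or quote characters
  3. \w+\s+ requires a word followed by one or more whitespace characters
  4. (?=\w+) then checks that another word follows
  5. [^\|]* if the previous conditions hold, grabs everything up to the next pipe |
  6. For each match, group 1 contains the sentence you want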

    Run the Code Sample here
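As a rough illustration of how the pattern above might be applied from Python, here is a minimal sketch. The `SAMPLE` string is a hypothetical excerpt of the table from the question, and `extract_phrases` is a name made up for this example. It assumes `re.MULTILINE` is needed so that `^` matches at the start of every line.

```python
import re

# Hypothetical excerpt of the table from the question.
SAMPLE = """\
#2016-12-24
| 单词 | 解释 | 例句 |
| --------- | -------- | --------- |
|**compass**| - | - |
|tip| - | - |
|wheeled his trolley| - | - |
|‘We’re stuck,| - | - |
|reached the end of its tether.| - | - |
"""

# Pattern from this answer; re.MULTILINE lets ^ match at each line start.
PATTERN = re.compile(r"^\|[^\w\|]*(\w+\s+(?=\w+)[^\|]*)", re.MULTILINE)


def extract_phrases(text):
    """Return group 1 of every match: first-column cells with 2+ words."""
    # With exactly one capturing group, findall returns that group's text.
    return PATTERN.findall(text)


print(extract_phrases(SAMPLE))
# → ['wheeled his trolley', 'reached the end of its tether.']
```

Note that the row `‘We’re stuck,` is not returned here: the first word must be followed directly by whitespace, and the apostrophe in `We’re` breaks that. Two-word entries such as `missed out` would also match, so some post-filtering may still be wanted.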

Answer 1 (score: 0)

You could come up with:

^\|                     # start of line, followed by |
(                       # capture the "words"
    (?:[‘\w]+           # a non-capturing group and at least one of \w or ‘
        (?:[^|\w\n\r]+  # followed by NOT one of these
        |               # or
        (?=\|)          # make sure, there's a | straight ahead
    )
){2,})                  # repeat the construct at least 2 times
\|

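See a demo on regex101.com (and mind the modifiers: the pattern needs multiline and verbose mode). This captures entries of at least two consecutive words; if you need more, change the number inside the {} braces.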
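The modifiers mentioned above map onto `re.VERBOSE` and `re.MULTILINE` in Python. The following is a minimal sketch under that assumption; `SAMPLE` is a hypothetical excerpt of the question's table, and `extract_phrases_verbose` and `MIN_WORDS_PATTERN` are names invented for the example.

```python
import re

# Hypothetical excerpt of the table from the question.
SAMPLE = """\
#2016-12-24
| 单词 | 解释 | 例句 |
| --------- | -------- | --------- |
|**compass**| - | - |
|tip| - | - |
|wheeled his trolley| - | - |
|‘We’re stuck,| - | - |
|reached the end of its tether.| - | - |
"""

# Pattern from this answer. re.VERBOSE allows the layout and comments,
# re.MULTILINE lets ^ match at the start of every line.
MIN_WORDS_PATTERN = re.compile(r"""
^\|                     # start of line, followed by |
(                       # capture the "words"
    (?:[‘\w]+           # at least one of \w or ‘
        (?:[^|\w\n\r]+  # followed by NOT one of these
        |               # or
        (?=\|)          # a | straight ahead
    )
){2,})                  # repeat the construct at least 2 times
\|
""", re.VERBOSE | re.MULTILINE)


def extract_phrases_verbose(text):
    """Return the first-column cells made of at least two word chunks."""
    return MIN_WORDS_PATTERN.findall(text)


print(extract_phrases_verbose(SAMPLE))
# → ['wheeled his trolley', '‘We’re stuck,', 'reached the end of its tether.']
```

Because `’` counts as a separator between word chunks, `‘We’re stuck,` is kept, leading quote included. Raising the `2` in `{2,}` would restrict the result to longer entries.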