Question

I am trying to extract matching groups from a Python string, but I have run into a problem.

The string looks like this.

1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on 
title BCD and maybe something else 3. TITLE CDC Contents of title cdc

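I need anything that starts with a number followed by uppercase letters to be treated as a title, and I want to extract the contents under that title.

This is the output I expect.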

1. TITLE ABC Contents of title ABC and some other text
2. TITLE BCD This would have contents on title BCD and maybe something else 
3. TITLE CDC Contents of title cdc

I tried the following regular expression

(\d\.\s[A-Z\s]*\s)

and got the following.

1. TITLE ABC 
2. TITLE BCD 
3. TITLE CDC

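If I try adding .* to the end of the regex, the matched groups are affected. I think I am missing something simple here. I have tried everything I know but could not solve it.

Thanks for your help.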

Answer 1

Use (\d+\.[\da-z]* [A-Z]+[\S\s]*?(?=\d+\.|$))

Here is the relevant code:

import re
text = """1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""

result = re.findall(r'('
                    r'\d+\.'   # Match a number and a '.' character
                    r'[\da-z]*' # If present include any additional numbers/letters
                    r'(?:\.[\da-z])*' # Match additional subpoints.
                                      # Each of these subpoints must start with a '.'
                                      # followed by a single digit or lowercase letter
                    r' '   # Match a space. This is how we know to stop looking for subpoints,
                           # and to start looking for capital letters
                    r'[A-Z]+'  # Match at least one capital letter.
                               # Use [A-Z]{2,} to match 2 or more capital letters
                    r'[\S\s]*?'  # Match everything including newlines.
                                 # Use .*? if you don't care about matching newlines
                    r'(?=\d+\.|$)'  # Stop matching at a number and a '.' character,
                                    # or stop matching at the end of the string,
                                    # and don't include this match in the results.
                    r')'
                    , text)
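
The same pattern can also be written as a single commented regex with re.VERBOSE, which some readers find easier to maintain. The following is only a sketch of that idea: the names SECTION_RE and sections are illustrative, and stripping the trailing whitespace is an added assumption about the desired output.

```python
import re

text = ("1. TITLE ABC Contents of title ABC and some other text "
        "2. TITLE BCD This would have contents on\n"
        "title BCD and maybe something else 3. TITLE CDC Contents of title cdc")

# Hypothetical name; in VERBOSE mode whitespace in the pattern is ignored,
# so the literal space has to be written inside a character class.
SECTION_RE = re.compile(r"""
    \d+\.             # a number followed by a dot
    [\da-z]*          # optional extra digits or lowercase letters
    (?:\.[\da-z])*    # optional subpoints such as .1 or .a
    [ ]               # a literal space
    [A-Z]+            # at least one capital letter
    [\S\s]*?          # lazily match anything, newlines included
    (?=\d+\.|$)       # stop before the next number and dot, or at the end
""", re.VERBOSE)

# Strip the whitespace left in front of the next section number.
sections = [m.strip() for m in SECTION_RE.findall(text)]

for section in sections:
    print(section)
```

Because the pattern contains no capturing group here, findall returns the full text of each match.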

Here is a more detailed explanation of each regex character used

Answer 2

Your regex has no lowercase letters in its character class, so it only matches the uppercase words.

You can simply use this

(\d\.[\s\S]+?)(?=\d+\.|$)

Sample code

import re
text = """1. TITLE ABC Contents of 14 title ABC and some other text 2. TITLE BCD This would have contents on 
title BCD and maybe something else 3. TITLE CDC Contents of title cdc"""
result = re.findall(r'(\d\.[\s\S]+?)(?=\d+\.|$)', text)
print(result)

Output

['1. TITLE ABC Contents of 14 title ABC and some other text ', '2. TITLE BCD This would have contents on \ntitle BCD and maybe something else ', '3. TITLE CDC Contents of title cdc']

Note: you can even replace [\s\S]+? with .+? if you use the single-line flag (re.DOTALL in Python), so that . also matches newlines.
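
To illustrate the note about the single-line flag, here is a small sketch comparing the two spellings. The variable names explicit, dotall and no_flag are made up for this example.

```python
import re

text = ("1. TITLE ABC Contents of 14 title ABC and some other text "
        "2. TITLE BCD This would have contents on \n"
        "title BCD and maybe something else 3. TITLE CDC Contents of title cdc")

# [\s\S] matches any character, newlines included, without needing a flag.
explicit = re.findall(r'(\d\.[\s\S]+?)(?=\d+\.|$)', text)

# With re.DOTALL the dot also matches newlines, so .+? behaves the same way.
dotall = re.findall(r'(\d\..+?)(?=\d+\.|$)', text, flags=re.DOTALL)

# Without the flag the dot stops at the newline, so the second section,
# which spans two lines, is not matched at all.
no_flag = re.findall(r'(\d\..+?)(?=\d+\.|$)', text)

print(explicit == dotall)
print(len(no_flag))
```

The flag only matters when a section can span several lines; for single-line input both forms give the same matches.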

Answer 3

You can use re.findall together with re.split:

import re
s = "1. TITLE ABC Contents of title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc"
t, c = re.findall(r'\d+\.\s[A-Z]+', s), list(filter(None, re.split(r'\d+\.\s[A-Z]+', s)))
result = [f'{a}{b}' for a, b in zip(t, c)]
print(result)

Output:

['1. TITLE ABC Contents of title ABC and some other text ', '2. TITLE BCD This would have contents on title BCD and maybe something else ', '3. TITLE CDC Contents of title cdc']
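
A possible variation on the same idea, offered only as a sketch: splitting on the whitespace that precedes each numbered title, using a lookahead, avoids having to stitch the titles and contents back together. The name parts is illustrative and not part of the answer above.

```python
import re

s = ("1. TITLE ABC Contents of title ABC and some other text "
     "2. TITLE BCD This would have contents on title BCD and maybe something else "
     "3. TITLE CDC Contents of title cdc")

# Split on whitespace that is immediately followed by "number. CAPITAL".
# The lookahead is not consumed, so each piece keeps its own number and title.
parts = re.split(r'\s+(?=\d+\.\s[A-Z])', s)

for part in parts:
    print(part)
```

Since the separating whitespace is consumed by the split, the pieces come out without trailing spaces.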

Answer 4

import re
a = r'1. TITLE ABC Contents of 2title ABC and some other text 2. TITLE BCD This would have contents on title BCD and maybe something else 3. TITLE CDC Contents of title cdc'
# Raw string so that \d and \s reach the regex engine unchanged
res = re.findall(r'(\d\.\s[A-Za-z0-9\s]*\s)', a)
for e in res:
    print(e)

Output

1. TITLE ABC Contents of 2title ABC and some other text 
2. TITLE BCD This would have contents on title BCD and maybe something else 
3. TITLE CDC Contents of title
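
Note that the last line of this output loses the final word "cdc", because the pattern requires a whitespace character at the end of every match. One way to avoid that, sketched here under the assumption that sections contain only letters, digits and whitespace, is a lazy quantifier with a lookahead. The name pattern is illustrative.

```python
import re

a = ('1. TITLE ABC Contents of 2title ABC and some other text '
     '2. TITLE BCD This would have contents on title BCD and maybe something else '
     '3. TITLE CDC Contents of title cdc')

# Lazily consume letters, digits and whitespace, stopping just before
# the next "number. " marker or before the end of the string.
pattern = r'\d+\.\s[A-Za-z0-9\s]*?(?=\s*\d+\.\s|\s*$)'
res = re.findall(pattern, a)

for e in res:
    print(e)
```

A digit inside the text, such as the one in "2title", does not end the match early because the lookahead also requires a dot and a whitespace character after the number.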

Extracting matching groups from a string with a Python regex

4 answers: