Question

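I am trying to extract all characters (usually several words, including spaces) between the special character > and the word pattern .myword in my pandas DataFrame.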

I tried the following, which only captures the single word immediately before .myword:

df['my_column'] = df['text'].str.findall(r'(\w+.myword)')

Some example strings:

str1 = 123abc >I want this1.myword #extract I want this1.myword
str2 =  123<>I want this2.myword<> #extract I want this2.myword

Answer 1

First, a plain dot . matches any character, so you need to escape it in your regex: \. Otherwise the regex would also find a match in, for example:

123>Iwantthis!myword # extracts Iwantthis!myword

Second, you need to allow whitespace characters (\s) inside the capture group.

I think this should do the job for you: r'([\w\s]+\.myword)'
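To make the difference concrete, here is a minimal sketch that runs both patterns with the standard-library re module on the strings from the question. The names ORIGINAL, FIXED and samples are only illustrative. On a DataFrame column the same patterns can be passed to df['text'].str.findall.

```python
import re

# Pattern from the question: unescaped dot, no whitespace allowed.
ORIGINAL = r'(\w+.myword)'
# Pattern suggested in this answer: escaped dot, whitespace allowed.
FIXED = r'([\w\s]+\.myword)'

samples = [
    "123abc >I want this1.myword",
    "123<>I want this2.myword<>",
]

# With a single capture group, re.findall returns the captured strings.
original_hits = [re.findall(ORIGINAL, s) for s in samples]
fixed_hits = [re.findall(FIXED, s) for s in samples]

# The unescaped dot also accepts "!" in place of the literal ".".
false_positive = re.findall(ORIGINAL, "123>Iwantthis!myword")
no_match = re.findall(FIXED, "123>Iwantthis!myword")

print(original_hits)
print(fixed_hits)
print(false_positive, no_match)
```

Note that [\w\s]+ stops at any punctuation, so this pattern does not look for > itself. It just happens to start after it because > is neither a word character nor whitespace.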

Answer 2

Instead of using a regex, I would define a dedicated function to extract the substring:

Code

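def substring(original_string):
    """Return the text between the first '>' and the next '.myword', or None."""
    start = original_string.find(">")
    # Search for the suffix only after the '>' so the slice cannot run backwards.
    end = original_string.find(".myword", start + 1)

    if (start > -1) and (end > -1):
        # The result excludes both the '>' and the '.myword' suffix.
        return original_string[start + 1:end]
    else:
        return None


# The function can be passed directly; no lambda wrapper is needed.
df['my_column'] = df['text'].apply(substring)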
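The examples in the question keep the .myword suffix in the extracted text, while the slice above stops just before it. If the suffix is wanted, a variant along these lines might work. This is a sketch only: substring_with_suffix and its parameters are hypothetical names, not part of the original answer.

```python
def substring_with_suffix(text, start_char=">", suffix=".myword"):
    """Return the text after the first start_char up to and including suffix.

    Hypothetical helper: returns None when either marker is missing.
    """
    start = text.find(start_char)
    if start == -1:
        return None
    # Look for the suffix only after the start marker.
    end = text.find(suffix, start + 1)
    if end == -1:
        return None
    # Extend the slice by the suffix length so ".myword" is kept.
    return text[start + 1:end + len(suffix)]


result1 = substring_with_suffix("123abc >I want this1.myword")
result2 = substring_with_suffix("123<>I want this2.myword<>")
result3 = substring_with_suffix("no markers here")
print(result1, result2, result3, sep="\n")
```

It can be applied to a column the same way, for example df['text'].apply(substring_with_suffix).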

Answer 3

$ grep -Po '(?<=>)[^<$]+' <<EOF
123abc >I want this1.myword
123<>I want this2.myword<>
EOF

I want this1.myword
I want this2.myword

(?<=) positive lookbehind
[^] negated character set
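Since the question is about pandas rather than the shell, the same pattern can be tried in Python, whose re module also supports fixed-width lookbehind. The following is a rough sketch and the variable names are illustrative. Inside the brackets $ is a literal character, so [^<$]+ means "one or more characters that are neither < nor $".

```python
import re

# Same pattern as the grep call: start right after a '>' and
# consume everything up to the next '<' (or a literal '$').
pattern = re.compile(r'(?<=>)[^<$]+')

lines = [
    "123abc >I want this1.myword",
    "123<>I want this2.myword<>",
]

# re.search returns the first match per line, or None if there is none.
matches = [pattern.search(line) for line in lines]
extracted = [m.group() if m else None for m in matches]

for item in extracted:
    print(item)
```

Unlike the other answers, this pattern does not check for .myword at all. It relies on the wanted text ending at < or at the end of the line.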
