Question

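This question is a follow-up to my previous question, How to extract only uppercase substring from pandas series?

Rather than editing the old question, I decided to ask a new one.

My goal is to extract the aggregation method agg and the feature name feat from the column named item.

Here is the problem: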


import numpy as np
import pandas as pd


df = pd.DataFrame({'item': ['num','bool', 'cat', 'cat.COUNT(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})


regexp = (r'(?P<agg>) '     # agg is the word in uppercase (all other substring is lowercased)
         r'(?P<feat>), '   # 1. if there is no uppercase, whole string is feat
                           # 2. if there is uppercase the substring after example. is feat
                           # e.g. cat ==> cat
                           # cat.N_MOST_COMMON(example.ord)[2] ==> ord
                  
        )

df[['agg','feat']] = df['item'].str.extract(regexp,expand=True)

# I am not sure how to build up regexp here.


print(df)

"""
Required output


                                item  agg            feat
0                                num                 num
1                               bool                 bool
2                                cat                 cat
3                 cat.COUNT(example)  COUNT                   # note: here feat is empty
4  cat.N_MOST_COMMON(example.ord)[2]  N_MOST_COMMON  ord
5             cat.FIRST(example.ord)  FIRST          ord
6             cat.FIRST(example.num)  FIRST          num
""";

Answer 1

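For feat, since you already got the answer for agg in your other StackOverflow question, I think you can use the following to extract two different series based on two different patterns, separated by |, and then fillna() one series with the other.

^([^A-Z]*$) should return the full string only if the full string is lowercase.
[^a-z].*example\.([a-z]+)\).*$ should return the string after example. and before ) only if there is uppercase in the string before example.

import pandas as pd

df = pd.DataFrame({'item': ['num','bool', 'cat', 'cat.COUNT(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})

# Group 0 captures all-lowercase strings; group 1 captures the part after "example."
s = df['item'].str.extract(r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)

# Combine the two groups; rows that match neither pattern get an empty string
df['feat'] = s[0].fillna(s[1]).fillna('')
df

Out[1]:
                                item  feat
0                                num   num
1                               bool  bool
2                                cat   cat
3                 cat.COUNT(example)
4  cat.N_MOST_COMMON(example.ord)[2]   ord
5             cat.FIRST(example.ord)   ord
6             cat.FIRST(example.num)   num

The above gives you the output you are looking for on the sample data and holds to your conditions. However:

What if there is uppercase after example.? The current output would return ''.

See example 2 below, where some of the data is changed according to the point above: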
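A minimal sketch of that edge case follows. The rows are hypothetical, chosen only so that uppercase letters follow example.; they are not the answer's original data. The name df2 and the relaxed pattern are likewise assumptions added for illustration. With the pattern from above, [a-z]+ cannot match the uppercase text, so neither group matches and feat falls back to ''. Widening the class to [A-Za-z]+ is one possible way to capture such values, if that is the behavior you want.

```python
import pandas as pd

# Hypothetical rows: uppercase appears after "example." in rows 2 and 3
df2 = pd.DataFrame({'item': ['num',
                             'cat.COUNT(example)',
                             'cat.FIRST(example.ORD)',
                             'cat.N_MOST_COMMON(example.Num)[2]',
                             'cat.FIRST(example.num)']})

# Pattern from the answer: only lowercase is accepted after "example."
pattern = r'^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$'
s = df2['item'].str.extract(pattern, expand=True)
df2['feat'] = s[0].fillna(s[1]).fillna('')

# Assumed variant (not from the answer): also accept uppercase after "example."
relaxed = r'^([^A-Z]*$)|[^a-z].*example\.([A-Za-z]+)\).*$'
s2 = df2['item'].str.extract(relaxed, expand=True)
df2['feat_relaxed'] = s2[0].fillna(s2[1]).fillna('')

print(df2)
```

Here feat is empty for the two rows with uppercase after example., while feat_relaxed keeps ORD and Num.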

How to extract uppercase and some substring from a pandas dataframe using extract?

1 answer: