Parsing a string based on order of occurrence

Time: 2017-07-11 19:06:27

Tags: python pandas numpy

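I have data like SampleDf below, and I am trying to write code that finds whichever of 'Avg', 'Sum', or 'Count' occurs first in each string and puts it in a new column 'Agg'. My code below almost does this, but it has a hierarchy: if Count comes before Sum, it still puts Sum in the 'Agg' column. OutputDf shows what I would like to get.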

Sample Data:

import pandas as pd

SampleDf = pd.DataFrame([
    ['tom', "Avg(case when Value1 in ('Value2') and [DateType] in ('Value3') then LOS end)"],
    ['bob', "isnull(Sum(case when XferToValue2 in (1) and DateType in ('Value3') and  [Value1] in ('HM') then  Count(LOS) end),0)"],
], columns=['ReportField', 'OtherField'])

Sample Output:

OutputDf=pd.DataFrame([['tom',"Avg(case when Value1 in ('Value2') and [DateType] in ('Value3') then LOS end)",'Avg'],['bob',"isnull(Sum(case when XferToValue2 in (1) and DateType in ('Value3') and  [Value1] in ('HM') then  Count(LOS) end),0)",'Sum']],columns=['ReportField','OtherField','Agg'])


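Code:

import numpy as np

# The nested np.where calls check Sum first, then Count, then Avg,
# so the result follows this fixed priority rather than order of occurrence.
SampleDf['Agg'] = np.where(SampleDf.OtherField.str.contains("Sum"), "Sum",
                  np.where(SampleDf.OtherField.str.contains("Count"), "Count",
                  np.where(SampleDf.OtherField.str.contains("Avg"), "Avg", "Nothing")))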

1 Answer:

Answer 0: (score: 1)

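A quick and dirty attempt at this problem would be to write a function that returns:
- whichever of the terms of interest, i.e. ['Avg', 'Sum', 'Count'], occurs first, if one is present in the string
- or None, if no such term is present:

import re

terms = ['Avg', 'Sum', 'Count']

def extractTerms(s, t=terms):
    # Replace non-word characters and digits with spaces, then split into tokens
    s_clean = re.sub(r"[^\w]|[\d]", " ", s).split()
    # Keep only the tokens that are terms of interest, in order of appearance
    s_array = [w for w in s_clean if w in t]
    try:
        return s_array[0]
    except IndexError:
        return None

Demonstrating the case where the terms are in the string: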

SampleDf['Agg'] = SampleDf['OtherField'].apply(lambda s: extractTerms(s))
SampleDf

ReportField OtherField  Agg
0   tom Avg(case when Value1 in ('Value2') and [DateType] in ('Value3') then LOS end)   Avg
1   bob isnull(Sum(case when XferToValue2 in (1) and DateType in ('Value3') and [Value1] in ('HM') then Count(LOS) end),0)  Sum
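To see why this approach follows order of occurrence rather than a fixed priority, it can help to look at the intermediate token list for the 'bob' row. The snippet below is only an illustrative sketch: it repeats the same two steps the function uses (the regular-expression cleanup and the membership filter) on one string, and the variable names `tokens` and `hits` are made up for this example.

```python
import re

terms = ['Avg', 'Sum', 'Count']
s = ("isnull(Sum(case when XferToValue2 in (1) and DateType in ('Value3') "
     "and  [Value1] in ('HM') then  Count(LOS) end),0)")

# Replace every non-word character and every digit with a space,
# then split on whitespace to get the words in their original order.
tokens = re.sub(r"[^\w]|[\d]", " ", s).split()

# Keep only the tokens that are exactly one of the terms, preserving order.
hits = [w for w in tokens if w in terms]

print(tokens[:4])  # ['isnull', 'Sum', 'case', 'when']
print(hits)        # ['Sum', 'Count']
```

Because the list comprehension walks the tokens left to right, the first element of `hits` is the term that appears earliest in the string, which is 'Sum' here even though 'Count' also occurs.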

Demonstrating the case where the terms are not in the string (OtherField in row 0 is 'foo'):

SampleDf['Agg'] = SampleDf['OtherField'].apply(lambda s: extractTerms(s))
SampleDf

ReportField OtherField  Agg
0   tom foo None
1   bob isnull(Sum(case when XferToValue2 in (1) and DateType in ('Value3') and [Value1] in ('HM') then Count(LOS) end),0)  Sum
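As a possible alternative, not part of the original answer, the same idea can be sketched with pandas' vectorized `Series.str.extract`, which returns the first match of a capture group in each string. The frame name `df` and the third row ('ann', 'foo') are assumptions added here to show the no-match case. Note one difference in behavior: the `\b` word boundaries require the term to stand alone, so a token such as `Sum2` would not match, whereas the function above strips digits first.

```python
import pandas as pd

df = pd.DataFrame([
    ['tom', "Avg(case when Value1 in ('Value2') and [DateType] in ('Value3') then LOS end)"],
    ['bob', "isnull(Sum(case when XferToValue2 in (1) and DateType in ('Value3') "
            "and  [Value1] in ('HM') then  Count(LOS) end),0)"],
    ['ann', "foo"],  # hypothetical row with none of the terms
], columns=['ReportField', 'OtherField'])

# The regex is scanned left to right, so the captured group is whichever
# term occurs first in the string; rows without a match get NaN.
df['Agg'] = df['OtherField'].str.extract(r'\b(Avg|Sum|Count)\b', expand=False)

print(df[['ReportField', 'Agg']])
```

Rows with no match hold NaN rather than None; `df['Agg'].fillna('Nothing')` would reproduce the "Nothing" default from the question's code if that is preferred.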
