Pandas: check which substring is in a string column

Time: 2019-04-18 09:21:14

Tags: python pandas dataframe

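I'm trying to write a function that adds a new column to a pandas DataFrame. It should work out which substring from a list occurs in a string column, and store that substring in the new column.

The problem is that the text to find does not appear at the same position in the variable x.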

import pandas as pd

df = pd.DataFrame({'x': ["var_m500_0_somevartext", "var_m500_0_vartextagain",
                         "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
                   'x1': [4, 5, 6, 8]})

finds = ["m500_0","0_500","m150_0"]

Which of the entries in finds occurs in a given row of df["x"]?

I've written a function that works, but it is very slow on large datasets:

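import re

def pd_create_substring_var(df, new_var_name="new_var", substring_list=("1",), var_ori="x"):
    # Start with the placeholder "na"; rows with no match keep it
    df[new_var_name] = "na"
    new_col = df.columns.get_loc(new_var_name)
    # Row-by-row Python loop: this is the slow part on large frames
    for ix in range(len(df)):
        text = df.iloc[ix][var_ori]
        for find in substring_list:
            # Each entry is used as a regular expression pattern
            for m in re.finditer(find, text):
                df.iat[ix, new_col] = text[m.start():m.end()]
    return df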


df = pd_create_substring_var(df,"t",finds,var_ori="x")

df 
                            x  x1       t
0      var_m500_0_somevartext   4  m500_0
1     var_m500_0_vartextagain   5  m500_0
2  varwithsomeothertext_0_500   6   0_500
3   varwithsomext_m150_0_text   8  m150_0

5 answers:

Answer 0 (score: 3)

Does this do what you need?

finds = ["m500_0", "0_500", "m150_0"]
df["t"] = df["x"].str.extract(f"({'|'.join(finds)})")
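One thing to keep in mind with this approach: the substrings are joined into a regular expression as-is, so any regex metacharacters in them (such as "." or "+") would be interpreted as pattern syntax. Here is a minimal sketch, assuming the substrings should be matched literally. The use of re.escape and expand=False is my addition, not part of the answer above.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "x": ["var_m500_0_somevartext", "var_m500_0_vartextagain",
          "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
    "x1": [4, 5, 6, 8],
})
finds = ["m500_0", "0_500", "m150_0"]

# Escape each substring so it is matched literally, then OR them together
# inside a single capture group.
pattern = "(" + "|".join(re.escape(f) for f in finds) + ")"

# expand=False returns a Series for a single capture group.
df["t"] = df["x"].str.extract(pattern, expand=False)
print(df)
```

Rows where none of the substrings occur end up as NaN in the new column.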

Answer 1 (score: 1)

Probably not the best way:

df['t'] = df['x'].apply(lambda x: ''.join([i for i in finds if i in x]))

Now:

print(df)

gives:

                            x  x1       t
0      var_m500_0_somevartext   4  m500_0
1     var_m500_0_vartextagain   5  m500_0
2  varwithsomeothertext_0_500   6   0_500
3   varwithsomext_m150_0_text   8  m150_0

Now, just to add to @pythonjokeun's answer, you can also do:

df["t"] = df["x"].str.extract("(%s)" % '|'.join(finds))

Or:

df["t"] = df["x"].str.extract("({})".format('|'.join(finds)))
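The two approaches in this answer agree on the sample data, but they can differ on other inputs. The sketch below uses two hypothetical rows (not from the question) to illustrate: one containing two of the listed substrings and one containing none.

```python
import pandas as pd

finds = ["m500_0", "0_500", "m150_0"]

# Hypothetical rows for illustration only: the first contains two of the
# substrings, the second contains none of them.
s = pd.Series(["a_m500_0_b_m150_0", "no_match_here"])

# apply + join concatenates every substring found, in the order of `finds`,
# and yields an empty string when nothing is found.
joined = s.apply(lambda x: "".join(f for f in finds if f in x))

# str.extract keeps only the leftmost regex match and yields NaN otherwise.
extracted = s.str.extract("({})".format("|".join(finds)), expand=False)

print(joined.tolist())
print(extracted.tolist())
```

Which behaviour is preferable depends on whether a row is expected to contain more than one of the substrings.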

Answer 2 (score: 1)

I don't know how large your dataset is, but you can use the map function like this:

import pandas

def subset_df_test():
    df = pandas.DataFrame({'x': ["var_m500_0_somevartext", "var_m500_0_vartextagain",
                                 "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
                           'x1': [4, 5, 6, 8]})

    finds = ["m500_0", "0_500", "m150_0"]
    df['t'] = df['x'].map(lambda x: compare(x, finds))
    print(df)

def compare(x, finds):
    # Return the first entry of finds contained in x (None if there is none)
    for f in finds:
        if f in x:
            return f
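A compact variant of the same idea, as a sketch: the helper name first_match and the "na" default are hypothetical (the default mirrors the placeholder used in the question's function), and the second Series is made up to show the no-match case.

```python
import pandas as pd

def first_match(text, candidates, default="na"):
    # Return the first candidate (in list order) contained in text,
    # or the default when none of them occurs.
    return next((c for c in candidates if c in text), default)

df = pd.DataFrame({
    "x": ["var_m500_0_somevartext", "var_m500_0_vartextagain",
          "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
    "x1": [4, 5, 6, 8],
})
finds = ["m500_0", "0_500", "m150_0"]

df["t"] = df["x"].map(lambda x: first_match(x, finds))

# Hypothetical extra input with no matching substring
unmatched = pd.Series(["nothing_here"]).map(lambda x: first_match(x, finds))
print(df)
```

Because this uses plain `in` checks rather than regular expressions, the substrings are always matched literally.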

Answer 3 (score: 1)

Use pandas.Series.str.findall:

df['x'].str.findall("|".join(finds))

0    [m500_0]
1    [m500_0]
2     [0_500]
3    [m150_0]
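Note that findall returns a list per row rather than a single string. If a plain string column is wanted, one option is to take the first element of each list. This is a sketch; the column names t and n_matches are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "x": ["var_m500_0_somevartext", "var_m500_0_vartextagain",
          "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
    "x1": [4, 5, 6, 8],
})
finds = ["m500_0", "0_500", "m150_0"]

# One list of all non-overlapping matches per row
matches = df["x"].str.findall("|".join(finds))

# .str[0] picks the first element of each list (NaN for an empty list)
df["t"] = matches.str[0]

# .str.len() also works on lists and gives the number of matches per row
df["n_matches"] = matches.str.len()
print(df)
```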

Answer 4 (score: 0)

Try this
