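I have a string column ('b') and want to get the strings that are substrings of other strings in the same column. For example, in column 'b' of the dataframe below, world is a substring of helloworld, and ness is a substring of greatness. I want to get the strings world and ness in a list. Could you please suggest a solution?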
      a           b
0  test       world
1  teat  helloworld
2   gor         bye
3   jhr   greatness
4   fre        ness
Desired output as a list:
listofsubstrings
Out[353]: ['world', 'ness']
Answer 0: (score: 1)
You can use:
import pandas as pd
from itertools import product
# get unique values only
b = df.b.unique()
# create all combinations (Cartesian product of b with itself)
df1 = pd.DataFrame(list(product(b, b)), columns=['a', 'b'])
# keep rows where a is a substring of a different string b
df1 = df1[df1.apply(lambda x: x.a in x.b, axis=1) & (df1.a != df1.b)]
print (df1)
        a           b
1   world  helloworld
23   ness   greatness
print (df1.a.tolist())
['world', 'ness']
Alternative solution with a cross join:
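# helper key so that merging on it pairs every row with every row
df['tmp'] = 1
df1 = pd.merge(df[['b','tmp']],df[['b','tmp']], on='tmp')
# keep rows where b_x is a substring of a different string b_y
df1 = df1[df1.apply(lambda x: x.b_x in x.b_y, axis=1) & (df1.b_x != df1.b_y)]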
print (df1)
      b_x  tmp         b_y
1   world    1  helloworld
23   ness    1   greatness
print (df1.b_x.tolist())
['world', 'ness']
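On pandas 1.2 or later, the temporary key column should not be needed, because merge accepts how='cross'. The following is a minimal sketch of that variant, not part of the original answer; it assumes the example frame from the question, and the names u, pairs and mask are only illustrative.

```python
import pandas as pd

# example frame from the question (rebuilt here so the sketch is self-contained)
df = pd.DataFrame({
    'a': ['test', 'teat', 'gor', 'jhr', 'fre'],
    'b': ['world', 'helloworld', 'bye', 'greatness', 'ness'],
})

# unique values only, so a repeated string is not matched against its own copy
u = df[['b']].drop_duplicates()

# cross join: every value of b paired with every value of b
pairs = u.merge(u, how='cross', suffixes=('_x', '_y'))

# keep pairs where b_x is contained in a different string b_y
mask = [x != y and x in y for x, y in zip(pairs['b_x'], pairs['b_y'])]
listofsubstrings = pairs.loc[mask, 'b_x'].drop_duplicates().tolist()
print(listofsubstrings)
```

As with the merge on tmp, the intermediate frame has one row per pair of values, so memory grows quadratically with the number of unique strings.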
Answer 1: (score: 0)
This might work for you:
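import numpy as np
import pandas as pd
# cell (i, j) holds "b[j] b[i]"; .values is needed because recent pandas rejects df.b[:, None]
df_cross = pd.DataFrame(data=np.asarray(df.b) + " " + df.b.values[:, None], columns=df.b)
# True where the column's string occurs in the row's string (assumes values contain no whitespace)
# on pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap
df_indicator = df_cross.applymap(lambda x: x.split()[0] in x.split()[1])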
df_indicator.sum(axis=0)[lambda x: x>1].index
Out[231]: Index([u'world', u'ness'], dtype='object')
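The split-based check above relies on no value containing whitespace. A possible variant without that assumption is sketched below, using Series.str.contains with regex=False so each value is matched as a literal substring. This is only a sketch under the question's example data; the names u, counts and result are illustrative.

```python
import pandas as pd

# column b from the question (rebuilt here so the sketch is self-contained)
df = pd.DataFrame({'b': ['world', 'helloworld', 'bye', 'greatness', 'ness']})

# unique, non-null values of the column
u = pd.Series(df['b'].dropna().unique())

# for each value, count how many values contain it as a literal substring
counts = {s: int(u.str.contains(s, regex=False).sum()) for s in u}

# every string contains itself, so a count above 1 means another string contains it
result = [s for s, n in counts.items() if n > 1]
print(result)
```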
Answer 2: (score: 0)
We can create an array of truth values that is True where the row label is a substring of the column label.
import numpy as np
import pandas as pd
l = df.b.dropna().values # grab values from b
# double comprehension: a[row, col] is True when l[row] is in l[col]
a = np.array([[j in i for i in l] for j in l])
# of course strings are substrings of themselves
# let's ignore them by making the diagonal `False`
np.fill_diagonal(a, False)
# find the indices where the array is `True`
i, j = np.where(a)
l[i].tolist()
['world', 'ness']
Better, IMO:
s = pd.Series(l[i], l[j])
s
helloworld    world
greatness      ness
dtype: object
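For a small column, the same pairwise check can also be written without NumPy. The helper below is a hypothetical sketch (find_substrings is not a pandas or NumPy function) that takes any iterable of strings, for example df.b.dropna(), and returns the values contained in some other value.

```python
def find_substrings(values):
    """Return the strings that occur inside some other, different string."""
    # drop duplicates while keeping first-seen order
    uniq = list(dict.fromkeys(values))
    # keep s if at least one different string t contains it
    return [s for s in uniq if any(s != t and s in t for t in uniq)]


words = ['world', 'helloworld', 'bye', 'greatness', 'ness']
print(find_substrings(words))
```

Like the array version, this compares every pair of unique strings, so the cost is quadratic in the number of unique values.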