Pandas: extract the common substring from one column based on a condition in another column

Time: 2019-10-11 10:06:14

Tags: pandas

I have a data frame; sample rows are in the first table below.


I want to reduce each "description" to the substring common to all rows that share the same "code", and then drop the duplicate rows.

code    description          col3        col4
123456  nice shoes size4     something   something
123456  nice shoes size5     something   something
567890  boots size 1         something   something
567890  boots size 2         something   something
567890  boots size 3         something   something
234567  baby overall 2yrs    something.  something
234567  baby overall 3-4yrs  something   something
456778  shirt m              Something.  Something
456778  shirt l              something   Something
456778  shirt xl             Something   Something

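I suspect this needs a groupby, perhaps with apply and a custom function, but I can't get my head around it. I found a function, but it takes exactly two strings, while my data may have five rows with the same code, so I'm not sure it helps. The result I am after: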

code    description          col3        col4
123456  nice shoes          something   something
567890  boots               something   something
234567  baby overall        something   something
456778  shirt               Something   Something

Thanks for any help.

1 answer:

Answer 0: (score: 0)

You need pandas 0.25.0 or later to use explode:

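import pandas as pd

# True for rows whose code occurs more than once; codes that occur once keep their description.
mask = df.groupby('code')['code'].transform('size') > 1
df1 = df[mask]
df2 = df[~mask]
# Join each code's descriptions, split them into words, and put one word per row (indexed by code).
s = df1.groupby('code', sort=False)['description'].apply(lambda x: ' '.join(x).split(' ')).explode()
# Keep words that occur more than once (counted over the whole Series, not per code),
# drop the repeats, and rebuild one string per code.
s_not_duplicates = s.to_frame()[s.map(s.value_counts() > 1)].drop_duplicates().groupby(level=0)['description'].apply(lambda x: ' '.join(x))
# Index the single-occurrence rows by code as well, so both parts share the same index.
description_not_duplicates = pd.concat([s_not_duplicates, df2.set_index('code')['description']])
print(description_not_duplicates)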

123456      nice shoes
234567    baby overall
456778           shirt
567890      boots size
Name: description, dtype: object
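
Note that the word counts above are taken over the whole column rather than per code, so a word that happens to appear in two different products could slip through. As an alternative, here is a minimal sketch that compares words only within each code and keeps the leading words all rows agree on. It assumes the shared part always sits at the start of the description, which holds for the sample but may not hold for real data. `common_word_prefix` is a hypothetical helper name, not a pandas function, and col3/col4 are left out for brevity.

```python
import pandas as pd

# Sample rows copied from the question (col3/col4 omitted for brevity).
df = pd.DataFrame({
    "code": [123456, 123456, 567890, 567890, 567890,
             234567, 234567, 456778, 456778, 456778],
    "description": ["nice shoes size4", "nice shoes size5",
                    "boots size 1", "boots size 2", "boots size 3",
                    "baby overall 2yrs", "baby overall 3-4yrs",
                    "shirt m", "shirt l", "shirt xl"],
})


def common_word_prefix(descriptions):
    """Return the leading words shared by every description in the group."""
    word_lists = [d.split() for d in descriptions]
    prefix = []
    # zip(*word_lists) walks the descriptions word by word and stops
    # at the end of the shortest one.
    for words in zip(*word_lists):
        if len(set(words)) != 1:  # the descriptions diverge at this position
            break
        prefix.append(words[0])
    return " ".join(prefix)


# One row per code, in order of first appearance.
result = (
    df.groupby("code", sort=False)["description"]
      .apply(common_word_prefix)
      .reset_index()
)
print(result)
```

This works for any number of rows per code, not just two. Like the code above, it returns "boots size" for 567890, because "size" is a separate word shared by all three rows; getting plain "boots" would need an extra rule such as dropping a trailing "size".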