我有一个看起来像这样的数据框。
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense
classifier = Sequential()
classifier.add(Conv2D(32, (3, 3), input_shape = (100, 100, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))
classifier.add(Conv2D(32, (3, 3), activation = 'relu'))
classifier.add(MaxPooling2D(pool_size = (2, 2)))
classifier.add(Flatten())
classifier.add(Dense(units = 128, activation = 'relu'))
classifier.add(Dense(units = 10, activation = 'softmax'))
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
from keras.preprocessing.image import ImageDataGenerator
train_datagen = ImageDataGenerator(rescale = 1./255,
shear_range = 0.2,
zoom_range = 0.2,
horizontal_flip = True)
test_datagen = ImageDataGenerator(rescale = 1./255)
training_set = train_datagen.flow_from_directory('sgn_dataset/train_set',
target_size = (100,100),
batch_size = 32,
class_mode = 'categorical')
test_set = test_datagen.flow_from_directory('sgn_dataset/test_set',
target_size = (100, 100),
batch_size = 32,
class_mode = 'categorical')
classifier.fit_generator(training_set,
steps_per_epoch = 1534,
epochs = 15,
validation_data = test_set,
validation_steps = 548)
我想根据相似的“代码”列将“描述”简化为通用子字符串。并放置重复项。
code description col3 col4
123456 nice shoes size4 something something
123456 nice shoes size5 something something
567890 boots size 1 something something
567890 boots size 2 something something
567890 boots size 3 something something
234567 baby overall 2yrs something. something
234567 baby overall 3-4yrs something something
456778 shirt m Something. Something
456778 shirt l something Something
456778 shirt xl Something Something
我怀疑需要分组,也许可以应用一个功能,但无法解决这个问题。 找到了一个函数,但是需要2个字符串。不确定是否 可能会有所帮助。而且此函数只需要2个字符串,而我的数据可能有5行具有相同的代码...
code description col3 col4
123456 nice shoes something something
567890 boots something something
234567 baby overall something something
456778 shirt Something Something
感谢所有提供的帮助。
答案 0 :(得分:0)
您需要熊猫 0.25.1 才能使用explode
mask=(df.groupby('code')['code'].transform('size')>1)
df1=df[mask]
df2=df[~mask]
s=df1.groupby('code',sort=False)['description'].apply(lambda x: ' '.join(x).split(' ')).explode()
s_not_duplicates=s.to_frame()[s.map(s.value_counts()>1)].drop_duplicates().groupby(level=0)['description'].apply(lambda x: ' '.join(x))
description_not_duplicates=pd.concat([s_not_duplicates,df2.description])
print(description_not_duplicates)
123456 nice shoes
234567 baby overall
456778 shirt
567890 boots size
Name: description, dtype: object