I have a DataFrame `df`:
member_id,event_type,event_path,event_time,event_date,event_duration
20077,2016-11-20,"2016-11-20 09:17:07",url,e.mail.ru/message/14794236680000000730/,0
20077,2016-11-20,"2016-11-20 09:17:07",url,e.mail.ru/message/14794236680000000730/,2
20077,2016-11-20,"2016-11-20 09:17:09",url,avito.ru/profile/messenger/channel/u2i-558928587-101700461?utm_source=avito_mail&utm_medium=email&utm_campaign=messenger_single&utm_content=test,1
20077,2016-11-20,"2016-11-20 09:17:37",url,avito.ru/auto/messenger/channel/u2i-558928587-101700461?utm_source=avito_mail&utm_medium=email&utm_campaign=messenger_single&utm_content=test,135
20077,2016-11-20,"2016-11-20 09:19:53",url,e.mail.ru/message/14794236680000000730/,0
20077,2016-11-20,"2016-11-20 09:19:53",url,e.mail.ru/message/14794236680000000730/,37
and another one, `df2`:
domain         category         subcategory      unique id  count_sec  Main category   Subcategory
avito.ru/auto  Автомобили       Авто             1600       83112396   Auto            Avito
youtube.com    Видеопортал      Видеохостинг     1317       42710996   Video           Youtube
ok.ru          Развлечения      Социальные сети  694        13394605   Social network  OK
kinogo.club    Развлечения      Кино             497        8438800    Video           Illegal
e.mail.ru      Почтовый сервис  None             1124       8428984    Mail.ru         Email
vk.com/audio   Видеопортал      Видеохостинг     1020       7409440    Music           VK
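Usually I use:
df['category'] = df.event_date.map(df2.set_index('domain')['Main category'])
But that compares whole values: only when they are exactly equal does it take the value and put it in the new column. How can I do the same thing when the domain only needs to be a substring of the string?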
Answer 0: (score: 0)
I don't really know exactly what you are trying to do, but my suggestion is something like this:
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
num_imgs = 20
datagen = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')
img = load_img('data/train/cats/cat.0.jpg')  # this is a PIL image
x = img_to_array(img)  # this is a Numpy array with shape (3, 150, 150)
x = x.reshape((1,) + x.shape)  # this is a Numpy array with shape (1, 3, 150, 150)
# the .flow() command below generates batches of randomly transformed images
# and saves the results to the `preview/` directory
i = 0
for batch in datagen.flow(x, batch_size=1,
                          save_to_dir='preview', save_prefix='cat', save_format='jpeg'):
    i += 1
    if i > num_imgs:
        break  # otherwise the generator would loop indefinitely
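Test it on a subset of df first, since it may take a while depending on how much data you have.
Answer 1: (score: 0)
Without some heuristic for finding the fuzzy matches to join on, you won't get a scalable solution, because you would need to generate O(N²) comparisons.
For your specific use case, I suggest extracting the part of the URL that you actually want to compare. Perhaps something like: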
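from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

def netloc(s):
    # Prepend a scheme so that urlparse recognises the host part
    return urlparse('http://' + s).netloc

# The URLs live in the event_date column, as in the question's own snippet
df['netloc'] = df['event_date'].apply(netloc)
df2['netloc'] = df2['domain'].apply(netloc)
merged = df.merge(df2, how='left', on='netloc')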
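Note that the host part alone drops any path, so `avito.ru/auto` and `vk.com/audio` collapse to `avito.ru` and `vk.com`. If the path matters, one possible alternative is to pull the matching `df2` domain out of each URL with a regular expression and then reuse the exact-match `map` from the question. The following is only a minimal sketch under some assumptions: the two small frames are made-up stand-ins for `df` and `df2`, the URLs are assumed to sit in the `event_date` column as in the question's snippet, and the domain is assumed to appear at the start of the URL (drop the `^` for a true substring match).

```python
import re
import pandas as pd

# Hypothetical miniature stand-ins for the question's df and df2.
df = pd.DataFrame({
    "event_date": [
        "e.mail.ru/message/14794236680000000730/",
        "avito.ru/profile/messenger/channel/u2i-558928587-101700461",
        "avito.ru/auto/messenger/channel/u2i-558928587-101700461",
    ]
})
df2 = pd.DataFrame({
    "domain": ["avito.ru/auto", "e.mail.ru", "vk.com/audio"],
    "Main category": ["Auto", "Mail.ru", "Music"],
})

# Put longer domains first so the most specific alternative is tried first.
domains = sorted(df2["domain"], key=len, reverse=True)
pattern = "^(" + "|".join(re.escape(d) for d in domains) + ")"

# Extract the matching domain prefix; rows with no match get NaN.
df["domain"] = df["event_date"].str.extract(pattern, expand=False)

# With an exact key available, the lookup from the question works as before.
df["category"] = df["domain"].map(df2.set_index("domain")["Main category"])
```

This does a single regex pass over `df` instead of comparing every URL with every domain in Python, but the alternation grows with the number of rows in `df2`, so it is worth timing on a sample first. URLs that match none of the listed domains (the `avito.ru/profile/...` row here) are left as NaN.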