Question

I have a list of activities, popular_activities, as follows:

['google.co.uk', 'whatsapp', 'sharelatex.com', 'Financial Times', 'other',
'en.wikipedia.org', 'Instagram for Android', 'YouTube for Android',
'arxiv-sanity.com', 'quora.com', 'microsoft word', 'Inbox by Gmail',
'Google Chrome for Android', 'youtube.com', 'mendeley desktop',
'web.whatsapp.com', 'Preview', 'texshop', 'Google Now',
'mobile - com.compassnews.app', 'netflix.com', 'WhatsApp Messenger Android',
'Facebook for Android', 'arxiv.org']

I also have a DataFrame, as follows:

                       Activity                           Time Spent (seconds)
Date                                                                       
2017-03-25T00:05:00    [netflix.com, other, Google Now]   [30, 6, 2]
2017-03-25T00:10:00    [netflix.com]                      [300]
2017-03-25T00:15:00    [netflix.com]                      [102]   
2017-03-25T00:30:00    [netflix.com]                      [232]   
2017-03-25T00:35:00    [netflix.com]                      [279]

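I want to create a new column, 'Activity_vector', in this DataFrame. Each element of that column should be a vector whose length equals that of popular_activities, where the index of each activity (as given in the popular_activities array) holds the time spent on that activity.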

So, for example, for the row at Date 2017-03-25T00:05:00, the corresponding element in the new column 'Activity_vector' would be:

[0, 0, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 30, 0, 0, 0]

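That is, the 'Activity' netflix.com, which has a 'Time Spent (seconds)' value of 30 and sits at index 20 in popular_activities, fills index 20 of the array with the value 30; the 'Activity' other fills its corresponding index (4) with the value 6; and likewise for Google Now.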

The snippet I have is below, where self.clean_df is the DataFrame in question:

class Clean_DF(object):
....
    def clean_data(self, time_percentage):
    ....
        self.clean_df['Activity_vector'] = self.clean_df.apply(lambda x: self.activity_to_vector(x),axis=1)

    def activity_to_vector(self, row):
        vect = np.zeros(len(self.popular_apps))
        for x,y in zip(row['Activity'], row['Time Spent (seconds)']):
            vect[self.popular_apps_dict[x]] += vect[self.popular_apps_dict[x]] + y
        return vect

However, when I run this, I get the following error:

ValueError: Shape of passed values is (3862, 24), indices imply (3862, 2)

How can I fix this error, or write a function that solves my problem?

Answer 1

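num = len(popular_activities)

def make_array(s):
    # One slot per entry in popular_activities, initialised to zero.
    z = [0] * num
    time = s['Time Spent (seconds)']
    for i, val in enumerate(s['Activity']):
        # Position of this activity in popular_activities.
        idx = popular_activities.index(val)
        z[idx] = time[i]
    return z

# Apply row by row and store the resulting lists in the new column.
df['Activity_vector'] = df.apply(make_array, axis=1)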
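
A variant of the same idea, offered only as a sketch: it precomputes a name-to-index dictionary instead of calling list.index for every activity, and it adds times with += so an activity that appears twice in one row is summed. The short popular_activities list here is an abbreviated stand-in for the full 24-entry list, and the names activity_index and to_vector are hypothetical. Skipping activities that are missing from the list is an assumption; the question does not say how those should be handled.

```python
# Abbreviated stand-in for the full 24-entry popular_activities list.
popular_activities = ['google.co.uk', 'whatsapp', 'other', 'Google Now', 'netflix.com']

# Build the name -> position lookup once, so each lookup is a dict access.
activity_index = {name: i for i, name in enumerate(popular_activities)}

def to_vector(activities, times):
    """Return a list of time spent per activity, aligned to popular_activities."""
    vect = [0] * len(popular_activities)
    for name, seconds in zip(activities, times):
        idx = activity_index.get(name)
        if idx is not None:  # assumption: ignore activities not in the list
            vect[idx] += seconds
    return vect

# First row of the example DataFrame from the question.
example = to_vector(['netflix.com', 'other', 'Google Now'], [30, 6, 2])
print(example)  # [0, 0, 6, 2, 30]
```

On the DataFrame itself, something like df['Activity_vector'] = [to_vector(a, t) for a, t in zip(df['Activity'], df['Time Spent (seconds)'])] should build the column without going through apply. The ValueError in the question most likely arises because apply tries to spread the returned NumPy array across columns, so this may also sidestep it.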
