Python: interaction between columns and rows

Time: 2017-05-09 09:03:19

Tags: python pandas

I have the following dataframe:

      topic  student level week
        1      a       1     1
        1      b       2     1
        1      a       3     1
        2      a       1     2
        2      b       2     2
        2      a       3     2
        2      b       4     2
        3      c       1     2
        3      b       2     2
        3      c       3     2
        3      a       4     2
        3      b       5     2

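It contains a column `level` that specifies who started a topic and who replied to whom. If a student's level is 1, he asked the question. If a student's level is 2, he replied to the student who asked the question. If a student's level is 3, he replied to the student at level 2, and so on.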
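To make the example reproducible, the table can be rebuilt as a pandas frame. This is a minimal sketch that copies the rows shown in the question; the helper name `starters` is only illustrative and is not part of the original post.

```python
import pandas as pd

# Rows copied from the table in the question.
df = pd.DataFrame({
    "topic":   [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    "student": list("abaababcbcab"),
    "level":   [1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5],
    "week":    [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2],
})

# Level 1 marks the student who opened each topic (illustrative helper).
starters = df.loc[df["level"] == 1, ["topic", "student"]]
print(starters)
```

Each topic has exactly one level-1 row, so `starters` holds one row per topic.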

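I want to extract a new dataframe that shows the communication between students through topics PER WEEK. It should contain five columns: "student source", "student destination", "week", "total topics" and "reply count".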

I should get something like this:

    st_source st_dest  week  total_topics  reply_count
        a        b       1        1             1
        a        b       2        2             1
        a        c       2        1             0
        b        a       1        1             0
        b        a       2        2             0
        b        c       2        1             0
        c        a       2        1             0
        c        b       2        1             1

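Student destination is the student with whom each student shared a topic.

Total topics is the number of topics shared with the other student. I found it using the following code: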

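import numpy as np

idx_cols = ['topic', 'week']
std_cols = ['student_x', 'student_y']
# self-merge: every pair of students that appear in the same topic and week
d1 = df.merge(df, on=idx_cols)
# drop self-pairs; copy so the assignment below does not hit a view
d2 = d1.loc[d1.student_x != d1.student_y, idx_cols + std_cols].copy()

# sort each pair alphabetically so (a, b) and (b, a) collapse into one row
d2[std_cols] = np.sort(d2[std_cols].to_numpy(), axis=1)

d3 = d2.drop_duplicates().groupby(
    std_cols + ['week']).size().reset_index(name='count')
d3.columns = ['st_source', 'st_dest', 'week', 'total_topics']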

I'm having trouble finding the last column, "reply count".

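Reply count is the number of times the student destination replied "directly" to the student source. If student A starts a topic (by sending a message at level 1) and B answers A (by sending a message at level 2), then B answered A directly. Consider a reply from B to A "direct" if and only if B replied at level k to a message of A at level k-1 in the same topic. Only the students' replies from level 2 to level 1 are counted.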
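The definition above can be sketched directly in pandas: pair each topic's level-1 row (the source) with its level-2 row (the destination), then count pairs per week. This is only a sketch under the assumption that each topic has one level-1 and at most one level-2 message, as in the sample data; the names `asked`, `replied` and `direct` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "topic":   [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    "student": list("abaababcbcab"),
    "level":   [1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5],
    "week":    [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2],
})

cols = ["topic", "week", "student"]
# Level-1 rows give the source (who asked).
asked = df.loc[df["level"] == 1, cols].rename(columns={"student": "st_source"})
# Level-2 rows give the destination (who replied directly).
replied = df.loc[df["level"] == 2, cols].rename(columns={"student": "st_dest"})

# One row per direct reply, ignoring students replying to themselves.
direct = asked.merge(replied, on=["topic", "week"])
direct = direct[direct["st_source"] != direct["st_dest"]]

reply_count = (
    direct.groupby(["st_source", "st_dest", "week"])
    .size()
    .reset_index(name="reply_count")
)
print(reply_count)
```

Pairs without a direct reply do not appear in `reply_count`. Left-merging it onto a frame that lists every ordered pair with its `total_topics`, and filling the missing counts with 0, would give the last column of the desired output.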

Does anyone have any suggestions?

Please let me know if I should explain this better.

Thanks!

1 answer:

Answer 0: (score: 1)

My suggestion:

I would use a dictionary with 'source-destination-week' as the key and (total_topics, reply_counts) as the value.

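Loop over the first dataframe. For each topic, store whoever posted the first message as the source, whoever posted the second message as the destination, and the week as the week, then increment the counter stored in the dictionary under the key 'source-destination-week'. I noticed that you no longer need to display pairs of students with no interaction, so I removed that. For example: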

from itertools import permutations

results = {}  # key "source-destination-week" -> [total_topics, reply_count]


def add_topic(users, week):
    """Count one shared topic for every ordered pair of its participants."""
    for pair in permutations(sorted(users), 2):
        key = "-".join(pair) + "-" + str(week)
        results.setdefault(key, [0, 0])[0] += 1


source = None  # author of the level-1 message of the current topic
prev_topic = None  # topic of the previous row, used to detect a topic change
topic_users = set()  # users seen so far in the current topic
week = None  # week of the current topic

# assumes df is the question's dataframe, sorted by topic and then by level
cols = ['topic', 'student', 'level', 'week']
for topic, student, level, row_week in df[cols].itertuples(index=False):

    if topic != prev_topic:  # a new topic starts
        if prev_topic is not None:  # close the previous topic, if there is one
            add_topic(topic_users, week)
        topic_users = set()
        source = None
        week = row_week
        prev_topic = topic

    topic_users.add(student)  # add the user to the topic's set of users

    if row_week != week:  # if the week differs, we print a message
        print("ERROR: Topic " + str(topic) + " extends over several weeks")

    if level == 1:  # initial message: its author is the source
        source = student
    elif level == 2 and source is not None and student != source:
        # second message: its author is the destination who replied directly
        key = "-".join((source, student, str(week)))
        results.setdefault(key, [0, 0])[1] += 1  # add to reply_count
        source = None  # reset so only one direct reply is counted per topic

# count the last topic, for which no topic change was detected
if topic_users:
    add_topic(topic_users, week)

Then you can convert the dictionary back into a dataframe. For example:

import pandas as pd

dico = {'b-a': [0, 1], 'b-c': [1, 1], 'a-b': [2, 1]}
df = pd.DataFrame.from_dict(dico, orient='index')
# rename() returns a new frame, so keep the result
df = df.rename(columns={0: 'topic', 1: 'reply'})
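With keys in the full 'source-destination-week' form described in the answer, each key can also be split back into separate columns to get the five-column layout the question asks for. The following is a minimal sketch, assuming student names contain no hyphen; the contents of `results` are illustrative sample values, and `rows` and `out` are hypothetical names.

```python
import pandas as pd

# Illustrative sample: "source-destination-week" -> [total_topics, reply_count]
results = {"a-b-1": [1, 1], "b-a-1": [1, 0], "c-b-2": [1, 1]}

rows = []
for key, (total_topics, reply_count) in results.items():
    # Assumes no student name contains "-", so the key splits into three parts.
    st_source, st_dest, week = key.split("-")
    rows.append((st_source, st_dest, int(week), total_topics, reply_count))

out = pd.DataFrame(
    rows,
    columns=["st_source", "st_dest", "week", "total_topics", "reply_count"],
)
out = out.sort_values(["st_source", "st_dest", "week"]).reset_index(drop=True)
print(out)
```

Using tuples such as `(source, destination, week)` as dictionary keys instead of joined strings would avoid the hyphen assumption altogether.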

I hope I didn't leave any typos in the code, since I couldn't test it... Feel free to ask any questions :)