I have the following data frame:
topic student level week
1 a 1 1
1 b 2 1
1 a 3 1
2 a 1 2
2 b 2 2
2 a 3 2
2 b 4 2
3 c 1 2
3 b 2 2
3 c 3 2
3 a 4 2
3 b 5 2
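It contains a column level that specifies who started a topic and who replied to whom. If a student's level is 1, he asked the question. If a student's level is 2, he replied to the student who asked the question. If a student's level is 3, he replied to the student at level 2, and so on.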
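For anyone who wants to reproduce this, the frame above can be rebuilt as follows. This is only a minimal sketch: the values are copied from the table, and the variable name df is the one used in the code further down.

```python
import pandas as pd

# Sample data copied from the table above; column names match its header.
df = pd.DataFrame({
    "topic":   [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    "student": ["a", "b", "a", "a", "b", "a", "b", "c", "b", "c", "a", "b"],
    "level":   [1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5],
    "week":    [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2],
})
```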
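I want to extract a new data frame that captures the communication between students through topics PER WEEK. It should contain five columns: "student source", "student destination", "week", "total topics" and "reply count".
I should get something like this: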
st_source st_dest week total_topics reply_count
a b 1 1 1
a b 2 2 1
a c 2 1 0
b a 1 1 0
b a 2 2 0
b c 2 1 0
c a 2 1 0
c b 2 1 1
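Student destination is the student with whom each student shared topics.
Total topics is the number of topics shared with the other student. I found it using the following code: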
import numpy as np

idx_cols = ['topic', 'week']
std_cols = ['student_x', 'student_y']
d1 = df.merge(df, on=idx_cols)
d2 = d1.loc[d1.student_x != d1.student_y, idx_cols + std_cols]
d2.loc[:, std_cols] = np.sort(d2.loc[:, std_cols])
d3 = d2.drop_duplicates().groupby(
    std_cols + ['week']).size().reset_index(name='count')
d3.columns = ['st_source', 'st_dest', 'week', 'total_topics']
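I am having trouble finding the last column, "reply count".
Reply count is the number of times the destination student replied "directly" to the source student. If student A starts a topic (by posting a message at level 1) and B answers A (by posting a message at level 2), then B replied directly to A. A reply from B to A is "direct" if and only if, in the same topic, B replies at level k to a message from A at level k-1. Only replies from level 2 to level 1 should be counted.
Does anyone have any suggestions?
Please let me know if I should explain this better.
Thanks!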
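Answer 0 (score: 1)
My suggestion:
I would use a dictionary with 'source-destination-week' as keys and (total_topics, reply_counts) as values.
Loop over the first data frame and, for each topic, store the student who posted the first message as the source, the student who posted the second message as the destination, and the week as week, then increment a counter in the dictionary under the key 'source-destination-week'. I noticed that you no longer need to show student pairs with no interaction, so I removed that. For example: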
from itertools import permutations

results = {}         # the dictionary where the results are stored
source = None        # author of the level-1 message, kept until the level-2 reply arrives
prev_topic = None    # topic of the previous row, used to detect a topic change
topic_users = set()  # set containing the current users of the topic
week = None          # week of the current topic, used to check it stays constant

def add_topic_counts(users, week):
    # add one shared topic for every ordered pair of users of the topic
    for pair in permutations(sorted(users), 2):
        key = "-".join(pair) + "-" + str(week)  # key based on source/destination/week
        if key not in results:  # if the key is new to the dictionary
            results[key] = [0, 0]  # new entry as a list: [topic_counts, reply_counts]
        results[key][0] += 1  # add a counter to the topic_counts

for row in df.itertuples(index=False, name=None):  # rows as (topic, student, level, week)
    if prev_topic != row[0]:  # if we encounter a new topic
        if prev_topic is not None:  # and it is not the first one
            add_topic_counts(topic_users, week)  # count the topic we just left
        topic_users = set()
        source = None
        week = row[3]  # we store the week
        prev_topic = row[0]
    elif week != row[3]:  # if the week differs within a topic, we print a message
        print("ERROR: Topic " + str(row[0]) + " extends over several weeks")
        # break  # uncomment the line to exit the for loop if the error is met
    topic_users.add(row[1])  # add the user to the topic's set of users
    if row[2] == 1:  # if it is an initial message
        source = row[1]  # we store the user as source
    elif row[2] == 2 and source is not None:  # if this is a second message
        destination = row[1]  # store the user as destination
        key = "-".join((source, destination, str(week)))  # key based on source/destination/week
        if key not in results:  # if the key is new to the dictionary
            results[key] = [0, 0]  # new entry as a list: [topic_counts, reply_counts]
        results[key][1] += 1  # add a counter to the reply_counts
        source = None  # reset source
    else:
        source = None  # any other level: no direct reply is pending

# redo the topic count feeding for the last topic (for which we didn't detect a change of topic)
if len(topic_users) > 0:
    add_topic_counts(topic_users, week)
You can then convert the dictionary back to a data frame. For example:
import pandas as pd

dico = {'b-a': [0,1], 'b-c' : [1,1], 'a-b': [2,1]}
df = pd.DataFrame.from_dict(dico, orient='index')
df = df.rename(columns={0: 'topic', 1: 'reply'})
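If the keys have the 'source-destination-week' form built in the loop, the index can also be split back into the three columns the question asks for. This is a minimal sketch, not a tested part of the original answer: it assumes student names contain no '-', and the sample dictionary simply reuses the values from the expected table in the question.

```python
import pandas as pd

# Sample dictionary in the 'source-destination-week' form; the values are
# taken from the expected table in the question.
results = {
    "a-b-1": [1, 1], "a-b-2": [2, 1], "a-c-2": [1, 0], "b-a-1": [1, 0],
    "b-a-2": [2, 0], "b-c-2": [1, 0], "c-a-2": [1, 0], "c-b-2": [1, 1],
}

out = pd.DataFrame.from_dict(
    results, orient="index", columns=["total_topics", "reply_count"]
)
# Split each key on '-' (assumption: student names contain no '-').
keys = out.index.to_series().str.split("-", expand=True)
keys.columns = ["st_source", "st_dest", "week"]
# Put the key columns in front of the counters and drop the string index.
out = pd.concat([keys, out], axis=1).reset_index(drop=True)
out["week"] = out["week"].astype(int)
out = out.sort_values(["st_source", "st_dest", "week"]).reset_index(drop=True)
```

Tuple keys such as (source, destination, week) would avoid the string splitting altogether, at the cost of departing from the string keys used in the loop.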
I hope I haven't made any typos in the code, as I couldn't test it... Feel free to ask any questions :)
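As an alternative to the dictionary loop, the same two counters can probably be obtained with pandas merges only. The sketch below is a different technique from the one described above, under two assumptions: each topic has exactly one level-1 message, and a topic never spans more than one week. The intermediate names (pairs, shared, starters, repliers, direct) are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "topic":   [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    "student": ["a", "b", "a", "a", "b", "a", "b", "c", "b", "c", "a", "b"],
    "level":   [1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5],
    "week":    [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2],
})
idx_cols = ["topic", "week"]
group_cols = ["student_src", "student_dst", "week"]

# total_topics: every ordered pair of distinct students sharing a topic.
pairs = df[idx_cols + ["student"]].drop_duplicates()
shared = pairs.merge(pairs, on=idx_cols, suffixes=("_src", "_dst"))
shared = shared[shared.student_src != shared.student_dst]
totals = shared.groupby(group_cols).size().reset_index(name="total_topics")

# reply_count: level-2 messages paired with the level-1 message of the topic.
starters = df.loc[df.level == 1, idx_cols + ["student"]]
repliers = df.loc[df.level == 2, idx_cols + ["student"]]
direct = starters.merge(repliers, on=idx_cols, suffixes=("_src", "_dst"))
replies = direct.groupby(group_cols).size().reset_index(name="reply_count")

# Pairs without a direct reply get a reply_count of 0.
out = totals.merge(replies, how="left", on=group_cols)
out["reply_count"] = out["reply_count"].fillna(0).astype(int)
out = out.rename(columns={"student_src": "st_source", "student_dst": "st_dest"})
```

On the sample data this should reproduce the table from the question, with both (a, b) and (b, a) rows present because the pairs are not sorted.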