Using pandas to identify matching pairs of records for further analysis

Time: 2019-12-29 15:36:05

Tags: pandas

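I ran the same multiple-choice survey at the start and at the end of the semester, and I want to analyze whether students' answers to the questions changed significantly between the two.

For various reasons, some students answered the first survey but not the second, or vice versa. I want to drop those students from the analysis.

Note that the students did not all answer at exactly the same time (or even on the same day). Some may have answered the day before or the day after the assignment, so I cannot rely on the date/time. I have to match on email address.

The questions are typically answered on a scale of "strongly agree", "agree", "not sure", "disagree", or "strongly disagree".

My data file looks like this: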

Email address: text
Time: date/time
Multiple Choice Q1: [agree, disagree, neutral]
Multiple Choice Q2: [agree, disagree, neutral]
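
  1. I need to filter out the records of students who did not answer twice (once at the start and once at the end of the semester).
  2. I need to come up with a way to quantify how much each answer changed.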
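
One possible way to make the answers quantifiable is to code them on an ordinal scale. The sketch below assumes a symmetric -2..2 coding; the labels are the ones that appear in the sample data row further down in the question, but the numeric values and the name `LIKERT` are illustrative choices, not something the question prescribes.

```python
import pandas as pd

# Assumed ordinal coding: labels come from the sample data row,
# the numeric values are an illustrative choice.
LIKERT = {
    "Strongly disagree": -2,
    "I disagree": -1,
    "I'm not sure": 0,
    "I agree": 1,
    "Strongly agree": 2,
}

answers = pd.Series(["I agree", "Strongly agree", "I'm not sure", "I disagree"])

# Labels missing from LIKERT would become NaN, which makes typos easy to spot.
coded = answers.map(LIKERT)
print(coded.tolist())
```

With numeric codes, the change for one student on one question is simply the end-of-semester code minus the start-of-semester code.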

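I have played with a lot of ideas, but they are all some form of brute-force, old-fashioned looping and saving.

I suspect there is a more elegant way to do this with pandas.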


Here is a mock-up of the input:

import pandas as pd

input = pd.DataFrame({'email':
                   ['joe@sample.com', 'jane@sample.com', 'jack@sample.com', 
                    'joe@sample.com', 'jane@sample.com', 'jack@sample.com', 'jerk@sample.com'],
                  'date': ['jan 1 2019', 'jan 2 2019', 'jan 1 2019',
                           'july 2, 2019', 'july 1 2019', 'july 1, 2019', 'july 1, 2019'],
                  'are you happy?': ["yes", "no", "no", "yes", "yes", "yes", "no"],
                  'are you smart?': ['no', 'no', 'no', 'yes', 'yes' , 'yes', 'yes']})

Here is a mock-up of the output:

output = pd.DataFrame({'question': ['are you happy?', 'are you smart?'],
                       'change score': [+0.6, +1]})

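Great exercise, thanks for the suggestion.

The logic behind the change score: for "are you happy?", Joe stayed the same while Jack and Jane went from "no" to "yes", so (0 + 1 + 1) / 3 ≈ 0.67 (shown as +0.6 in the output mock-up). For "are you smart?", all three went from "no" to "yes", so (1 + 1 + 1) / 3 = 1. jerk@sample.com is not counted because he responded only to the end survey, not to the start survey.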
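
The change-score logic described above could be computed directly along the following lines. This is only a sketch under stated assumptions: the mock-up data is retyped with ISO dates to keep parsing unambiguous, any response before July is treated as "start" and anything else as "end", "yes" is coded as 1 and every other answer as 0, and each student has at most one response per period (otherwise `pivot` raises). The names `period`, `wide`, `paired` and `score` are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["joe@sample.com", "jane@sample.com", "jack@sample.com",
              "joe@sample.com", "jane@sample.com", "jack@sample.com",
              "jerk@sample.com"],
    # ISO dates are an assumption made here for unambiguous parsing
    "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-01",
                            "2019-07-02", "2019-07-01", "2019-07-01",
                            "2019-07-01"]),
    "are you happy?": ["yes", "no", "no", "yes", "yes", "yes", "no"],
    "are you smart?": ["no", "no", "no", "yes", "yes", "yes", "yes"],
})
questions = ["are you happy?", "are you smart?"]

# Assumption: responses before July belong to the start survey.
df["period"] = df["date"].dt.month.lt(7).map({True: "start", False: "end"})

# Code "yes" as 1 and anything else as 0, then re-attach the keys.
coded = df[questions].eq("yes").astype(int).join(df[["email", "period"]])

# One row per student, one column per (question, period) pair.
wide = coded.pivot(index="email", columns="period", values=questions)

# Students missing either survey have NaN in some column and are dropped.
paired = wide.dropna()

# Per-student change (end minus start), then the mean per question.
change = paired.xs("end", axis=1, level=1) - paired.xs("start", axis=1, level=1)
score = change.mean()

output = score.rename("change score").rename_axis("question").reset_index()
print(output)
```

Because unmatched students drop out at the `dropna` step, the matching on email address and the scoring happen in the same pass, without an explicit loop.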


Here are the first two rows of my data file:

Timestamp,Email Address,How I see myself [I am comfortable in a leadership position],How I see myself [I like and am effective working in a team],How I see myself [I have a feel for business],How I see myself [I have a feel for marketing],How I see myself [I hope to start a company in the future],How I see myself [I like the idea of working at a large company with a global impact],"How I see myself [Carreerwise, I think working at a startup is very risky]","How I see myself [I prefer an unstructured, improvisational job]",How I see myself [I like to know exactly what is expected of me so I can excel],How I see myself [I've heard that I can make a lot of money in a startup and that is important to me so I can support myself and my family],How I see myself [I would never work at a significant company (like Google)],How I see myself [I definitely want to work at a significant company (like Facebook)],How I see myself [I have confidence in my intuitions about creating a successful business],How I see myself [The customer is always right],How I see myself [Don't ask users what they want: they don't know what they want],How I see myself [If you create what customers are asking for you will always be behind],"How I see myself [From the very start of designing a business, it is crucial to talk to users and customers]",What is your best guess of your career 3 years after your graduation?,Class,Year of expected graduation (undergrad or grad),"How I see myself [Imagine you've been working on a new product for months, then discover a competitor with a  similar idea.  
The best response to this is to feel encouraged because this means that what you are working on is a real problem.]",How I see myself [Most startups fail],How I see myself [Row 20],"How I see myself [For an entrepreneur, Strategic skills are more important than  having a great (people) network]","How I see myself [Strategic vision is crucial to success, so that one can consider what will happen several moves ahead]",How I see myself [It's important to stay focused on your studies rather than be dabbling in side projects or businesses],How I see myself [Row 23],How I see myself [Row 22]
8/30/2017 18:53:21,s@b.edu,I agree,Strongly agree,I agree,I'm not sure,I agree,I agree,I'm not sure,I agree,I agree,I'm not sure,I disagree,I disagree,I disagree,I disagree,I disagree,Strongly disagree,I agree,working with  film production company,Sophomore,2020,,,,,,,,

1 answer:

Answer 0: (score: 1)

Starting from your initial dataframe,

First, we convert your dates to proper datetimes.

df['date'] = pd.to_datetime(df['date'])
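
The mock-up mixes date styles ("jan 1 2019" and "july 2, 2019"). Depending on the pandas version, a single `pd.to_datetime` call may infer one format from the first value and reject the others. A minimal sketch of a per-value fallback, assuming `dateutil` (a pandas dependency) is acceptable for parsing:

```python
import pandas as pd
from dateutil import parser

raw = pd.Series(["jan 1 2019", "july 2, 2019", "july 1 2019"])

# Parse each string on its own, then normalize to a datetime64 Series.
parsed = pd.to_datetime(raw.map(parser.parse))
print(parsed.dt.month.tolist())
```

Per-value parsing is slower than a vectorized call with an explicit `format`, so it is mainly useful for small, messy inputs like survey exports.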

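Then we create two variables: the first checks that each person's email appears at least twice, and the second lists the months the responses must fall in (months 1 and 7).

(This assumes you may have duplicate entries.) .loc lets us use boolean conditions to select rows from the dataframe.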

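# True for every row whose email appears at least twice
s = df.groupby('email')['email'].transform('count') >= 2
months = [1, 7]  # start & end of semester
# keep rows from the survey months that belong to a repeat respondent
df2 = df.loc[(df['date'].dt.month.isin(months)) & (s)]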

print(df2)
             email       date are you happy? are you smart?
0   joe@sample.com 2019-01-01            yes             no
1  jane@sample.com 2019-01-02             no             no
2  jack@sample.com 2019-01-01             no             no
3   joe@sample.com 2019-07-02            yes            yes
4  jane@sample.com 2019-07-01            yes            yes
5  jack@sample.com 2019-07-01            yes            yes
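
A count of two or more does not by itself guarantee one response in each survey: a student who answered twice in January would also pass. A stricter variant, sketched below on made-up rows, counts the distinct survey months per email instead. The data and the names `in_window`, `n_months` and `paired` are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["joe@sample.com", "joe@sample.com",    # start and end
              "jane@sample.com", "jane@sample.com",  # twice in January only
              "jerk@sample.com"],                    # end only
    "date": pd.to_datetime(["2019-01-01", "2019-07-02",
                            "2019-01-02", "2019-01-03",
                            "2019-07-01"]),
})

months = [1, 7]  # start & end of semester
in_window = df["date"].dt.month.isin(months)

# Distinct survey months per email; rows outside the window become NaN
# and are ignored by nunique.
n_months = (
    df["date"].dt.month.where(in_window)
    .groupby(df["email"])
    .transform("nunique")
)

# Keep only students who answered in every survey month.
paired = df.loc[in_window & n_months.eq(len(months))]
print(paired)
```

Only joe@sample.com survives here, since he is the only one with a response in both month 1 and month 7.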

Now we need to reshape the data so that we can run some logical tests more easily.

df3 = (
    df2.set_index(["email", "date"])
    .stack()
    .reset_index()
    .rename(columns={0: "answer", "level_2": "question"})
    .sort_values(["email", "date"])
)

             email       date        question answer  
0  jack@sample.com 2019-01-01  are you happy?     no    
1  jack@sample.com 2019-01-01  are you smart?     no    
2  jack@sample.com 2019-07-01  are you happy?    yes    
3  jack@sample.com 2019-07-01  are you smart?    yes    
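
For comparison, the same wide-to-long reshape can also be written with `DataFrame.melt`, which names the new columns directly instead of renaming `level_2` and `0` afterwards. A small sketch on Jack's rows only (the two-row frame is retyped here for illustration):

```python
import pandas as pd

df2 = pd.DataFrame({
    "email": ["jack@sample.com", "jack@sample.com"],
    "date": pd.to_datetime(["2019-01-01", "2019-07-01"]),
    "are you happy?": ["no", "yes"],
    "are you smart?": ["no", "yes"],
})

# One row per (email, date, question) with the answer in its own column.
long = (
    df2.melt(id_vars=["email", "date"], var_name="question", value_name="answer")
    .sort_values(["email", "date", "question"])
    .reset_index(drop=True)
)
print(long)
```

Either form gives one row per student, date and question; which one to use is mostly a matter of taste.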

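Now we need to work out whether Jack's answers changed between the start and the end of the semester; if they did, we assign a score. We will make use of map and build a dictionary from your output dataframe.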

score_dict = dict(zip(output["question"], output["change score"]))

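# True where an answer differs from the same student's previous answer to that question.
# transform keeps df3's own index, so the mask lines up with df3 in .loc below
# (groupby.apply may prepend the group keys to the index in newer pandas versions).
s2 = (
    df3.groupby(["email", "question"])["answer"]
    .transform(lambda x: x.ne(x.shift()))
    .astype(bool)
)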

df3.loc[(s2) & (df3["date"].dt.month == 7), "score"] = df3["question"].map(
    score_dict
)

print(df3)

              email       date        question answer  score
4   jack@sample.com 2019-01-01  are you happy?     no    NaN
5   jack@sample.com 2019-01-01  are you smart?     no    NaN
10  jack@sample.com 2019-07-01  are you happy?    yes    0.6
11  jack@sample.com 2019-07-01  are you smart?    yes    1.0
2   jane@sample.com 2019-01-02  are you happy?     no    NaN
3   jane@sample.com 2019-01-02  are you smart?     no    NaN
8   jane@sample.com 2019-07-01  are you happy?    yes    0.6
9   jane@sample.com 2019-07-01  are you smart?    yes    1.0
0    joe@sample.com 2019-01-01  are you happy?    yes    NaN
1    joe@sample.com 2019-01-01  are you smart?     no    NaN
6    joe@sample.com 2019-07-02  are you happy?    yes    NaN
7    joe@sample.com 2019-07-02  are you smart?    yes    1.0

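Logically, we only want to apply a score to values that changed and that do not belong to the first month.

So Joe, who chose "yes" at the start of the semester and "yes" again at the end, gets NaN for the are you happy question.

You may want to add more logic to the scoring to treat Y / N differently, and judging by the first row of your file you will need to clean up your dataframe, but an approach along these lines should work.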
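
As one possible starting point for that clean-up, the long survey headers could be shortened to the text inside the square brackets. The sketch below uses a few headers copied from the question's file; the helper name `short_name` is made up for the illustration.

```python
import re
import pandas as pd

raw_columns = [
    "Timestamp",
    "Email Address",
    "How I see myself [I am comfortable in a leadership position]",
    "How I see myself [Most startups fail]",
    "Class",
]

def short_name(col: str) -> str:
    """Return the text inside a trailing [...] if present, else the name unchanged."""
    # DOTALL because at least one header in the file contains a line break.
    m = re.search(r"\[(.*)\]\s*$", col, flags=re.DOTALL)
    return m.group(1).strip() if m else col

# An empty frame is enough to demonstrate the renaming.
df = pd.DataFrame(columns=raw_columns).rename(columns=short_name)
print(list(df.columns))
```

Placeholder headers such as "Row 20" would come out as short but meaningless names, so they may still need to be dropped by hand.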