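I ran a multiple-choice survey at the start and at the end of the semester, and I want to analyze whether students' answers to each question changed significantly between the two.
For various reasons, some students answer the first survey but not the second, or vice versa. I want to drop those students from the analysis.
Note that the students do not all answer at exactly the same time (or even on the same day). Some may answer the day before or the day after it was assigned, so I can't rely on the date/time. I have to rely on matching email addresses.
The questions are typically answered on a scale: strongly agree, agree, not sure, disagree, strongly disagree.
My data file looks like this: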
Email address: text
Time: date/time
Multiple Choice Q1: [agree, disagree, neutral]
Multiple Choice Q2: [agree, disagree, neutral]
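I have played with a lot of ideas, but they all come down to some form of brute-force, old-fashioned looping and saving.
I suspect there is a more elegant way to do this with pandas.
Here is a mock-up of the input: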
input = pd.DataFrame({'email':
['joe@sample.com', 'jane@sample.com', 'jack@sample.com',
'joe@sample.com', 'jane@sample.com', 'jack@sample.com', 'jerk@sample.com'],
'date': ['jan 1 2019', 'jan 2 2019', 'jan 1 2019',
'july 2, 2019', 'july 1 2019', 'july 1, 2019', 'july 1, 2019'],
'are you happy?': ["yes", "no", "no", "yes", "yes", "yes", "no"],
'are you smart?': ['no', 'no', 'no', 'yes', 'yes' , 'yes', 'yes']})
And here is a mock-up of the output:
output = pd.DataFrame({'question': ['are you happy?', 'are you smart?'],
'change score': [+0.6, +1]})
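The logic of the change score: for "are you happy?", Joe stayed the same while Jack and Jane went from "no" to "yes", so (0 + 1 + 1) / 3 ≈ 0.67 (shown as +0.6 in the mock-up). For "are you smart?", all three went from "no" to "yes", so (1 + 1 + 1) / 3 = 1. jerk@sample.com is not counted because he responded only to the ending survey, not to the beginning one.
Here are the first two lines of my data file: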
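The arithmetic above can be sketched directly in pandas. This is only a minimal illustration, assuming that any changed answer counts as 1 and an unchanged one as 0, and that each respondent's earliest and latest rows are the start and end surveys. The name `survey` and the ISO-formatted dates are assumptions made for the sketch; the answers themselves are those from the mock-up.

```python
import pandas as pd

# Mock-up mirroring the question's data; ISO dates are used here (an assumption)
# so that date parsing does not depend on the pandas version.
survey = pd.DataFrame({
    "email": ["joe@sample.com", "jane@sample.com", "jack@sample.com",
              "joe@sample.com", "jane@sample.com", "jack@sample.com",
              "jerk@sample.com"],
    "date": ["2019-01-01", "2019-01-02", "2019-01-01",
             "2019-07-02", "2019-07-01", "2019-07-01", "2019-07-01"],
    "are you happy?": ["yes", "no", "no", "yes", "yes", "yes", "no"],
    "are you smart?": ["no", "no", "no", "yes", "yes", "yes", "yes"],
})
survey["date"] = pd.to_datetime(survey["date"])
questions = ["are you happy?", "are you smart?"]

# Keep only respondents who appear at least twice, then order by time.
counts = survey.groupby("email")["email"].transform("count")
paired = survey[counts >= 2].sort_values("date")

# Earliest and latest answer per respondent, indexed by email.
first = paired.groupby("email")[questions].first()
last = paired.groupby("email")[questions].last()

# True where the answer changed, False where it stayed the same;
# the mean of the booleans is the share of respondents who changed.
change_score = (first != last).mean()
print(change_score)
```

Matching on email rather than on timestamps means that a response submitted a day early or late still pairs up with the same student's other response.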
Timestamp,Email Address,How I see myself [I am comfortable in a leadership position],How I see myself [I like and am effective working in a team],How I see myself [I have a feel for business],How I see myself [I have a feel for marketing],How I see myself [I hope to start a company in the future],How I see myself [I like the idea of working at a large company with a global impact],"How I see myself [Carreerwise, I think working at a startup is very risky]","How I see myself [I prefer an unstructured, improvisational job]",How I see myself [I like to know exactly what is expected of me so I can excel],How I see myself [I've heard that I can make a lot of money in a startup and that is important to me so I can support myself and my family],How I see myself [I would never work at a significant company (like Google)],How I see myself [I definitely want to work at a significant company (like Facebook)],How I see myself [I have confidence in my intuitions about creating a successful business],How I see myself [The customer is always right],How I see myself [Don't ask users what they want: they don't know what they want],How I see myself [If you create what customers are asking for you will always be behind],"How I see myself [From the very start of designing a business, it is crucial to talk to users and customers]",What is your best guess of your career 3 years after your graduation?,Class,Year of expected graduation (undergrad or grad),"How I see myself [Imagine you've been working on a new product for months, then discover a competitor with a similar idea. 
The best response to this is to feel encouraged because this means that what you are working on is a real problem.]",How I see myself [Most startups fail],How I see myself [Row 20],"How I see myself [For an entrepreneur, Strategic skills are more important than having a great (people) network]","How I see myself [Strategic vision is crucial to success, so that one can consider what will happen several moves ahead]",How I see myself [It's important to stay focused on your studies rather than be dabbling in side projects or businesses],How I see myself [Row 23],How I see myself [Row 22]
8/30/2017 18:53:21,s@b.edu,I agree,Strongly agree,I agree,I'm not sure,I agree,I agree,I'm not sure,I agree,I agree,I'm not sure,I disagree,I disagree,I disagree,I disagree,I disagree,Strongly disagree,I agree,working with film production company,Sophomore,2020,,,,,,,,
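Before comparing answers in a file like this, it may help to shorten the column headers and turn the answer labels into numbers. The sketch below is one possible way to do that. The two-row frame `raw` is hypothetical (its second row is invented purely for illustration), and the numeric coding in `likert` is an assumption, not something the survey defines.

```python
import pandas as pd

# Two hypothetical rows shaped like the real export; the second row is invented.
raw = pd.DataFrame({
    "Timestamp": ["8/30/2017 18:53:21", "12/10/2017 09:15:00"],
    "Email Address": ["s@b.edu", "s@b.edu"],
    "How I see myself [I have a feel for business]": ["I agree", "Strongly agree"],
    "How I see myself [Most startups fail]": [None, "I'm not sure"],
})

# Strip the repeated "How I see myself [...]" wrapper from the question columns.
raw.columns = raw.columns.str.replace(
    r"^How I see myself \[(.*)\]$", r"\1", regex=True
)

# Assumed numeric coding of the five answer labels seen in the sample row.
likert = {"Strongly disagree": -2, "I disagree": -1, "I'm not sure": 0,
          "I agree": 1, "Strongly agree": 2}

question_cols = raw.columns.difference(["Timestamp", "Email Address"])
encoded = raw.copy()
# Unanswered questions (empty cells) stay missing after the mapping.
encoded[question_cols] = raw[question_cols].apply(lambda col: col.map(likert))
encoded["Timestamp"] = pd.to_datetime(
    encoded["Timestamp"], format="%m/%d/%Y %H:%M:%S"
)
print(encoded)
```

With numeric values, a change between the two surveys becomes a simple subtraction instead of a string comparison.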
Answer 0 (score: 1)
Starting from your initial dataframe,
first we convert your dates to proper datetimes.
df['date'] = pd.to_datetime(df['date'])
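Then we create two conditions: the first ensures that each person's email appears at least twice, and the second that the rows fall in month 1 or month 7 (assuming you may have duplicate entries).
.loc lets us apply boolean conditions to the dataframe.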
s = df.groupby('email')['email'].transform('count') >= 2
months = [1,7] # start & end of semester.
df2 = df.loc[(df['date'].dt.month.isin(months)) & (s)]
print(df2)
             email       date are you happy? are you smart?
0   joe@sample.com 2019-01-01            yes             no
1  jane@sample.com 2019-01-02             no             no
2  jack@sample.com 2019-01-01             no             no
3   joe@sample.com 2019-07-02            yes            yes
4  jane@sample.com 2019-07-01            yes            yes
5  jack@sample.com 2019-07-01            yes            yes
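One caveat with counting emails: a student who submits the same survey twice also reaches a count of 2. A hedged variant, assuming the two surveys fall in distinct calendar months as in the mock-up, is to require a response in each month. The respondents `dup@sample.com` and `late@sample.com` below are hypothetical, added only to show the edge cases.

```python
import pandas as pd

# Hypothetical responses: "dup@sample.com" answered the first survey twice
# and never answered the second; "late@sample.com" only answered the second.
df = pd.DataFrame({
    "email": ["joe@sample.com", "joe@sample.com",
              "dup@sample.com", "dup@sample.com", "late@sample.com"],
    "date": pd.to_datetime(["2019-01-01", "2019-07-02",
                            "2019-01-01", "2019-01-03", "2019-07-01"]),
    "are you happy?": ["yes", "yes", "no", "yes", "no"],
})

months = [1, 7]  # start and end of semester, as in the answer

# Restrict to the survey months and record which month each row belongs to.
in_window = (
    df[df["date"].dt.month.isin(months)]
    .assign(month=lambda d: d["date"].dt.month)
)

# Number of distinct survey months each respondent appears in.
waves = in_window.groupby("email")["month"].transform("nunique")
both = in_window[waves == len(months)]

# If someone resubmitted within a month, keep only their latest response.
latest = both.sort_values("date").drop_duplicates(["email", "month"], keep="last")
print(latest)
```

Here only joe@sample.com survives the filter, whereas a plain count of at least 2 would also have kept dup@sample.com.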
Now we need to reshape the data so that we can run some logical tests more easily.
df3 = (
df2.set_index(["email", "date"])
.stack()
.reset_index()
.rename(columns={0: "answer", "level_2": "question"})
.sort_values(["email", "date"])
)
             email       date        question answer
0  jack@sample.com 2019-01-01  are you happy?     no
1  jack@sample.com 2019-01-01  are you smart?     no
2  jack@sample.com 2019-07-01  are you happy?    yes
3  jack@sample.com 2019-07-01  are you smart?    yes
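The same wide-to-long reshape can also be written with `DataFrame.melt`, which names the new columns directly instead of renaming them afterwards. This is an alternative sketch rather than the answer's own code. The small `df2` here holds only Joe's and Jack's rows from the mock-up, and `long_df` is a name chosen for the illustration.

```python
import pandas as pd

# Joe's and Jack's rows from the mock-up, already filtered and parsed.
df2 = pd.DataFrame({
    "email": ["joe@sample.com", "jack@sample.com",
              "joe@sample.com", "jack@sample.com"],
    "date": pd.to_datetime(["2019-01-01", "2019-01-01",
                            "2019-07-02", "2019-07-01"]),
    "are you happy?": ["yes", "no", "yes", "yes"],
    "are you smart?": ["no", "no", "yes", "yes"],
})

# One row per (email, date, question); every non-id column becomes a question.
long_df = (
    df2.melt(id_vars=["email", "date"], var_name="question", value_name="answer")
       .sort_values(["email", "date", "question"])
       .reset_index(drop=True)
)
print(long_df)
```

Either form gives one row per respondent, date and question, which is what the per-question comparison needs.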
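Now we need to work out whether, say, Jack's answer changed between the start and the end of the semester and, if so, assign a score. We'll use map for this,
with a dictionary built from your output dataframe.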
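# Map each question to its target score, taken from the asker's `output` frame.
score_dict = dict(zip(output["question"], output["change score"]))
# True where an answer differs from the same person's previous answer to that question.
s2 = df3["answer"].ne(df3.groupby(["email", "question"])["answer"].shift())
df3.loc[s2 & (df3["date"].dt.month == 7), "score"] = df3["question"].map(score_dict)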
print(df3)
              email       date        question answer  score
4   jack@sample.com 2019-01-01  are you happy?     no    NaN
5   jack@sample.com 2019-01-01  are you smart?     no    NaN
10  jack@sample.com 2019-07-01  are you happy?    yes    0.6
11  jack@sample.com 2019-07-01  are you smart?    yes    1.0
2   jane@sample.com 2019-01-02  are you happy?     no    NaN
3   jane@sample.com 2019-01-02  are you smart?     no    NaN
8   jane@sample.com 2019-07-01  are you happy?    yes    0.6
9   jane@sample.com 2019-07-01  are you smart?    yes    1.0
0    joe@sample.com 2019-01-01  are you happy?    yes    NaN
1    joe@sample.com 2019-01-01  are you smart?     no    NaN
6    joe@sample.com 2019-07-02  are you happy?    yes    NaN
7    joe@sample.com 2019-07-02  are you smart?    yes    1.0
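Logically, we only want to apply a score to a value that has changed and that belongs to the end-of-semester month, not the first one.
So Joe, who chose "yes" at the start of the semester and "yes" again at the end, gets NaN for the are you happy?
question.
You may want to add more logic to the scoring so that Y/N answers are treated differently, and judging by your first row you will need to clean up your dataframe, but something along these lines should work.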
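As one example of treating Y/N differently, the answers could be coded numerically so that a move from "no" to "yes" counts as +1 and a move from "yes" to "no" as -1. The sketch below assumes a long-format frame shaped like df3, with an added `survey` column (an assumption) that labels each row as start or end. Only Joe's and Jane's answers from the mock-up are used.

```python
import pandas as pd

# Long-format answers shaped like df3, with a "survey" label (an assumption)
# marking whether a row comes from the start or the end of the semester.
long_df = pd.DataFrame({
    "email": ["joe@sample.com"] * 4 + ["jane@sample.com"] * 4,
    "survey": ["start", "start", "end", "end"] * 2,
    "question": ["are you happy?", "are you smart?"] * 4,
    "answer": ["yes", "no", "yes", "yes", "no", "no", "yes", "yes"],
})

# Assumed coding: "no" -> 0, "yes" -> 1.
long_df["value"] = long_df["answer"].map({"no": 0, "yes": 1})

# Put the start and end values side by side for each respondent and question.
wide = long_df.pivot_table(
    index=["email", "question"], columns="survey", values="value"
)

# +1 for no -> yes, -1 for yes -> no, 0 for no change.
wide["change"] = wide["end"] - wide["start"]

# Average signed change per question.
signed_score = wide.groupby(level="question")["change"].mean()
print(signed_score)
```

A signed score lets changes in opposite directions cancel out, which may or may not be what you want; counting any change as 1 measures movement regardless of direction.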