Results from a for loop to a dataframe, then to csv

Time: 2019-06-06 18:59:09

Tags: python python-3.x pandas

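I have a list of schools and the classes each one offers. I also have a list of unique classes; each school offers some of them and not others. I wrote a loop that prints the missing classes for each school along with the school's name, but I cannot get the full output of the for loop into a csv.

I have been able to write one school's classes to a csv, but not the complete result of the for loop covering all schools.

I know I need to put the results of the for loop into a dataframe. The next step would be to iterate over the dataframe and send the results to the csv row by row, but first I need to get the results from the for loop into a dataframe.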

Reading in the dataframes

import pandas as pd

schools = {'School': ['School A', 'School A', 'School A', 'School B', 'School B', 'School B', 'School C','School C', 'School D'], 'Class': ['Math', 'Chemistry', 'English', 'Math', 'Chemistry', 'English', 'Math', 'Chemistry', 'Physics']}
dfSchool = pd.DataFrame(data=schools)
dfSchool


classes = {'Class': ['Math', 'Chemistry', 'English', 'History', 'Physics']}
dfClasses = pd.DataFrame(data=classes)
dfClasses

The for loop

grouped = dfSchool.groupby('School')

for name, group in grouped:
    print(name)
    print(dfClasses[~(dfClasses.Class.isin(group["Class"]))])

Putting the results of the for loop into a dataframe (this code does not work)

listFinal = []
for name, group in grouped:
    print(name)
    print(dfClasses[~(dfClasses.Class.isin(group["Class"]))])
    listFinal.append(name)
    listFinal.append(dfClasses[~(dfClasses.Class.isin(group["Class"]))])

dfOutput = pd.DataFrame(listFinal)
dfOutput.to_csv('SchoolClasses.csv', index=True)

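Actual result: the console shows the output below, but when writing to csv I only get School A in the file. I want all of the output below (all schools) written to the csv file.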

School A
     Class
3  History
4  Physics
School B
     Class
3  History
4  Physics
School C
     Class
2  English
3  History
4  Physics
School D
       Class
0       Math
1  Chemistry
2    English
3    History

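Desired result: the output above, but in a single csv file. Bonus points if you can put the school name on every row next to its class, rather than only as a heading.

When I try to put the results of the for loop into a dataframe, I get: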

listFinal

['School A',      Class
 3  History
 4  Physics, 'School B',      Class
 3  History
 4  Physics, 'School C',      Class
 2  English
 3  History
 4  Physics, 'School D',        Class
 0       Math
 1  Chemistry
 2    English
 3    History]

4 answers:

Answer 0: (score: 1)

Create the schools dataframe:

import pandas as pd

schools = {
    "School": [
        "School A",
        "School A",
        "School A",
        "School B",
        "School B",
        "School B",
        "School C",
        "School C",
        "School D",
    ],
    "Class": [
        "Math",
        "Chemistry",
        "English",
        "Math",
        "Chemistry",
        "English",
        "Math",
        "Chemistry",
        "Physics",
    ],
}
dfSchool = pd.DataFrame(data=schools)
print(dfSchool)

     School      Class
0  School A       Math
1  School A  Chemistry
2  School A    English
3  School B       Math
4  School B  Chemistry
5  School B    English
6  School C       Math
7  School C  Chemistry
8  School D    Physics

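Create a dataframe representing the case where every school offers every class. Call it df_tot.

# c must be defined first, because s uses len(c)
c = ['Math', 'Chemistry', 'English', 'History', 'Physics']
s = ['School A'] * len(c) + ['School B'] * len(c) + ['School C'] * len(c) + ['School D'] * len(c)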

df_tot = pd.DataFrame([s, c*4], index=['School','Class']).T

print(df_tot)

      School      Class
0   School A       Math
1   School A  Chemistry
2   School A    English
3   School A    History
4   School A    Physics
5   School B       Math
6   School B  Chemistry
7   School B    English
8   School B    History
9   School B    Physics
10  School C       Math
11  School C  Chemistry
12  School C    English
13  School C    History
14  School C    Physics
15  School D       Math
16  School D  Chemistry
17  School D    English
18  School D    History
19  School D    Physics

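Do a left merge with indicator=True, then keep the rows where _merge == 'left_only'. These are the combinations in df_tot that have no matching row in dfSchool.

# Filter the merged frame itself, so the boolean mask and the rows share one index
merged = df_tot.merge(dfSchool, how='left', indicator=True)
df_tot = merged.loc[merged['_merge'] == 'left_only', ['School', 'Class']]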

print(df_tot)

      School      Class
3   School A    History
4   School A    Physics
8   School B    History
9   School B    Physics
12  School C    English
13  School C    History
14  School C    Physics
15  School D       Math
16  School D  Chemistry
17  School D    English
18  School D    History

Save to csv ...

df_tot.to_csv('anyfile.csv')
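
The full school-by-class table above is typed out by hand for four schools. As a rough sketch of the same merge idea, the combinations could instead be generated from the data with pd.MultiIndex.from_product. The names all_classes, df_all and df_missing are made up for this illustration and are not part of the answer above.

```python
import pandas as pd

dfSchool = pd.DataFrame({
    "School": ["School A", "School A", "School A", "School B", "School B",
               "School B", "School C", "School C", "School D"],
    "Class": ["Math", "Chemistry", "English", "Math", "Chemistry",
              "English", "Math", "Chemistry", "Physics"],
})
all_classes = ["Math", "Chemistry", "English", "History", "Physics"]

# Every (School, Class) combination, generated instead of typed out.
full_index = pd.MultiIndex.from_product(
    [dfSchool["School"].unique(), all_classes], names=["School", "Class"]
)
df_all = full_index.to_frame(index=False)

# A left merge keeps every combination; '_merge' flags the ones with no match.
merged = df_all.merge(dfSchool, on=["School", "Class"], how="left", indicator=True)
df_missing = merged.loc[merged["_merge"] == "left_only", ["School", "Class"]]

# With no path, to_csv returns the CSV text; pass a filename to write to disk.
csv_text = df_missing.to_csv(index=False)
print(csv_text)
```

This keeps the school name on every row, which is what the question asked for as a bonus.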

An alternative answer without dataframes

I wonder whether using dictionaries and json wouldn't simply be easier?

School = [
    "School A",
    "School A",
    "School A",
    "School B",
    "School B",
    "School B",
    "School C",
    "School C",
    "School D",
]

Class = [
    "Math",
    "Chemistry",
    "English",
    "Math",
    "Chemistry",
    "English",
    "Math",
    "Chemistry",
    "Physics",
]

List the classes that exist at each school.

A = list(zip(School, Class))

for item in A:
    print(item)

('School A', 'Math')
('School A', 'Chemistry')
('School A', 'English')
('School B', 'Math')
('School B', 'Chemistry')
('School B', 'English')
('School C', 'Math')
('School C', 'Chemistry')
('School D', 'Physics')

Put this into a dict:

d1 = {}
for item in A:
    d1.setdefault(item[0], []).append(item[1])

print(d1)

{'School A': ['Math', 'Chemistry', 'English'],
 'School B': ['Math', 'Chemistry', 'English'],
 'School C': ['Math', 'Chemistry'],
 'School D': ['Physics']}

Build a new dictionary with the items that are not in d1:

d2 = {}
for s in set(School):  
    for c in set(Class):
        if c in d1[s]:
            continue
        else:
            d2.setdefault(s,[]).append(c)


print(d2)

{'School C': ['Physics', 'English'],
 'School A': ['Physics'],
 'School B': ['Physics'],
 'School D': ['Math', 'Chemistry', 'English']}
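
Note that the loop above compares each school against set(Class), which only contains classes that some school offers, so History never appears in d2. A possible variant, assuming the full class list from the question is available as a plain list (called all_classes here, a name chosen for this sketch), might look like this:

```python
School = ["School A", "School A", "School A", "School B", "School B",
          "School B", "School C", "School C", "School D"]
Class = ["Math", "Chemistry", "English", "Math", "Chemistry",
         "English", "Math", "Chemistry", "Physics"]
# Assumed: the complete list of classes, taken from the question.
all_classes = ["Math", "Chemistry", "English", "History", "Physics"]

# school -> set of classes it offers
offered = {}
for school, cls in zip(School, Class):
    offered.setdefault(school, set()).add(cls)

# school -> classes from the full list that it does not offer, in list order
missing = {
    school: [c for c in all_classes if c not in taken]
    for school, taken in offered.items()
}
print(missing)
```

Iterating over the list rather than a set also keeps the order of the missing classes stable between runs.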

Then I would consider using a json file:

import json

with open('data.json', 'w') as fp:
    json.dump(d2, fp)
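
Since the question asks for a csv rather than json, a dictionary of this shape could also be flattened with the standard csv module. This is only a sketch: the d2 literal below copies the values printed above, and write_missing is a helper name invented for the example.

```python
import csv
import io

# Same shape as d2 above: school -> list of missing classes.
d2 = {
    "School A": ["Physics"],
    "School B": ["Physics"],
    "School C": ["Physics", "English"],
    "School D": ["Math", "Chemistry", "English"],
}

def write_missing(d, fp):
    """Write one 'School,Class' row per missing class, sorted for stable output."""
    writer = csv.writer(fp)
    writer.writerow(["School", "Class"])
    for school in sorted(d):
        for cls in sorted(d[school]):
            writer.writerow([school, cls])

# An in-memory buffer is used here; for a real file, open it with
# open("missing.csv", "w", newline="") and pass that handle instead.
buffer = io.StringIO()
write_missing(d2, buffer)
print(buffer.getvalue())
```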

Answer 1: (score: 1)

The following code aggregates all missing classes for each school into a set.

import pandas as pd

schools = {'School': ['School A', 'School A', 'School A', 'School B', 'School B', 'School B', 'School C','School C', 'School D'], 'Class': ['Math', 'Chemistry', 'English', 'Math', 'Chemistry', 'English', 'Math', 'Chemistry', 'Physics']}
dfSchool = pd.DataFrame(schools)

classes = {'Class': ['Math', 'Chemistry', 'English', 'History', 'Physics']}

set_classes = set(classes["Class"])
# For each school, keep the classes from the full list that the school does not offer
df = dfSchool.groupby('School').agg(lambda c: set_classes.difference(c))
# Rename the aggregated column; assigning df.name would not change the csv header
df = df.rename(columns={"Class": "MissingClasses"})
df.to_csv("SchoolClasses.csv")
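
The set-per-school result puts all missing classes for a school into a single csv cell. If one row per school and class is preferred, one possible follow-up is to aggregate into sorted lists and then use Series.explode. The sketch below assumes the same input data; the column name MissingClass is chosen only for illustration.

```python
import pandas as pd

dfSchool = pd.DataFrame({
    "School": ["School A", "School A", "School A", "School B", "School B",
               "School B", "School C", "School C", "School D"],
    "Class": ["Math", "Chemistry", "English", "Math", "Chemistry",
              "English", "Math", "Chemistry", "Physics"],
})
all_classes = {"Math", "Chemistry", "English", "History", "Physics"}

missing = (
    dfSchool.groupby("School")["Class"]
    # sorted() turns the set difference into a list with a stable order
    .agg(lambda offered: sorted(all_classes.difference(offered)))
    # one row per list element, repeating the school in the index
    .explode()
    .rename("MissingClass")
    .reset_index()
)

# With no path, to_csv returns the CSV text; pass a filename to write to disk.
csv_text = missing.to_csv(index=False)
print(csv_text)
```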

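Answer 2: (score: 1)

This is just a direct answer to how to write what you are already printing to a csv file. So I kept your algorithm and only slightly changed what goes into the listFinal list: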

listFinal = []
for name, group in grouped:
    print(name)
    print(dfClasses[~(dfClasses.Class.isin(group["Class"]))])
    # add a new column with the school name to the dataframe appended to the list
    listFinal.append(dfClasses[~(dfClasses.Class.isin(group["Class"]))]
                     .assign(School=name))

Then a simple pd.concat lets us write everything to a csv file:

dfOutput = pd.concat(listFinal)
dfOutput.to_csv('SchoolClasses.csv', index=True)
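
The original attempt failed because listFinal mixed strings and whole dataframes, which pd.DataFrame cannot lay out as rows. Collecting only dataframes that already carry a School column avoids that. A self-contained sketch of the loop, with two optional tweaks that are assumptions of this example rather than part of the answer (School as the first column, and no index column in the file), could be:

```python
import pandas as pd

dfSchool = pd.DataFrame({
    "School": ["School A", "School A", "School A", "School B", "School B",
               "School B", "School C", "School C", "School D"],
    "Class": ["Math", "Chemistry", "English", "Math", "Chemistry",
              "English", "Math", "Chemistry", "Physics"],
})
dfClasses = pd.DataFrame(
    {"Class": ["Math", "Chemistry", "English", "History", "Physics"]}
)

frames = []
for name, group in dfSchool.groupby("School"):
    # Classes from the full list that this school does not offer
    missing = dfClasses[~dfClasses["Class"].isin(group["Class"])]
    # Tag every row with the school so the name survives the concat
    frames.append(missing.assign(School=name))

# Stack the per-school frames, renumber the rows, and put School first
dfOutput = pd.concat(frames, ignore_index=True)[["School", "Class"]]

# With no path, to_csv returns the CSV text; pass a filename to write to disk.
csv_text = dfOutput.to_csv(index=False)
print(csv_text)
```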

Answer 3: (score: 1)

One option is to use pandas.DataFrame.groupby.apply

import pandas as pd


schools = {'School': ['School A', 'School A', 'School A', 
                      'School B', 'School B', 'School B',
                      'School C', 'School C', 'School D'],
           'Class': ['Math', 'Chemistry', 'English',
                     'Math', 'Chemistry', 'English',
                     'Math', 'Chemistry', 'Physics']
           }

classes = {'Class': ['Math', 'Chemistry', 'English', 'History', 'Physics']}

df_school = pd.DataFrame(data=schools)
df_classes = pd.DataFrame(data=classes)

missing = (df_school.groupby('School')
                    .apply(lambda group: df_classes[~(df_classes["Class"].isin(group["Class"]))])
                    .droplevel(-1)
                    )
missing.to_csv("missing_classes.csv")

Result:

>>> missing
              Class
School             
School A    History
School A    Physics
School B    History
School B    Physics
School C    English
School C    History
School C    Physics
School D       Math
School D  Chemistry
School D    English
School D    History

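missing_classes.csv

School,Class
School A,History
School A,Physics
School B,History
School B,Physics
School C,English
School C,History
School C,Physics
School D,Math
School D,Chemistry
School D,English
School D,History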