Data Transformation

Question

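I am filtering an external data source that is formatted as an Excel file. I cannot change how the file is generated. I need to filter out the useless rows and combine pairs of rows into a single row. My process so far handles the filtering, but not joining the related data from two consecutive rows into one row.

The dataframes do not paste well into Stack Overflow, so I have manually adjusted them below.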

Data Transformation

Convert the download into a useful format

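import os

import pandas as pd
from pandas import DataFrame

# download_path and classes_report are assumed to be defined earlier
cpath = os.path.join(download_path, classes_report)
print(pd.__version__)

# sheetname is the spelling in pandas 0.14.1; newer releases call it sheet_name
df = pd.read_excel(cpath, sheetname=0, header=None)
df.to_string()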

0.14.1

    0                     1                            2         3      4     5
0   Session: 2014-2015    NaN                          NaN       NaN    NaN   NaN
1   Class Information     Age                          Enrolled  Key    Room  NaN
2   Math                  10                           12 / 18   03396  110   09:00:00
3   Teacher: Joe M        Teacher                      NaN       NaN    NaN   NaN
4   NaN                   NaN                          NaN       NaN    NaN   NaN
5                                                      NaN       NaN    NaN   NaN
6                                                      NaN       NaN    NaN   NaN
7   NaN                   NaN                          NaN       NaN    NaN   NaN
8   NaN                   NaN                          NaN       NaN    NaN   NaN
9   Number of Classes: 1  Number of Students: 12 / 18  NaN       NaN    NaN   NaN
10  Class Information     Ages                         Enrolled  Key    Room  NaN
11  Art                   18 - 80                      3 / 24    03330  110   10:00:00
12  Teacher: John A       Instructor                   NaN       NaN    NaN   NaN
13  NaN                   NaN                          NaN       NaN    NaN   NaN
14                                                     NaN       NaN    NaN   NaN
15                                                     NaN       NaN    NaN   NaN

# Eliminate any rows where first column is NaN, contains 'Number of Classes', 'Class Information'
# or is blank
# The 5th column is tuition.

cf = df[df[0].notnull()][1:]
cf = cf[~cf[0].str.contains('Number of Classes')]
bf = cf[~cf[0].isin([' ', 'Class Information'])]
bf.to_string()

    0                1           2        3      4    5
2   Math             10          12 / 18  03396  110  09:00:00
3   Teacher: Joe M   Teacher     NaN      NaN    NaN  NaN
11  Art              18 - 80     3 / 24   03330  110  10:00:00
12  Teacher: John A  Instructor  NaN      NaN    NaN  NaN

left = DataFrame(bf.values[::2], index=bf.index[::2])
right = DataFrame(bf.values[1::2], index=bf.index[1::2])
pd.concat([left, right], axis=1).to_string()

    0     1        2        3      4    5         0                1           2    3    4    5
2   Math  10       12 / 18  03396  110  09:00:00  NaN              NaN         NaN  NaN  NaN  NaN
3   NaN   NaN      NaN      NaN    NaN  NaN       Teacher: Joe M   Teacher     NaN  NaN  NaN  NaN
11  Art   18 - 80  3 / 24   03330  110  10:00:00  NaN              NaN         NaN  NaN  NaN  NaN
12  NaN   NaN      NaN      NaN    NaN  NaN       Teacher: John A  Instructor  NaN  NaN  NaN  NaN

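The goal here is for the last five columns of the "Math" row to contain the columns that start with "Teacher:", and likewise for the "Art" row, leaving a dataframe with two rows instead of four.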

Answer 1

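Your attempt to concat aligns the 2 dfs by index, which produces a disjointed df with 4 rows instead of 2:

right = DataFrame(bf.values[1::2], index=bf.index[1::2])

The line above creates a new df from the values of your df, but it also carries over the original index values. Since the left and right dfs have the same number of rows and you want to concatenate them column-wise, their indices need to align, so you can reuse the index from the left df:

right = DataFrame(bf.values[1::2], index=left.index)

This will produce the desired concatenated df.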
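
To see the difference side by side, here is a minimal sketch. The frame `bf` below is a hand-built stand-in for the filtered frame in the question (values copied from the printed output, index labels 2, 3, 11, 12), and the names `right_misaligned`, `misaligned` and `combined` are only illustrative. It assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the filtered frame `bf` from the question,
# rebuilt by hand from the printed output (index labels 2, 3, 11, 12).
bf = pd.DataFrame(
    [
        ["Math", "10", "12 / 18", "03396", "110", "09:00:00"],
        ["Teacher: Joe M", "Teacher", np.nan, np.nan, np.nan, np.nan],
        ["Art", "18 - 80", "3 / 24", "03330", "110", "10:00:00"],
        ["Teacher: John A", "Instructor", np.nan, np.nan, np.nan, np.nan],
    ],
    index=[2, 3, 11, 12],
)

# Even-positioned rows (the class rows) keep their labels 2 and 11.
left = pd.DataFrame(bf.values[::2], index=bf.index[::2])

# Original attempt: the teacher rows keep labels 3 and 12, so concat takes
# the union of both indexes and produces four half-empty rows.
right_misaligned = pd.DataFrame(bf.values[1::2], index=bf.index[1::2])
misaligned = pd.concat([left, right_misaligned], axis=1)

# Fix from the answer: reuse left's index so the rows pair up by label.
right = pd.DataFrame(bf.values[1::2], index=left.index)
combined = pd.concat([left, right], axis=1)

print(misaligned.shape)
print(combined.shape)
print(combined.to_string())
```

Calling `reset_index(drop=True)` on both halves before concatenating should align them by position as well, at the cost of losing the original row labels.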

How do I concatenate 2 rows of a dataframe into 1 row of a new dataframe?
