I am trying to split and then merge a Pandas DataFrame.
The columns of the original DataFrame are laid out as follows:
dataTime Record1Field1 ... Record1FieldN Record2Field1 ... Record2FieldN
time1 << record 1 data >> << record 2 data >>
I want to split the Record2 fields out into a separate DataFrame, tempdf, indexed by dataTime. tempdf would therefore look like this:
dataTime Record2Field1 ... Record2FieldN
time1 << record 2 data >>
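After populating tempdf, I would delete the Record2 columns from the original DataFrame. The first difficulty I have run into is creating the tempdf that holds the record 2 data.
Next, I want to rename the columns in tempdf so that they line up with the Record1 columns in the original DataFrame. (I know how to do this part.)
Finally, I want to merge tempdf back into the original DataFrame.
The end result should look like this: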
dataTime Record1Field1 ... Record1FieldN
time1 <<record 1 data>>
time1 <<record 2 data>>
So far I have not settled on a good way to do this. Any help is appreciated! Thanks.
Answer 0 (score: 1)
Another way to clean up and merge two datasets:
df3 = df1[8:]
df4 = df2[8:]
tmp_col1 = [1,2,3,4,5,6,7,8]
tmp_col2 = [1,2,3,4,5,6,7,8,9]
tmp_col3 = [1,2,3,4,5,6,7]
col_name1= df1.columns[0]
col_name2 = df2.columns[0]
df5 = df3[df3[col_name1].notna()]
df6 = df4[df4[col_name2].notna()]
data = df1.iloc[[2],[6]].values[0]
print(data)
df5.columns = tmp_col1
df6.columns = tmp_col2
df5 = df5[[1,2,3,4,6,7]]
df5 = df5.reset_index()
df5.drop(df5.columns[[0]], axis=1, inplace=True)
df5[8] = pd.Series([data])
df6 = df6[[1,2,3,4,6,9,8]]
df6 = df6.reset_index()
df6.drop(df6.columns[[0]], axis=1, inplace=True)
print(df5)
print(df6)
df5.columns = tmp_col3
df6.columns = tmp_col3
dfs=[df5,df6]
df7 = pd.concat(dfs)
df7.columns = ["","","",""]
print(df7)
Answer 1 (score: 0)
Try using concat.
Something along these lines:
Combined = [DataFrame1,DataFrame2]
Together = pandas.concat(Combined)
As others have commented, merge may also be a good option.
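As a rough illustration of the concat suggestion, here is a minimal sketch. The frames `record1` and `record2` and their values are made up for the example; it assumes both frames already share the same column names and a dataTime index.

```python
import pandas as pd

# Hypothetical data: two frames with identical columns and the same dataTime index
idx = pd.Index(["time1", "time2"], name="dataTime")
record1 = pd.DataFrame({"Field1": [1, 2], "Field2": [3, 4]}, index=idx)
record2 = pd.DataFrame({"Field1": [5, 6], "Field2": [7, 8]}, index=idx)

# Stack the rows of both frames, then group rows with the same timestamp together.
# mergesort is stable, so record1 rows stay ahead of record2 rows for each time.
combined = pd.concat([record1, record2]).sort_index(kind="mergesort")
print(combined)
```

concat only appends rows here; it does not match rows on a key the way merge does, which is why the column names need to agree beforehand.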
Answer 2 (score: 0)
If you know which columns you want to select, use
tempdf = df[['a','b']]
Otherwise, to select the last 2 columns, use
tempdf = df[df.columns[-2:]]
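Applied to the layout in the question, positional selection might look like the sketch below. The sample frame and `n_fields` are assumptions for illustration; it only works if the Record2 columns really are the last `n_fields` columns.

```python
import pandas as pd

# Hypothetical frame with two fields per record, indexed by dataTime
df = pd.DataFrame({
    "dataTime": ["time1", "time2"],
    "Record1Field1": [1, 2],
    "Record1Field2": [3, 4],
    "Record2Field1": [5, 6],
    "Record2Field2": [7, 8],
}).set_index("dataTime")

n_fields = 2  # assumed number of fields per record

# Take the trailing Record2 block; .copy() makes tempdf independent of df
tempdf = df[df.columns[-n_fields:]].copy()

# Remove the same columns from the original frame
df = df.drop(columns=tempdf.columns)
print(tempdf)
print(df)
```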
Answer 3 (score: 0)
To answer your immediate question, you can use df.filter with a regex pattern to select the columns of the form Record2FieldN:
In [29]: tempdf = df.filter(regex=r'Record2.*'); tempdf
Out[29]:
Record2Field0 Record2Field1 Record2Field2
0 3 8 4
1 2 6 3
2 1 2 2
3 5 9 4
You can rename the columns with tempdf.rename:
tempdf = tempdf.rename(columns={'Record2Field{}'.format(i):'Record1Field{}'.format(i) for i in range(3)})
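and drop the Record2 fields from df:
df = df.drop(['Record2Field{}'.format(i) for i in range(3)], axis=1)
However, you can solve the overall problem better by replacing the flat column names RecordMFieldN with a 2-level MultiIndex that separates the Record from the Field.
This gives you enough control to stack the data into the desired format: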
import numpy as np
import pandas as pd
np.random.seed(2016)
ncols, nrows = 3, 4
def make_dataframe(ncols, nrows):
    columns = ['Record{}Field{}'.format(i, j) for i in range(1,3)
               for j in range(ncols)]
    df = pd.DataFrame(np.random.randint(10, size=(nrows, 2*ncols)), columns=columns)
    df['dataTime'] = pd.date_range('2000-1-1', periods=nrows)
    return df
df = make_dataframe(ncols, nrows)
# stash the `dataTime` in the row index so we can reassign
# the column index to `new_index`
result = df.set_index('dataTime')
new_index = pd.MultiIndex.from_product([[1,2], df.columns[:ncols]],
                                       names=['record', 'field'])
result.columns = new_index
# Now the problem can be solved by stacking.
result = result.stack('record')
result.index = result.index.droplevel('record')
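For comparison, the first approach in this answer (filter, rename, drop) can be chained with pd.concat to reach the same row layout. This is only a sketch on a small made-up frame; renaming through a callable and the stable sort at the end are choices made for this example, not part of the answer above.

```python
import pandas as pd

# Hypothetical frame: two records with two fields each, indexed by dataTime
df = pd.DataFrame({
    "dataTime": pd.date_range("2000-01-01", periods=2),
    "Record1Field0": [1, 2],
    "Record1Field1": [3, 4],
    "Record2Field0": [5, 6],
    "Record2Field1": [7, 8],
}).set_index("dataTime")

# Split out the Record2 columns
tempdf = df.filter(regex=r"^Record2")

# Rename them so they line up with the Record1 columns
tempdf = tempdf.rename(columns=lambda name: name.replace("Record2", "Record1"))

# Drop the Record2 columns from the original frame
df = df.drop(columns=df.filter(regex=r"^Record2").columns)

# Append the Record2 rows and keep rows for the same time next to each other
result = pd.concat([df, tempdf]).sort_index(kind="mergesort")
print(result)
```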
Answer 4 (score: 0)
You can get all the Record2 values under the Record1 columns as shown below:
Data setup:
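from io import StringIO
import pandas as pd
data = StringIO(
'''
dataTime Record1Field1 Record1Field2 Record1Field3 Record2Field1 Record2Field2 Record2Field3
01-01-2015 1 2 3 4 5 6
''')
# sep=r'\s+' splits on whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(data, sep=r'\s+', parse_dates=['dataTime'])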
print (df)
dataTime Record1Field1 Record1Field2 Record1Field3 Record2Field1 \
0 2015-01-01 1 2 3 4
Record2Field2 Record2Field3
0 5 6
Operations:
df.set_index('dataTime', inplace=True)
# Select the columns whose names start with Record2
tempdf = df[[col for col in list(df) if col.startswith('Record2')]].copy()
# Drop those columns from df now that tempdf holds them
df.drop(tempdf.columns, inplace=True, axis=1)
# Rename the columns so they line up with Record1 for concatenation
tempdf.columns = [col for col in list(df) if col.startswith('Record1')]
# Concatenate row-wise (DataFrame.append was removed in pandas 2.0)
print (pd.concat([df, tempdf]))
Record1Field1 Record1Field2 Record1Field3
dataTime
2015-01-01 1 2 3
2015-01-01 4 5 6
Answer 5 (score: 0)
Try this code. It works by splitting df on its blank rows, adding an identifier to each dataset, and then merging them together.
df_list = np.split(df, df[df.isnull().all(1)].index)
df0=df_list[0]
data = df0.iloc[[0],[0]].values[0]
df1=df_list[1]
df2= df_list[2]
df1['status'] = ''
df2['status'] = ''
df3 = df2[3:-1]
df4 = df1[3:-1]
dfs=[df4,df3]
df5= pd.concat(dfs)
col=[]
for i in df.iloc[8]:
col.append(i)
col.append('status')
df5.columns= col
df5= df5.reset_index()
df5.drop(df5.columns[[0]], axis=1, inplace=True)
df5['ID'] = pd.Series([data])
print(df5)
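The split-on-blank-rows idea can also be sketched on a tiny made-up frame. Note that this version swaps in a groupby over a running count of blank rows instead of np.split, and the column names and the `ID` values are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two blocks of data separated by one fully blank row
df = pd.DataFrame({"a": [1, 2, np.nan, 3, 4], "b": [5, 6, np.nan, 7, 8]})

# True for rows where every column is missing
blank = df.isnull().all(axis=1)

# The running count of blank rows gives each block its own number
block_id = blank.cumsum()

# Drop the blank rows and collect one DataFrame per block
blocks = [group for _, group in df[~blank].groupby(block_id[~blank])]

# Tag each block with an identifier and merge them back together
merged = pd.concat(
    [block.assign(ID=i) for i, block in enumerate(blocks)],
    ignore_index=True,
)
print(merged)
```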
Answer 6 (score: 0)
If you want to split based on the values in a column:
col_name = df.columns[0]
ict = df[df[col_name] == 'CT'].index
print(ict)
df_list = np.split(df, ict)
df1 = df_list[0]
df2 = df_list[1]
df1['status'] = ''
df2['status'] = ''
df1 = df1[9:]
df2 = df2[4:-4]
dfs=[df1,df2]
df3= pd.concat(dfs)
col=[]
for i in df.iloc[8]:
col.append(i)
col.append('status')
df3.columns= col
df3 = df3.reset_index()
df3.drop(df3.columns[[0]], axis=1, inplace=True)
data = df.iloc[[0],[0]].values[0]
df3['ID'] = pd.Series([data])
print(df3)