I have a list of DataFrames that I am trying to combine using the concat function.
dataframe_lists = [df1, df2, df3]
result = pd.concat(dataframe_lists, keys = ['one', 'two','three'], ignore_index=True)
The full traceback is:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-198-a30c57d465d0> in <module>()
----> 1 result = pd.concat(dataframe_lists, keys = ['one', 'two','three'], ignore_index=True)
2 check(dataframe_lists)
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
753 verify_integrity=verify_integrity,
754 copy=copy)
--> 755 return op.get_result()
756
757
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in get_result(self)
924
925 new_data = concatenate_block_managers(
--> 926 mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy)
927 if not self.copy:
928 new_data._consolidate_inplace()
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
4061 copy=copy),
4062 placement=placement)
-> 4063 for placement, join_units in concat_plan]
4064
4065 return BlockManager(blocks, axes)
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in <listcomp>(.0)
4061 copy=copy),
4062 placement=placement)
-> 4063 for placement, join_units in concat_plan]
4064
4065 return BlockManager(blocks, axes)
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_join_units(join_units, concat_axis, copy)
4150 raise AssertionError("Concatenating join units along axis0")
4151
-> 4152 empty_dtype, upcasted_na = get_empty_dtype_and_na(join_units)
4153
4154 to_concat = [ju.get_reindexed_values(empty_dtype=empty_dtype,
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in get_empty_dtype_and_na(join_units)
4139 return np.dtype('m8[ns]'), tslib.iNaT
4140 else: # pragma
-> 4141 raise AssertionError("invalid dtype determination in get_concat_dtype")
4142
4143
AssertionError: invalid dtype determination in get_concat_dtype
I believe the error is that one of the dataframes is empty. I use a simple function, check, to verify this and to return the headers of any empty dataframes:
def check(list_of_df):
    headers = []
    for df in list_of_df:
        # Collect the column headers of any empty dataframe.
        if df.empty:
            headers.append(df.columns)
    return headers
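For reference, a minimal usage sketch (with the dataframe_lists defined above):

empty_headers = check(dataframe_lists)
for columns in empty_headers:
    # Each entry is the Index of column names from one empty dataframe.
    print(list(columns))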
I would like to know whether I can use this function so that, in the case of an empty dataframe, only the empty dataframe's headers are returned and appended to the concatenated dataframe. The output would be a single row of headers (and, where column names are duplicated, just a single instance of each header, as with the concat function). I have two example data sources: one, a non-empty dataset, and two, which is empty.
I would like the resulting concatenation to have these column headers...
'AT','AccountNum', 'AcctType', 'Amount', 'City', 'Comment', 'Country','DuplicateAddressFlag', 'FromAccount', 'FromAccountNum', 'FromAccountT','PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress','PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1','Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA','WC', 'Zip'
with the headers of the empty dataframe added to this row (where they are new):
'A', 'AT','AccountNum', 'AcctType', 'Amount', 'B', 'C', 'City', 'Comment', 'Country', 'D', 'DuplicateAddressFlag', 'E', 'F', 'FromAccount', 'FromAccountNum', 'FromAccountT', 'G', 'PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress','PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1','Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA','WC', 'Zip'
I welcome feedback on the best approach to this.
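A minimal sketch of that idea (my own assumption about the approach, using the dataframe_lists defined above) could be: concat only the non-empty frames, then reindex the result against the union of every frame's columns, so that columns that exist only in the empty frames appear as empty columns:

import pandas as pd

# Sketch: concat the non-empty frames, then add any columns that
# only appear in the empty frames.
non_empty = [df for df in dataframe_lists if not df.empty]
result = pd.concat(non_empty, ignore_index=True)

all_columns = result.columns
for df in dataframe_lists:
    all_columns = all_columns.union(df.columns)  # union keeps one instance of each name

result = result.reindex(columns=all_columns)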
As the answer below details, this is an unexpected result:
Unfortunately, due to the sensitivity of this material, I cannot share the actual data. What produced the contents of the gists is the following:
A = data[data['RRT'] == 'A']  # Select just the rows where 'RRT' equals 'A' from the dataframe "data"
B = data[data['RRT'] == 'B']
C = data[data['RRT'] == 'C']
D = data[data['RRT'] == 'D']
For each new dataframe, I then apply this logic:
for column_name, column in A.transpose().iterrows():
    AColumns = A[['ANum', 'RTID', 'Description', 'Type', 'Status', 'AD', 'CD', 'OD', 'RCD']]  # select these columns from the dataframe "A"
When I look at the bound method AColumns.count on the empty dataframe A, this is the output:
<bound method DataFrame.count of Empty DataFrame
Columns: [ANum,RTID, Description,Type,Status, AD, CD, OD, RCD]
Index: []>
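Note that AColumns.count without parentheses only shows the bound method itself; actually calling it would return the per-column counts, which are all zero here:

AColumns.count()  # returns a Series of zeros for an empty dataframe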
Finally, I imported the CSV with the following:
data=pd.read_csv('Merged_Success2.csv', dtype=str, error_bad_lines = False, iterator=True, chunksize=1000)
data=pd.concat([chunk for chunk in data], ignore_index=True)
I am not sure what else I can provide. The concat approach works for all of the other dataframes needed to meet the requirements. I have also looked at pandas' internals.py and the full traceback. Either I have too many NaN columns, duplicate column names, or mixed dtypes (the last being the least likely culprit).
Thanks again for your guidance.
Answer 0 (score: 11)
We ran into the same error in one of our projects. After debugging, we found the problem: one of our dataframes had two columns with the same name. After renaming one of those columns, the problem was solved.
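As an illustration only (not the answerer's actual code), duplicate column names could be renamed with a small helper before concatenating:

def dedupe_columns(df):
    # Give repeated column names a numeric suffix so every name is unique.
    counts = {}
    new_names = []
    for name in df.columns:
        counts[name] = counts.get(name, 0) + 1
        new_names.append(name if counts[name] == 1 else '{}.{}'.format(name, counts[name] - 1))
    renamed = df.copy()
    renamed.columns = new_names
    return renamed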
Answer 1 (score: 7)
This usually means that you have two columns with the same name in one of your dataframes. You can check whether that is the case by looking at the output of len(df.columns) > len(np.unique(df.columns)) for each dataframe df that you are trying to concatenate.
You can use Counter to find the culprit columns, for example:
from collections import Counter
duplicates = [c for c in Counter(df.columns).items() if c[1] > 1]
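For example, the same check could be run over every frame in the list (a sketch assuming the dataframe_lists from the question):

from collections import Counter

for i, df in enumerate(dataframe_lists):
    duplicates = [c for c in Counter(df.columns).items() if c[1] > 1]
    if duplicates:
        print('dataframe {} has duplicate columns: {}'.format(i, duplicates))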
Answer 2 (score: 2)
I have noticed that this can happen when concatenating or appending empty dataframes. Try the following example:
my_headers = ['A', 'B', 'C']
I have a DataFrame, df_input, with values whose headers are not necessarily the same as my_headers.
from pandas import DataFrame

# Build a single-row, all-None dataframe whose columns are my_headers.
dictionary = {element: None for element in my_headers}
df = DataFrame(dictionary, index=[0])
# Append the two dataframes
df_final = df_input.append(df)
Answer 3 (score: 0)
I cannot reproduce your error; it works fine for me:
import pandas as pd

df1 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/42708e6a3ca0aed9b79b/raw/f37738994c3285e1b670d3926e716ae027dc30bc/sample_data.csv')
df2 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/26eb4ce1578e0844eb82/raw/23d9063dad7793d87a2fed2275857c85b59d56bb/sample2.csv')
df3 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/0721bd8b71416b54eccd/raw/b7ecae63beff88bd076a93d83500eb5fa67e1278/empty_df.csv')
pd.concat([df1,df2,df3], keys = ['one', 'two','three'], ignore_index=True).head()
Out[68]:
  'B' 'C' 'D' 'E' 'F' 'G' 'A'  AT AccountNum AcctType  ...
0 NaN NaN NaN NaN NaN NaN NaN NaN        NaN      NaN  ...
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
  ToAccountNum ToAccountT TransferAmount TransferMade TransferTimestamp
0          NaN        NaN              4         True     1/7/2000 0:00
1 NaN NaN 4 True 1/8/2000 0:00
2 NaN NaN 6 True 1/9/2000 0:00
3 NaN NaN 6 True 1/10/2000 0:00
4 NaN NaN 0 False 1/11/2000 0:00
Ttype Unnamed: 0 WA WC Zip
0 D 4 NaN NaN NaN
1 D 5 NaN NaN NaN
2 D 13 NaN NaN NaN
3 D 14 NaN NaN NaN
4 T 25 NaN NaN NaN
[5 rows x 41 columns]