Solution for "AssertionError: invalid dtype determination in get_concat_dtype" when concatenating a list of DataFrames

Date: 2015-09-09 20:17:34

Tags: python csv pandas

I have a list of DataFrames that I am trying to combine with the concat function.

dataframe_lists = [df1, df2, df3]

result = pd.concat(dataframe_lists, keys = ['one', 'two','three'], ignore_index=True)

The full traceback is:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-198-a30c57d465d0> in <module>()
----> 1 result = pd.concat(dataframe_lists, keys = ['one', 'two','three'], ignore_index=True)
      2 check(dataframe_lists)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    753                        verify_integrity=verify_integrity,
    754                        copy=copy)
--> 755     return op.get_result()
    756 
    757 

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in get_result(self)
    924 
    925             new_data = concatenate_block_managers(
--> 926                 mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy)
    927             if not self.copy:
    928                 new_data._consolidate_inplace()

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   4061                                                 copy=copy),
   4062                          placement=placement)
-> 4063               for placement, join_units in concat_plan]
   4064 
   4065     return BlockManager(blocks, axes)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in <listcomp>(.0)
   4061                                                 copy=copy),
   4062                          placement=placement)
-> 4063               for placement, join_units in concat_plan]
   4064 
   4065     return BlockManager(blocks, axes)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_join_units(join_units, concat_axis, copy)
   4150         raise AssertionError("Concatenating join units along axis0")
   4151 
-> 4152     empty_dtype, upcasted_na = get_empty_dtype_and_na(join_units)
   4153 
   4154     to_concat = [ju.get_reindexed_values(empty_dtype=empty_dtype,

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in get_empty_dtype_and_na(join_units)
   4139         return np.dtype('m8[ns]'), tslib.iNaT
   4140     else:  # pragma
-> 4141         raise AssertionError("invalid dtype determination in get_concat_dtype")
   4142 
   4143 

AssertionError: invalid dtype determination in get_concat_dtype

I believe the error arises because one of the DataFrames is empty. I use the simple function check below to verify this and return the headers of the empty DataFrames:

def check(list_of_df):
    # collect the headers of every empty DataFrame in the list
    headers = []
    for df in list_of_df:  # iterate over the parameter, not the global list
        if df.empty:
            headers.append(df.columns)
    return headers

I wonder whether I can use this function so that, when a DataFrame is empty, only its headers are returned and appended to the concatenated DataFrame. The output would be a single row of headers (and, where column names repeat, only a single instance of each header, as with the concat function). I have two sample data sources: one non-empty dataset and one empty one.

I would like the resulting concatenation to have these column headers...

 'AT', 'AccountNum', 'AcctType', 'Amount', 'City', 'Comment', 'Country', 'DuplicateAddressFlag', 'FromAccount', 'FromAccountNum', 'FromAccountT', 'PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress', 'PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1', 'Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA', 'WC', 'Zip'

...with the headers of the empty DataFrame appended to this row (where they are new):

 'A', 'AT', 'AccountNum', 'AcctType', 'Amount', 'B', 'C', 'City', 'Comment', 'Country', 'D', 'DuplicateAddressFlag', 'E', 'F', 'FromAccount', 'FromAccountNum', 'FromAccountT', 'G', 'PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress', 'PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1', 'Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA', 'WC', 'Zip'

I would welcome feedback on the best approach here.
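One way to get that result, sketched with made-up stand-in frames (df1, df2, df3 and their columns here are assumptions, not the real data): concatenate only the non-empty DataFrames, then graft the empty frames' headers on as all-NaN columns.

```python
import pandas as pd

# Hypothetical stand-ins for the frames in the question: two populated, one empty.
df1 = pd.DataFrame({'AccountNum': [1, 2], 'City': ['X', 'Y']})
df2 = pd.DataFrame({'AccountNum': [3], 'Zip': ['99999']})
df3 = pd.DataFrame(columns=['A', 'B', 'C'])  # empty frame with its own headers

frames = [df1, df2, df3]

# Concatenate only the non-empty frames to sidestep the AssertionError...
non_empty = [df for df in frames if not df.empty]
result = pd.concat(non_empty, ignore_index=True)

# ...then append the empty frames' headers as new (all-NaN) columns.
for df in frames:
    if df.empty:
        for col in df.columns:
            if col not in result.columns:
                result[col] = float('nan')

print(sorted(result.columns))  # ['A', 'AccountNum', 'B', 'C', 'City', 'Zip']
```

Duplicate headers collapse automatically here, since assigning to an existing column name overwrites rather than duplicates it.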

As the answer below details, this was an unexpected result:

Unfortunately, because of the sensitivity of this material, I cannot share the actual data. The key points are as follows:

A = data[data['RRT'] == 'A']  # select the rows where RRT == 'A' from the DataFrame "data"
B= data[data['RRT'] == 'B']
C= data[data['RRT'] == 'C']
D= data[data['RRT'] == 'D']

For each new DataFrame I then apply this logic:

for column_name, column in A.transpose().iterrows():
    AColumns = A[['ANum', 'RTID', 'Description', 'Type', 'Status', 'AD', 'CD', 'OD', 'RCD']]  # select these columns from the DataFrame "A"

When I access the bound method count on the empty DataFrame A (note that without parentheses this returns the bound method itself rather than calling it):
AColumns.count

This is the output:

<bound method DataFrame.count of Empty DataFrame
Columns: [ANum,RTID, Description,Type,Status, AD, CD, OD, RCD]
Index: []>

Finally, I imported the CSV with the following:

data = pd.read_csv('Merged_Success2.csv', dtype=str, error_bad_lines=False, iterator=True, chunksize=1000)
data = pd.concat([chunk for chunk in data], ignore_index=True)

I am not sure what else I can provide. The concat method works for all the other DataFrames I need. I have also looked at pandas' internals.py and the full traceback. Either I have too many NaN columns, duplicate column names, or mixed dtypes (the last being the least likely culprit).

Thanks again for your guidance.

4 answers:

Answer 0 (score: 11)

We ran into the same error in one of our projects. After debugging, we found the problem: one of our DataFrames had two columns with the same name. After renaming one of those columns, the problem was solved.
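A minimal sketch of that fix, using a hypothetical frame (the 'RTID'/'Amount' labels are invented): detect the duplicated labels, then rename later occurrences by suffixing a counter.

```python
import pandas as pd

# Hypothetical frame with two columns both named 'RTID'.
df = pd.DataFrame([[1, 2, 3]], columns=['RTID', 'Amount', 'RTID'])

# Detect the colliding labels...
dupes = df.columns[df.columns.duplicated()].unique().tolist()
print(dupes)  # ['RTID']

# ...and rename repeats by position so every label is unique again.
cols = list(df.columns)
seen = {}
for i, c in enumerate(cols):
    seen[c] = seen.get(c, 0) + 1
    if seen[c] > 1:
        cols[i] = '%s_%d' % (c, seen[c])
df.columns = cols
print(list(df.columns))  # ['RTID', 'Amount', 'RTID_2']
```

After deduplication, pd.concat no longer hits the ambiguous-dtype path that raises the AssertionError.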

Answer 1 (score: 7)

This usually means that you have two columns with the same name in one of your DataFrames.

You can check whether this is the case by looking at the output of

len(df.columns) > len(np.unique(df.columns))

for each DataFrame df you are trying to concatenate.

You can find the culprit columns with Counter, for example:

from collections import Counter
duplicates = [c for c in Counter(df.columns).items() if c[1] > 1]
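For example, run against a hypothetical frame with a repeated 'Zip' column (the frame below is invented for illustration), the check reports each offending label with its count:

```python
from collections import Counter
import pandas as pd

# Hypothetical frame with a duplicated 'Zip' column.
df = pd.DataFrame([[1, 2, 3]], columns=['Zip', 'City', 'Zip'])

# Keep only the (label, count) pairs that occur more than once.
duplicates = [c for c in Counter(df.columns).items() if c[1] > 1]
print(duplicates)  # [('Zip', 2)]
```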

Answer 2 (score: 2)

I have noticed that this can happen when concatenating or appending empty DataFrames. Try the following example:

    my_headers = ['A', 'B', 'C']

I have a DataFrame df_input with values, whose headers are not necessarily the same as my_headers.

    from pandas import DataFrame

    dictionary = {element: None for element in my_headers}
    df = DataFrame(dictionary, index=[0])
    # append the two DataFrames
    df_final = df_input.append(df)
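A self-contained version of this idea (df_input's columns below are invented for illustration; DataFrame.append was removed in pandas 2.0, so pd.concat is used as the equivalent): a one-row, all-None frame carries the desired headers, and appending it gives the union of both column sets.

```python
import pandas as pd

my_headers = ['A', 'B', 'C']

# Hypothetical input frame whose headers only partly overlap my_headers.
df_input = pd.DataFrame({'A': [1], 'D': [4]})

# One all-None row carrying the desired headers...
dictionary = {element: None for element in my_headers}
df = pd.DataFrame(dictionary, index=[0])

# ...appended so the result holds the union of both column sets.
df_final = pd.concat([df_input, df], ignore_index=True)
print(sorted(df_final.columns))  # ['A', 'B', 'C', 'D']
```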

Answer 3 (score: 0)

I cannot reproduce your error; it works fine for me:

df1 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/42708e6a3ca0aed9b79b/raw/f37738994c3285e1b670d3926e716ae027dc30bc/sample_data.csv')
df2 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/26eb4ce1578e0844eb82/raw/23d9063dad7793d87a2fed2275857c85b59d56bb/sample2.csv')
df3 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/0721bd8b71416b54eccd/raw/b7ecae63beff88bd076a93d83500eb5fa67e1278/empty_df.csv')
pd.concat([df1,df2,df3], keys = ['one', 'two','three'], ignore_index=True).head()

Out[68]: 
   'B'  'C'  'D'  'E'  'F'  'G'  'A'  AT  AccountNum  AcctType ...    
0  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
1  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
2  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
3  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
4  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    

   ToAccountNum  ToAccountT  TransferAmount  TransferMade  TransferTimestamp  
0           NaN         NaN               4          True      1/7/2000 0:00   
1           NaN         NaN               4          True      1/8/2000 0:00   
2           NaN         NaN               6          True      1/9/2000 0:00   
3           NaN         NaN               6          True     1/10/2000 0:00   
4           NaN         NaN               0         False     1/11/2000 0:00   

   Ttype  Unnamed: 0  WA   WC  Zip  
0      D           4 NaN  NaN  NaN  
1      D           5 NaN  NaN  NaN  
2      D          13 NaN  NaN  NaN  
3      D          14 NaN  NaN  NaN  
4      T          25 NaN  NaN  NaN  

[5 rows x 41 columns]