Solution for "AssertionError: invalid dtype determination in get_concat_dtype" when concatenating a list of DataFrames

Date: 2015-09-09 20:17:34

Tags: python csv pandas

I have a list of DataFrames that I am trying to combine with the concat function.

dataframe_lists = [df1, df2, df3]

result = pd.concat(dataframe_lists, keys = ['one', 'two','three'], ignore_index=True)

The full traceback is:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-198-a30c57d465d0> in <module>()
----> 1 result = pd.concat(dataframe_lists, keys = ['one', 'two','three'], ignore_index=True)
      2 check(dataframe_lists)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    753                        verify_integrity=verify_integrity,
    754                        copy=copy)
--> 755     return op.get_result()
    756 
    757 

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in get_result(self)
    924 
    925             new_data = concatenate_block_managers(
--> 926                 mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy)
    927             if not self.copy:
    928                 new_data._consolidate_inplace()

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   4061                                                 copy=copy),
   4062                          placement=placement)
-> 4063               for placement, join_units in concat_plan]
   4064 
   4065     return BlockManager(blocks, axes)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in <listcomp>(.0)
   4061                                                 copy=copy),
   4062                          placement=placement)
-> 4063               for placement, join_units in concat_plan]
   4064 
   4065     return BlockManager(blocks, axes)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_join_units(join_units, concat_axis, copy)
   4150         raise AssertionError("Concatenating join units along axis0")
   4151 
-> 4152     empty_dtype, upcasted_na = get_empty_dtype_and_na(join_units)
   4153 
   4154     to_concat = [ju.get_reindexed_values(empty_dtype=empty_dtype,

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in get_empty_dtype_and_na(join_units)
   4139         return np.dtype('m8[ns]'), tslib.iNaT
   4140     else:  # pragma
-> 4141         raise AssertionError("invalid dtype determination in get_concat_dtype")
   4142 
   4143 

AssertionError: invalid dtype determination in get_concat_dtype

I believe the error arises because one of the DataFrames is empty. I use the simple function check below to verify this and return the headers of the empty DataFrames:

def check(list_of_df):
    # collect the headers of every empty DataFrame in the list
    headers = []
    for df in list_of_df:  # iterate over the parameter, not the global list
        if df.empty:
            headers.append(df.columns)
    return headers

I wonder whether I can use this function so that, when a DataFrame is empty, only its headers are returned and appended to the concatenated DataFrame. The output would be a single row of headers (and, where column names repeat, only a single instance of each header, as with the concat function). I have two sample data sources: one non-empty dataset and one empty one.

I would like the resulting concatenation to have these column headers...

 'AT', 'AccountNum', 'AcctType', 'Amount', 'City', 'Comment', 'Country', 'DuplicateAddressFlag', 'FromAccount', 'FromAccountNum', 'FromAccountT', 'PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress', 'PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1', 'Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA', 'WC', 'Zip'

...with the headers of the empty DataFrame appended to this row (where they are new):

 'A', 'AT', 'AccountNum', 'AcctType', 'Amount', 'B', 'C', 'City', 'Comment', 'Country', 'D', 'DuplicateAddressFlag', 'E', 'F', 'FromAccount', 'FromAccountNum', 'FromAccountT', 'G', 'PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress', 'PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1', 'Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA', 'WC', 'Zip'

I would welcome feedback on the best approach here.
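One way to get that result, sketched with made-up stand-in frames (df1, df2, df3 and their columns here are assumptions, not the real data): concatenate only the non-empty DataFrames, then graft the empty frames' headers on as all-NaN columns.

```python
import pandas as pd

# Hypothetical stand-ins for the frames in the question: two populated, one empty.
df1 = pd.DataFrame({'AccountNum': [1, 2], 'City': ['X', 'Y']})
df2 = pd.DataFrame({'AccountNum': [3], 'Zip': ['99999']})
df3 = pd.DataFrame(columns=['A', 'B', 'C'])  # empty frame with its own headers

frames = [df1, df2, df3]

# Concatenate only the non-empty frames to sidestep the AssertionError...
non_empty = [df for df in frames if not df.empty]
result = pd.concat(non_empty, ignore_index=True)

# ...then append the empty frames' headers as new (all-NaN) columns.
for df in frames:
    if df.empty:
        for col in df.columns:
            if col not in result.columns:
                result[col] = float('nan')

print(sorted(result.columns))  # ['A', 'AccountNum', 'B', 'C', 'City', 'Zip']
```

Duplicate headers collapse automatically here, since assigning to an existing column name overwrites rather than duplicates it.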

As the answer below details, this was an unexpected result:

Unfortunately, because of the sensitivity of this material, I cannot share the actual data. The key points are as follows:

A = data[data['RRT'] == 'A']  # select the rows where RRT == 'A' from the DataFrame "data"
B= data[data['RRT'] == 'B']
C= data[data['RRT'] == 'C']
D= data[data['RRT'] == 'D']

For each new DataFrame I then apply this logic:

for column_name, column in A.transpose().iterrows():
    AColumns = A[['ANum', 'RTID', 'Description', 'Type', 'Status', 'AD', 'CD', 'OD', 'RCD']]  # select these columns from the DataFrame "A"

When I access the bound method count on the empty DataFrame A (note that without parentheses this returns the bound method itself rather than calling it):
AColumns.count

This is the output:

<bound method DataFrame.count of Empty DataFrame
Columns: [ANum,RTID, Description,Type,Status, AD, CD, OD, RCD]
Index: []>

Finally, I imported the CSV with the following:

data = pd.read_csv('Merged_Success2.csv', dtype=str, error_bad_lines=False, iterator=True, chunksize=1000)
data = pd.concat([chunk for chunk in data], ignore_index=True)

I am not sure what else I can provide. The concat method works for all the other DataFrames I need. I have also looked at pandas' internals.py and the full traceback. Either I have too many NaN columns, duplicate column names, or mixed dtypes (the last being the least likely culprit).

Thanks again for your guidance.

4 answers:

Answer 0 (score: 11)

We ran into the same error in one of our projects. After debugging, we found the problem: one of our DataFrames had two columns with the same name. After renaming one of those columns, the problem was solved.
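A minimal sketch of that fix, using a hypothetical frame (the 'RTID'/'Amount' labels are invented): detect the duplicated labels, then rename later occurrences by suffixing a counter.

```python
import pandas as pd

# Hypothetical frame with two columns both named 'RTID'.
df = pd.DataFrame([[1, 2, 3]], columns=['RTID', 'Amount', 'RTID'])

# Detect the colliding labels...
dupes = df.columns[df.columns.duplicated()].unique().tolist()
print(dupes)  # ['RTID']

# ...and rename repeats by position so every label is unique again.
cols = list(df.columns)
seen = {}
for i, c in enumerate(cols):
    seen[c] = seen.get(c, 0) + 1
    if seen[c] > 1:
        cols[i] = '%s_%d' % (c, seen[c])
df.columns = cols
print(list(df.columns))  # ['RTID', 'Amount', 'RTID_2']
```

After deduplication, pd.concat no longer hits the ambiguous-dtype path that raises the AssertionError.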

Answer 1 (score: 7)

This usually means that you have two columns with the same name in one of your DataFrames.

You can check whether this is the case by looking at the output of

len(df.columns) > len(np.unique(df.columns))

for each DataFrame df you are trying to concatenate.

You can find the culprit columns with Counter, for example:

from collections import Counter
duplicates = [c for c in Counter(df.columns).items() if c[1] > 1]
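For example, run against a hypothetical frame with a repeated 'Zip' column (the frame below is invented for illustration), the check reports each offending label with its count:

```python
from collections import Counter
import pandas as pd

# Hypothetical frame with a duplicated 'Zip' column.
df = pd.DataFrame([[1, 2, 3]], columns=['Zip', 'City', 'Zip'])

# Keep only the (label, count) pairs that occur more than once.
duplicates = [c for c in Counter(df.columns).items() if c[1] > 1]
print(duplicates)  # [('Zip', 2)]
```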

Answer 2 (score: 2)

I have noticed that this can happen when concatenating or appending empty DataFrames. Try the following example:

    my_headers = ['A', 'B', 'C']

I have a DataFrame df_input with values, whose headers are not necessarily the same as my_headers.

    from pandas import DataFrame

    dictionary = {element: None for element in my_headers}
    df = DataFrame(dictionary, index=[0])
    # append the two DataFrames
    df_final = df_input.append(df)
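A self-contained version of this idea (df_input's columns below are invented for illustration; DataFrame.append was removed in pandas 2.0, so pd.concat is used as the equivalent): a one-row, all-None frame carries the desired headers, and appending it gives the union of both column sets.

```python
import pandas as pd

my_headers = ['A', 'B', 'C']

# Hypothetical input frame whose headers only partly overlap my_headers.
df_input = pd.DataFrame({'A': [1], 'D': [4]})

# One all-None row carrying the desired headers...
dictionary = {element: None for element in my_headers}
df = pd.DataFrame(dictionary, index=[0])

# ...appended so the result holds the union of both column sets.
df_final = pd.concat([df_input, df], ignore_index=True)
print(sorted(df_final.columns))  # ['A', 'B', 'C', 'D']
```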

Answer 3 (score: 0)

I cannot reproduce your error; it works fine for me:

df1 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/42708e6a3ca0aed9b79b/raw/f37738994c3285e1b670d3926e716ae027dc30bc/sample_data.csv')
df2 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/26eb4ce1578e0844eb82/raw/23d9063dad7793d87a2fed2275857c85b59d56bb/sample2.csv')
df3 = pd.read_csv('https://gist.githubusercontent.com/ahlusar1989/0721bd8b71416b54eccd/raw/b7ecae63beff88bd076a93d83500eb5fa67e1278/empty_df.csv')
pd.concat([df1,df2,df3], keys = ['one', 'two','three'], ignore_index=True).head()

Out[68]: 
   'B'  'C'  'D'  'E'  'F'  'G'  'A'  AT  AccountNum  AcctType ...    
0  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
1  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
2  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
3  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    
4  NaN  NaN  NaN  NaN  NaN  NaN  NaN NaN         NaN       NaN ...    

   ToAccountNum  ToAccountT  TransferAmount  TransferMade  TransferTimestamp  
0           NaN         NaN               4          True      1/7/2000 0:00   
1           NaN         NaN               4          True      1/8/2000 0:00   
2           NaN         NaN               6          True      1/9/2000 0:00   
3           NaN         NaN               6          True     1/10/2000 0:00   
4           NaN         NaN               0         False     1/11/2000 0:00   

   Ttype  Unnamed: 0  WA   WC  Zip  
0      D           4 NaN  NaN  NaN  
1      D           5 NaN  NaN  NaN  
2      D          13 NaN  NaN  NaN  
3      D          14 NaN  NaN  NaN  
4      T          25 NaN  NaN  NaN  

[5 rows x 41 columns]