Question

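I am trying to join two data frames (df1 and df2) on matching values in a column called 'Name' that appears in both. I have tried R's inner_join function as well as Python's pandas merge function, and I got both to work on smaller subsets of my data. I think my problem is the size of my data frames.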

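My data frames are as follows:

df1 has the 'Name' column plus 5 additional columns, and about 900 rows.
df2 has the 'Name' column plus ~2 million additional columns, and also has 900 rows.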

I have tried (in R):

df3 <- inner_join(x = df1, y = df2, by = 'Name')

I have also tried (in Python, where df1 and df2 are pandas data frames):

df3 = df1.merge(right = df2, how = 'inner', left_on = 1, right_on = 0)

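(where the 'Name' column is at index 1 of df1 and index 0 of df2)

When I apply the above to my full data frames, it runs for a very long time and eventually crashes. I suspect the problem is the 2 million columns in df2, so I tried splitting it (by row) into smaller data frames. My plan was to join each small subset of df2 with df1 and then row-bind the resulting data frames at the end. However, joining even the smaller partitions of df2 was unsuccessful.

I would appreciate any suggestions anyone can offer.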

Answer 1

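Thanks, everyone, for your help! Using data.table, as @shadowtalker suggested, sped up the process tremendously. For reference, in case anyone is trying to do something similar: df1 was approximately 400 MB and my df2 file was approximately 3 GB.

I was able to accomplish the task as follows: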

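library(data.table)
# Convert both data frames to data.tables (setDT converts by reference)
df1 <- setDT(df1)
df2 <- setDT(df2)
# Key both tables on the join column
setkey(df1, Name)
setkey(df2, Name)
# Inner join: nomatch = 0 drops rows of df2 that have no match in df1
df3 <- df1[df2, nomatch = 0]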
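
For readers working in pandas rather than R, a loosely analogous keyed join can be sketched by indexing both frames on 'Name' and calling DataFrame.join. This is only an illustration on hypothetical toy frames (the columns x and y are made up), and I have not checked whether it helps at the scale described in the question.

```python
import pandas as pd

# Hypothetical toy frames standing in for df1 and df2
df1 = pd.DataFrame({"Name": ["a", "b", "c"], "x": [1, 2, 3]})
df2 = pd.DataFrame({"Name": ["b", "c", "d"], "y": [20, 30, 40]})

# Index both frames by the join column, loosely mirroring setkey()
left = df1.set_index("Name")
right = df2.set_index("Name")

# Inner join on the index keeps only names present in both frames
df3 = left.join(right, how="inner").reset_index()
print(df3)
```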

Answer 2

This is a very ugly workaround in which I break up df2's columns and add them piece by piece. I'm not sure it will work, but it may be worth a try:

import numpy as np

# First, I only grab the "Name" column from df2
df3 = df1.merge(right=df2[["Name"]], how="inner", on="Name")

# Then I save all the column headers (excluding
# the "Name" column) in a separate list
df2_columns = df2.columns[np.logical_not(df2.columns.isin(["Name"]))]

# This determines how many columns are going to get added each time.
num_cols_per_loop = 1000

# And this just calculates how many times you'll need to go through the loop
# given the number of columns you set to get added each loop
# (ceiling division, so an exact multiple does not add an empty pass)
num_loops = (len(df2_columns) + num_cols_per_loop - 1) // num_cols_per_loop

for i in range(num_loops):
    # For each run of the loop, we determine which columns will get added
    this_column_sublist = df2_columns[i*num_cols_per_loop : (i+1)*num_cols_per_loop]

    # You also need to add the "Name" column to make sure 
    # you get the observations in the right order
    this_column_sublist = np.append("Name",this_column_sublist)

    # Finally, merge with just the subset of df2
    df3 = df3.merge(right=df2[this_column_sublist], how="inner", on="Name")

Like I said, it's an ugly workaround, but it just might work.
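
The same idea can be wrapped in a function and checked against a plain merge on small data before trying it on the full frames. This is a minimal sketch, assuming 'Name' values are unique in df2 (with duplicate keys, the repeated merges would multiply rows). The function name merge_in_column_chunks, its parameters, and the toy frames are hypothetical.

```python
import pandas as pd


def merge_in_column_chunks(left, right, key="Name", chunk_size=1000):
    """Inner-join `right` onto `left` one block of columns at a time.

    Assumes values in `key` are unique in `right`.
    """
    other_cols = [c for c in right.columns if c != key]
    # Start from the key-only join so only matching rows remain
    result = left.merge(right[[key]], how="inner", on=key)
    # Add the remaining columns of `right` in blocks of `chunk_size`
    for start in range(0, len(other_cols), chunk_size):
        block = [key] + other_cols[start:start + chunk_size]
        result = result.merge(right[block], how="inner", on=key)
    return result


# Hypothetical toy frames: df2 has 5 extra columns, merged 2 at a time
df1 = pd.DataFrame({"Name": ["a", "b", "c"], "x": [1, 2, 3]})
df2 = pd.DataFrame(
    {"Name": ["b", "c", "d"], **{f"c{i}": [i, i + 10, i + 20] for i in range(5)}}
)

chunked = merge_in_column_chunks(df1, df2, key="Name", chunk_size=2)
direct = df1.merge(df2, how="inner", on="Name")

# On data this small, the chunked result should match the direct merge
print(chunked.equals(direct))
```

Comparing against the direct merge on a small subset is a cheap way to confirm that the chunking logic is right before running it on the full data.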

Inner join with huge dataframes (~2 million columns)
