Question

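I have a bunch (15,000+) of small data frames that I need to concatenate column-wise to make one very large (100,000x1000) data frame in pandas. I have two (obvious) concerns: speed and memory usage.

The following is an approach I have seen highly endorsed on Stack Overflow.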

dfList = [df1, df2, ..., df15000]  # made by appending in a for loop
df_out = pd.concat(dfList, axis=1)

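This works great for speed, and the code is simple and easy to understand. However, it uses a large amount of memory. My understanding is that pandas' concat function works by making a new big data frame and then copying all the data over, essentially doubling the amount of memory the program consumes.

How do I avoid this large memory overhead with minimal loss of speed?

I tried adding the columns one by one to the first df in a for loop. Great for memory (1 + 1/15,000), terrible for speed.

Then I came up with the following. I replaced the list with a deque and concatenated piecewise. It saves memory (4.1 GB vs. 5.4 GB on the most recent run) at a manageable cost in speed (<30 seconds added to a script that takes 5-6 minutes in total), but I can't figure out why this saves memory.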

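The deque-based piecewise concatenation described above might look roughly like the following. This is a minimal sketch, not the asker's code: the function name `concat_piecewise`, the `group_size` parameter, and the round-by-round processing (chosen here so that column order is preserved) are all assumptions made for illustration.

```python
import collections

import numpy as np
import pandas as pd


def concat_piecewise(frames, group_size=3):
    """Concatenate frames column-wise a few at a time (hypothetical helper)."""
    dq = collections.deque(frames)
    if not dq:
        raise ValueError("need at least one DataFrame")
    # Each round merges adjacent groups of up to `group_size` frames.
    # Rounds repeat until a single frame is left.
    while len(dq) > 1:
        merged = collections.deque()
        while dq:
            # range() is evaluated once, before any popleft() call.
            group = [dq.popleft() for _ in range(min(group_size, len(dq)))]
            merged.append(pd.concat(group, axis=1))
        dq = merged
    return dq.pop()


# Small usage example with made-up data: seven one-column frames.
frames = [pd.DataFrame({f"c{i}": np.arange(4) + i}) for i in range(7)]
out = concat_piecewise(frames)
print(out.shape)  # (4, 7)
```

Note that the deque only lets intermediate frames be released if nothing else still refers to them; in this example the `frames` list keeps every input alive.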

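If my understanding of the concat function is correct, the last step of this piecewise concatenation should still use 2x the memory. What makes this work? The speed and memory numbers I quoted above are specific to one run, but the general trend has been the same across many runs.

Apart from figuring out why the above works, I am also open to other suggestions on methodology.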

Answer 1

Just create the full-size DataFrame in advance:

df = pd.DataFrame(index=np.arange(0, N), columns=[...])

Then fill it in, one section at a time:

col = 0
for path in paths:
    part = pd.read_csv(path)
    df.iloc[:,col:col+part.shape[1]] = part
    col += part.shape[1]
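
A self-contained variation on the same preallocation idea is sketched below. It swaps in a plain NumPy buffer instead of an empty DataFrame, which is a different technique from the snippet above, and it assumes that every part has the same rows in the same order and that all values share one numeric dtype. The function name `assemble` and its parameters are hypothetical.

```python
import numpy as np
import pandas as pd


def assemble(parts, n_rows, n_cols, dtype=np.float64):
    """Copy each part into a preallocated buffer (hypothetical helper)."""
    buf = np.empty((n_rows, n_cols), dtype=dtype)
    names = []
    col = 0
    for part in parts:
        width = part.shape[1]
        # Positional copy: assumes rows of every part line up one to one.
        buf[:, col:col + width] = part.to_numpy()
        names.extend(part.columns)
        col += width
    if col != n_cols:
        raise ValueError(f"expected {n_cols} columns, got {col}")
    # copy=False asks pandas to wrap the buffer rather than copy it.
    return pd.DataFrame(buf, columns=names, copy=False)


# Usage with made-up data; a generator keeps only one part alive at a time.
parts = (pd.DataFrame({f"c{i}": np.arange(4, dtype=float) + i}) for i in range(5))
df = assemble(parts, n_rows=4, n_cols=5)
print(df.shape)  # (4, 5)
```

Because the parts arrive through a generator, each small frame can be discarded as soon as its values have been copied into the buffer.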
