Question

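I have a DataFrame with about 1 million rows and 45 columns.

For an upload process we now have to split it up, because the target system handles only 1000 entries per run. The system is fully automated, and we cannot change it. In the end, I need to export several CSV files.

But I don't know the best way to solve this in Python. Should I first split the DataFrame into several DataFrames, or export directly to CSV?

The important points are the following:

Every CSV needs the same first line, namely the first line of the DataFrame. After that there should be 1000 entries. The files must not contain a running counter, an ID, or anything similar.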

Answer 1

Have you tried the following:

const floor = Math.floor;
const divs = document.querySelectorAll('.container div');
const tileSize = 40;

const tiles = Array.prototype.reduce.call(divs, (a, t, i) => {
  const ai = floor(i / 3);
  return ((a[ai] = a[ai] || []), (a[ai][i % 3] = t), a);
}, []);

const unselect = () => divs.forEach(d => d.style.background = '');
const select = (r, c) => tiles[r] && tiles[r][c] && (tiles[r][c].style.background = 'red');

document.addEventListener('mousemove', (e) => {
  const x = e.pageX;
  const y = e.pageY;
  const Ix = x / Math.sqrt(3) + y - 60;
  const Iy = Ix - 2 * (y - 60);  
  const row = floor(Ix / tileSize);
  const col = floor(Iy / tileSize);
  unselect();
  select(row, col);
});

Answer 2

You can use:

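import numpy as np
import pandas as pd
# Demo data: write a 100-row frame to disk, then read it back in chunks of 10 rows
pd.DataFrame(np.random.rand(100, 5), columns=['A', 'B', 'C', 'D', 'E']).to_csv('bigfile.csv', index=False)
chunks = pd.read_csv('bigfile.csv', chunksize=10)
for n, chunk in enumerate(chunks):
    # header=True repeats the column names at the top of every output file
    chunk.to_csv(f'file_{n}.csv', index=False, header=True)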
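
If the DataFrame is already in memory, the round trip through bigfile.csv is not strictly needed. The following is a minimal sketch under that assumption; `export_chunks` and its `template` argument are hypothetical names chosen for illustration, not part of pandas.

```python
import pandas as pd


def export_chunks(df, template, chunk_size=1000):
    """Write df to several CSV files of at most chunk_size data rows each.

    export_chunks and template are illustrative names, not a pandas API.
    Every file starts with the column header row.
    """
    paths = []
    for n, start in enumerate(range(0, len(df), chunk_size)):
        path = template.format(n)
        # iloc slicing is end-exclusive, so no row is skipped or repeated
        df.iloc[start:start + chunk_size].to_csv(path, index=False, header=True)
        paths.append(path)
    return paths
```

It could be called as `export_chunks(df, "part_{}.csv")`; the last file simply holds whatever rows remain.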

Answer 3

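Just my opinion, but I would simply export the whole DataFrame to one big CSV file, because splitting a text file (which is what a CSV file is) is trivial in Python and limited only by disk speed.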

The Python code is as follows:

def split(infile, outtemplate, maxlines, first=0):
    """Splits an input file in chunks of size maxlines.
    The initial line will be repeated in each of the output files.
    Params:
        infile:      path of the input file
        outtemplate: template for the paths of the output file; will use format to insert
                       the chunk number
        maxlines:    maximum number of lines per chunk
        first:       number of the first output file
    """
    with open(infile, "rb") as fdin:  # use binary to not worry for encoding
        header_line = next(fdin)      # store the initial line 
        fdout = None
        for line in fdin:
            if fdout is None:         # if no output file create one 
                numlig = 0
                fdout = open(outtemplate.format(first), "wb")
                fdout.write(header_line)  # do not forget the header
            fdout.write(line)
            numlig += 1
            if numlig >= maxlines:    # limited to maxlines
                fdout.close()
                fdout = None          # prepare for next chunk
                first += 1
    if fdout is not None:             # nothing is open if the last chunk was full
        fdout.close()                 # close the current output file

It can be used as:

split("/path/to/initial.csv", "/path/to/resul_{}.csv", 1000)

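Beware: this assumes that every record occupies exactly one line. If some fields can contain newlines, do not use it; use the csv module instead.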
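
For the case where fields may contain newlines, a record-based split with the csv module could look like the sketch below. It assumes a comma-delimited file in UTF-8; `split_csv` is a hypothetical name, and the output is re-quoted by `csv.writer`, so it may not be byte-identical to the input.

```python
import csv


def split_csv(infile, outtemplate, maxrows, first=0, encoding="utf-8"):
    """Split a CSV into files of at most maxrows records, repeating the header.

    Unlike a line-based split, csv.reader keeps a quoted field that contains
    newlines together as one record. split_csv is an illustrative name.
    """
    written = []
    fdout = None
    # newline="" lets the csv module handle line endings itself
    with open(infile, newline="", encoding=encoding) as fdin:
        reader = csv.reader(fdin)
        header = next(reader, None)
        if header is None:            # empty input: nothing to write
            return written
        try:
            for count, row in enumerate(reader):
                if count % maxrows == 0:      # start a new output file
                    if fdout is not None:
                        fdout.close()
                    path = outtemplate.format(first + len(written))
                    fdout = open(path, "w", newline="", encoding=encoding)
                    writer = csv.writer(fdout)
                    writer.writerow(header)   # same first line in every file
                    written.append(path)
                writer.writerow(row)
        finally:
            if fdout is not None:
                fdout.close()
    return written
```

Here `maxrows` counts data records, so each file holds the header plus up to `maxrows` records.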

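Beware (2): with this code, the resulting files are 1001 lines long: the header line followed by 1000 data lines. Use split(..., ..., 999) to make the files exactly 1000 lines long.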
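
The line-count arithmetic in this note can be checked with a small helper. This is only a sketch: `plan_split` is a hypothetical name, and the figure of exactly 1,000,000 data rows is an assumption for illustration (the question only says "about 1 million").

```python
import math


def plan_split(total_data_rows, lines_per_file, header_lines=1):
    """Return (number_of_files, data_rows_per_file) for a total line budget.

    lines_per_file counts every line in an output file, header included.
    """
    data_rows_per_file = lines_per_file - header_lines
    if data_rows_per_file <= 0:
        raise ValueError("lines_per_file must exceed header_lines")
    n_files = math.ceil(total_data_rows / data_rows_per_file)
    return n_files, data_rows_per_file
```

Under that assumption, `plan_split(1_000_000, 1001)` gives `(1000, 1000)`, while a strict budget of 1000 lines per file, `plan_split(1_000_000, 1000)`, gives `(1002, 999)`.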

Beware (3): untested code...

Answer 4

Just do the following:

import pandas as pd

# Load your CSV
df = pd.read_csv("Your-file.csv")

# Write out consecutive blocks of 1000 rows; the last block may be shorter
i = 0
while i < len(df):
    j = i + 1000
    df.iloc[i:j].to_csv('output' + str(j) + '.csv', index=False, sep=";")
    i = j  # slices are end-exclusive, so the next block starts at j

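Since you said the first row should be the same in every file, I take it to be the header, which to_csv writes to every file by default.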
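
Whichever approach is used, the two requirements from the question (identical first line, at most 1000 entries per file) can be verified after the export. A minimal sketch, assuming one record per line and UTF-8 files; `check_parts` is a hypothetical helper name:

```python
def check_parts(paths, max_data_rows=1000, encoding="utf-8"):
    """Return True if all files share one first line and stay within the limit.

    Counts physical lines, so it assumes no field contains a newline.
    """
    headers = set()
    for path in paths:
        with open(path, encoding=encoding) as f:
            headers.add(f.readline())          # first line of this file
            data_lines = sum(1 for _ in f)     # everything after the header
        if data_lines > max_data_rows:
            return False
    return len(headers) <= 1
```

For example, `check_parts(glob.glob("output*.csv"))` (with `import glob`) would check the files written by the loop above.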

Export a DataFrame to different CSV files, each containing 1000 rows

4 answers: