Question

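I want to take only the data in the 4th column from all of my csv files and write it into a single file. Each 4th column has a unique header name made up of the root folder name plus the csv name, e.g. FolderA1.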

FolderA /

1.csv |INFO  INFO  INFO  FolderA1  INFO
       Apple Apple Apple Orange    Apple

2.csv |INFO  INFO  INFO  FolderA2 INFO
       Apple Apple Apple Cracker  Apple

3.csv |INFO  INFO  INFO  FOLDERA3 INFO
       Apple Apple Apple Orange  Apple

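How can I filter the 4th-column data into a single .xlsx file, and either put the next folder's csv files on a new sheet or keep them separate from the previous folder's csv files?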

concentrated.xlsx | FOLDERA1 FOLDERA2 FOLDERA3   FOLDERB1 FOLDERB2 FOLDERB3
                    ORANGE   CRACKER   ORANGE    ORANGE   CRACKER  ORANGE

Answer 1

I would use the usecols parameter of pandas.read_csv.

import pandas as pd

def read_4th(fn):
    # sep=r'\s+' splits on runs of whitespace; usecols=[3] keeps only the 4th column
    return pd.read_csv(fn, sep=r'\s+', usecols=[3])

files = ['./1.csv', './2.csv', './3.csv']

big_df = pd.concat([read_4th(fn) for fn in files], axis=1)

big_df.to_excel('./mybigdf.xlsx')
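
To see what usecols=[3] selects, here is a minimal sketch that reads the question's 1.csv sample from an in-memory string instead of from disk. The sample text and variable names are illustrative assumptions, and sep=r"\s+" is used to split on whitespace.

```python
import io

import pandas as pd

# The question's 1.csv sample, held in memory (illustrative data).
sample = (
    "INFO  INFO  INFO  FolderA1  INFO\n"
    "Apple Apple Apple Orange    Apple\n"
)

# usecols=[3] is a zero-based position, so it keeps only the 4th column.
fourth = pd.read_csv(io.StringIO(sample), sep=r"\s+", usecols=[3])

print(list(fourth.columns))
print(fourth.iloc[0, 0])
```

Because the column is picked by position, the repeated INFO headers do not get in the way.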

For multiple folders, use glob.

Suppose you have two folders, 'FolderA' and 'FolderB', both inside the './' folder, and you want all the csv files from both.

from glob import glob

files = glob('./*/*.csv')

Then run the rest as described above.
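
The question also asks for each folder's columns to go on their own sheet, which the snippet above does not cover. Below is a minimal sketch of one way to do that, assuming whitespace-delimited files laid out as ./FolderA/1.csv, ./FolderB/1.csv and so on. The function names fourth_columns_by_folder and write_sheets are hypothetical, and writing .xlsx needs an Excel engine such as openpyxl to be installed.

```python
from collections import defaultdict
from pathlib import Path

import pandas as pd


def fourth_columns_by_folder(root="."):
    """Return {folder name: DataFrame of 4th columns} for root/*/*.csv."""
    columns = defaultdict(list)
    for path in sorted(Path(root).glob("*/*.csv")):
        # usecols=[3] keeps only the 4th column of each file.
        col = pd.read_csv(path, sep=r"\s+", usecols=[3])
        columns[path.parent.name].append(col)
    # Place each folder's columns side by side, one frame per folder.
    return {name: pd.concat(cols, axis=1) for name, cols in columns.items()}


def write_sheets(frames, out_path="concentrated.xlsx"):
    """Write one worksheet per folder, named after the folder."""
    with pd.ExcelWriter(out_path) as writer:
        for name, frame in frames.items():
            frame.to_excel(writer, sheet_name=name, index=False)
```

Calling write_sheets(fourth_columns_by_folder('.')) would then produce one sheet per folder in a single workbook.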

Answer 2

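The other answers suggest Pandas, which will certainly work, but if you are looking for a solution that uses only the Python standard library, you can try the csv module and iterators.

One caveat: depending on how many files you need to combine, you may run into memory limits. That aside, here is one approach.

Basic Python libraries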

import csv
from glob import glob
from itertools import zip_longest

# Use glob to collect the CSV files. Adjust the pattern according to your need
# (e.g. '*/*.csv' for files one folder down). A list is used instead of a
# generator so the files can still be closed at the end, and the output file
# is skipped in case it is left over from an earlier run.
input_files = [open(file_path, newline='')
               for file_path in sorted(glob('*.csv'))
               if file_path != 'output.csv']

# Using a generator, we can wrap all the CSV files in reader instances
input_readers = (csv.reader(input_file) for input_file in input_files)

try:
    with open('output.csv', 'w', newline='') as output_file:
        output_writer = csv.writer(output_file)

        # zip_longest will return a tuple of the next value
        # for all the iterables passed as parameters
        # In this case, this means the next row for all the input_readers;
        # a reader that has run out of rows contributes None
        for rows in zip_longest(*input_readers):

            # We extract the fourth column in all the rows
            # Note that this presumes that all files have a fourth column.
            # Some error checking/handling might be required if
            # you are not sure that's the case
            fourth_columns = [row[3] if row is not None else ''
                              for row in rows]

            # Write to the output the row that is all the
            # fourth columns for all the readers
            output_writer.writerow(fourth_columns)
finally:
    # Clean up the opened files
    for input_file in input_files:
        input_file.close()

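By using generators, you keep the amount of data loaded into memory at any one time to a minimum while still taking a very Pythonic approach to the problem.

The glob module makes it easier to load multiple files that follow a known pattern, which seems to be your case. If it suits you better, feel free to replace it with another form of file lookup, such as os.walk.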
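
If the folders are nested more deeply than a single glob pattern can express, a directory walk is one alternative. Here is a minimal sketch using os.walk, the Python 3 counterpart of os.path.walk; the function name find_csv_files is hypothetical.

```python
import os


def find_csv_files(root="."):
    """Yield paths of .csv files under root, in a stable sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting in place makes os.walk descend into folders in order.
        dirnames.sort()
        for filename in sorted(filenames):
            if filename.lower().endswith(".csv"):
                yield os.path.join(dirpath, filename)
```

The generator can be used wherever the list of paths returned by glob is used above.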

Answer 3

Something like this should work:

import pandas as pd

input_file_paths = ['1.csv', '2.csv', '3.csv']

dfs = (pd.read_csv(fname) for fname in input_file_paths)

master_df = pd.concat(
    (df[[c for c in df.columns if c.lower().startswith('folder')]]
        for df in dfs), axis=1)

master_df.to_excel('smth.xlsx')

The df[[c for c in df.columns if c.lower().startswith('folder')]] line is there because the folder columns in your example are named inconsistently (FolderA1 vs. FOLDERA3).
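
The filter is an ordinary list comprehension over the column names, so its effect can be checked on a plain list of headers without reading any files. Here is a small sketch using the header spellings from the question's samples; the helper name folder_columns is hypothetical.

```python
def folder_columns(columns):
    """Keep the names that start with 'folder', ignoring case."""
    return [c for c in columns if c.lower().startswith("folder")]


# Header rows as they appear in the question's 1.csv and 3.csv samples.
header_1 = ["INFO", "INFO", "INFO", "FolderA1", "INFO"]
header_3 = ["INFO", "INFO", "INFO", "FOLDERA3", "INFO"]

print(folder_columns(header_1))
print(folder_columns(header_3))
```

Lower-casing each name before the comparison is what lets both spellings match.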

Get one specific column from multiple csv files and merge them into one
