Reading multiple files with pandas

Time: 2019-04-17 16:05:21

Tags: python pandas dataframe

I want to read multiple files at once. I have data in two files, as shown below:

Data:

123.22.21.11,sid
112.112.11.1,john
110.11.23.23,jenny
122.23.21.13,ankit

Data1:

145.123.11.1, Joaquin

I tried several of the answers by following this link. Below is my code:

df = pd.concat(map(pd.read_csv, glob.glob(os.path.join(" ", "/home/cloudera/Desktop/sample/*"))))

After running this code, the output is as follows:

>>> df
   123.22.21.11 145.123.11.1 Joaquin    sid
0  112.112.11.1          NaN     NaN    NaN
1  110.11.23.23          NaN     NaN    NaN
2  122.23.21.13          NaN     NaN    NaN
0  112.112.11.1          NaN     NaN   john
1  110.11.23.23          NaN     NaN  jenny
2  122.23.21.13          NaN     NaN  ankit

But I need the IP addresses and names from both files in their own separate columns, not spread across NaN-filled columns like this. What should I do?

2 Answers:

Answer 0 (score: 1):

Your problem is that pd.read_csv() expects column headers/names by default, and concat uses them to align the columns. You can pass the header=None keyword argument into map by using partial.

import glob
import os
import pandas as pd
from functools import partial

# Bind header=None so every file is read without treating its first row as column names
mapfunc = partial(pd.read_csv, header=None)
df = pd.concat(map(mapfunc, glob.glob(os.path.join(" ", "/home/cloudera/Desktop/sample/*"))))

Output:

              0         1
0  123.22.21.11       sid
1  112.112.11.1      john
2  110.11.23.23     jenny
3  122.23.21.13     ankit
0  145.123.11.1   Joaquin

You can read more about partial here: Using map() function with keyword arguments
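As a quick, standalone illustration of the same idea (this snippet is not from the original answer), partial binds the keyword argument up front so that map only has to supply the single positional argument:

from functools import partial

# int() normally needs the base passed along with the string;
# partial fixes base=2 so map can call it with one argument.
parse_binary = partial(int, base=2)
print(list(map(parse_binary, ["10", "110", "1011"])))   # [2, 6, 11]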

Edit, as requested:

It's not pretty, but you can loop through the files in the directory and use a counter to process them in batches of 'interval' files at a time.

# Initialize Variables
fpath = "C:/Users/5188048/Desktop/example/"
interval = 5
filenames = []

# loop through files in directory
for i, j in enumerate(os.listdir(fpath)):

    # append filenames to list, initialized previously
    filenames.append(j)

    # for every interval'th file, perform this...
    if (i+1)%interval==0:

        # use first file to initialize dataframe
        temp_df = pd.read_csv(fpath+filenames[0], header=None)

        # loop through remaining files
        for file in filenames[1:]:

            # concatenate additional files to dataframe
            temp_df = pd.concat([temp_df, pd.read_csv(fpath+file, header=None)], ignore_index=True)

        # do your manipulation here, example reset column names
        temp_df.columns = ['IP_Address', 'Name']

        # Generate outfile variable name & path
        out_file = fpath+'out_file_' + str(int((i+1)/interval)) + '.csv'

        # write outfile to csv
        temp_df.to_csv(out_file, index=False)

        # reset variable
        filenames = []

    else:

        pass
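With interval = 5, this writes one CSV per batch (out_file_1.csv, out_file_2.csv, and so on), each holding the concatenated rows of five input files; note that any files left over after the last full batch accumulate in filenames but are never written out.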

Answer 1 (score: 1):

I think it's easier and more readable to split this into a couple of steps. You also want to explicitly tell pandas that the files have no header row by passing header=None to pd.read_csv.
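The code for this answer isn't included in this excerpt, but a minimal sketch of the step-by-step approach it describes could look like the following (the column names "ip" and "name" are illustrative, not from the original post):

import glob
import os
import pandas as pd

# Step 1: collect every file in the sample directory
files = glob.glob(os.path.join("/home/cloudera/Desktop/sample", "*"))

# Step 2: read each file, telling pandas there is no header row
frames = [pd.read_csv(f, header=None, names=["ip", "name"]) for f in files]

# Step 3: concatenate everything into one two-column DataFrame
df = pd.concat(frames, ignore_index=True)
print(df)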