Merging three columns into one column of a CSV file with Python and pandas

Time: 2018-05-13 13:46:49

Tags: python pandas csv dataframe

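Hello, I am trying to combine several existing columns into one new column and then delete the three original columns from a CSV file. I have been trying to do this with pandas, but without much luck. I am very new to Python.

My code first combines multiple CSV files in the same directory and then tries to manipulate the columns. The first step works, and I get an output.csv with the combined data, but merging the columns does not.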

In effect, this is what I am trying:

import glob
import pandas as pd

interesting_files = glob.glob("*.csv")

header_saved = False
with open('output.csv','wb') as fout:
    for filename in interesting_files:
        with open(filename) as fin:
            header = next(fin)
            if not header_saved:
                fout.write(header)
                header_saved = True
            for line in fin:
                fout.write(line)

df = pd.read_csv("output.csv")
df['HostAffected']=df['Host'] + "/" + df['Protocol'] + "/" + df['Port']
df.to_csv("newoutput.csv")

The data looks something like this:

Host,Protocol,Port
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,49707
10.0.0.10,tcp,49672
10.0.0.10,tcp,49670

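However, the CSV also contains other columns.

I am not a coder, I am just trying to solve a problem, and any help is much appreciated.

2 answers:

Answer 0: (score: 2)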

I think we have three options.

Timings:

10 loops, best of 3: 39.7 ms per loop
10 loops, best of 3: 35.9 ms per loop
10 loops, best of 3: 162 ms per loop

import io

import pandas as pd

data = '''\
ID,Host,Protocol,Port
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,49707
1,10.0.0.10,tcp,49672
1,10.0.0.10,tcp,49670'''

# Recreates a sample dataframe (pd.compat.StringIO no longer exists; use io.StringIO)
df = pd.read_csv(io.StringIO(data))

cols = ['Host','Protocol','Port']
# Cast every value to str first: Port is read as an integer and cannot be joined directly
newcol = ['/'.join(i) for i in df[cols].astype(str).values]
# Add the merged column, then drop the three originals (other columns such as ID stay)
df = df.assign(HostAffected=newcol).drop(columns=cols)
print(df)

Slowest or not, I think the approach above will be the most readable one for you. It returns:

   ID         HostAffected
0   1    10.0.0.10/tcp/445
1   1    10.0.0.10/tcp/445
2   1    10.0.0.10/tcp/445
3   1    10.0.0.10/tcp/445
4   1    10.0.0.10/tcp/445
5   1    10.0.0.10/tcp/445
6   1    10.0.0.10/tcp/445
7   1  10.0.0.10/tcp/49707
8   1  10.0.0.10/tcp/49672
9   1  10.0.0.10/tcp/49670
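The timings in this answer do not say which code each line belongs to. As a rough sketch, assuming the three options are plain string concatenation, a list comprehension, and a row-wise apply (that mapping is a guess, and the function names below are made up for illustration), they could be written and compared like this:

```python
import io
import timeit

import pandas as pd

# Small sample shaped like the question's CSV; ID stands in for the "other columns".
data = """ID,Host,Protocol,Port
1,10.0.0.10,tcp,445
2,10.0.0.10,tcp,49707
3,10.0.0.10,tcp,49672
"""
df = pd.read_csv(io.StringIO(data))
cols = ["Host", "Protocol", "Port"]


def concat_columns(frame):
    # Vectorized string concatenation; Port is parsed as an integer, so cast it first.
    return frame["Host"] + "/" + frame["Protocol"] + "/" + frame["Port"].astype(str)


def join_rows(frame):
    # List comprehension over the rows of the three columns, all cast to str.
    return ["/".join(row) for row in frame[cols].astype(str).values]


def apply_rows(frame):
    # Row-wise apply: calls the Python lambda once per row.
    return frame[cols].astype(str).apply(lambda row: "/".join(row), axis=1)


# All three should produce the same values.
results = [list(concat_columns(df)), join_rows(df), list(apply_rows(df))]
assert results[0] == results[1] == results[2]
print(results[0])

# Hypothetical timing harness; absolute numbers depend on machine and data size.
for func in (concat_columns, join_rows, apply_rows):
    seconds = timeit.timeit(lambda: func(df), number=100)
    print(f"{func.__name__}: {seconds:.4f} s for 100 runs")
```

The cast to str is what the code in the question is missing: read_csv infers Port as an integer column, and a string column cannot be concatenated with an integer column directly.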

Answer 1: (score: 0)

Here is how you can do it:

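import io

import pandas as pd

dt = """Host,Protocol,Port
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,49707
10.0.0.10,tcp,49672
10.0.0.10,tcp,49670"""

# pd.compat.StringIO no longer exists; io.StringIO does the same job
tdf = pd.read_csv(io.StringIO(dt))
# format() converts each value to text, so the integer Port needs no explicit cast
tdf['HostsAffected'] = tdf.apply(lambda x: '{}/{}/{}'.format(x['Host'] , x['Protocol'] , x['Port']), axis=1)
# Keeps only the new column; every other column is dropped
tdf = tdf[['HostsAffected']]
tdf.to_csv(<path-to-save-csv-file>)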

This will be the output:

    HostsAffected
0   10.0.0.10/tcp/445
1   10.0.0.10/tcp/445
2   10.0.0.10/tcp/445
3   10.0.0.10/tcp/445
4   10.0.0.10/tcp/445
5   10.0.0.10/tcp/445
6   10.0.0.10/tcp/445
7   10.0.0.10/tcp/49707
8   10.0.0.10/tcp/49672
9   10.0.0.10/tcp/49670

If you are reading the CSV from a file, edit the read_csv line as follows:

tdf = pd.read_csv(<path-to-the-file>)
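Putting the pieces together for the file-based case in the question, here is a minimal end-to-end sketch. It assumes every input file has Host, Protocol and Port columns; the function name merge_host_columns, the file names and the sample rows are made up for illustration. Unlike the snippet above, it keeps the other columns and drops only the three originals.

```python
import glob
import os
import tempfile

import pandas as pd


def merge_host_columns(input_paths, output_path):
    """Combine CSV files, merge Host/Protocol/Port into HostAffected, drop the originals."""
    # Stack all input files into one frame; ignore_index renumbers the rows from 0.
    df = pd.concat([pd.read_csv(path) for path in input_paths], ignore_index=True)
    cols = ["Host", "Protocol", "Port"]
    # Cast to str before concatenating, because Port is read as an integer.
    df["HostAffected"] = (
        df["Host"].astype(str) + "/" + df["Protocol"].astype(str) + "/" + df["Port"].astype(str)
    )
    df = df.drop(columns=cols)
    # index=False keeps the row index out of the written file.
    df.to_csv(output_path, index=False)
    return df


# Demonstration with two tiny made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name, port in (("a.csv", 445), ("b.csv", 49707)):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("ID,Host,Protocol,Port\n")
            f.write(f"1,10.0.0.10,tcp,{port}\n")
    paths = sorted(glob.glob(os.path.join(tmp, "*.csv")))
    out_path = os.path.join(tmp, "merged_out.csv")
    result = merge_host_columns(paths, out_path)
    with open(out_path) as f:
        written = f.read()

print(result)
```

One thing to watch when adapting this: if the output file is written into the same directory that glob("*.csv") scans, it will be picked up as an input on the next run, so it is safer to write it somewhere else or filter it out of the list.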