How do I insert a column into a DataFrame when the column's values come from a different file?

Asked: 2017-08-29 01:32:04

Tags: python pandas

I am currently reading from one file and generating this file (output.txt):

Atom nVa avgppm stddev delta
1.H1' 2 5.73649 0.00104651803616 1.0952e-06
1.H2' 1 4.85438
1.H8 1 8.05367
10.H1' 3 5.33823 0.136655138213 0.0186746268
10.H2' 1 4.20449
10.H5 3 5.27571333333 0.231624986634 0.0536501344333
10.H6 5 7.49485 0.0285124165935 0.0008129579

This is the code that generates the file (the values are read from a text file):

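import pandas as pd

# expAtoms: the input text file (space-separated, no header row)
df = pd.read_csv(expAtoms, sep=' ', header=None)
df.columns = ["Atom", "ppm"]
# per-atom count, mean, standard deviation and variance of ppm
gb = (df.groupby("Atom", as_index=False)
        .agg({"ppm": ["count", "mean", "std", "var"]})
        .rename(columns={"count": "nVa", "mean": "avgppm", "std": "stddev", "var": "delta"}))

gb.head()

# drop the top column level; the "Atom" column is left with an empty name
gb.columns = gb.columns.droplevel()
gb = gb.rename(columns={"": "Atom"})

gb.to_csv("output.txt", sep=" ", index=False)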

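Between my nVa column and my avgppm column, I want to insert another column called predppm. I want to take its values from a file called file.txt, which looks like this: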

5.H6 7.72158 0.3
6.H6 7.70272 0.3
7.H8 8.16859 0.3
1.H1' 7.65014 0.3
9.H8 8.1053 0.3
10.H6 7.5231 0.3

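How can I check whether a value in the first column of file.txt is also a value in the first column of output.txt and, if it is, insert the value from the second column of file.txt into a column between the nVa and avgppm columns of my output file?

For example, 1.H1' is in both output.txt and file.txt, so I want to create a column named predppm in output.txt whose value for the 1.H1' atom is 7.65014 (taken from the second column of file.txt).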

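I think I understand how to add columns, but only with functions that I can use with groupby. I don't know how to insert an arbitrary column into the output.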

1 Answer:

Answer 0: (score: 1)

The easiest way is to set the index on both pandas.DataFrames. Pandas has good logic for matching on the index.

from io import StringIO
import pandas as pd

# if python2, do:
# data = u"""\
data = """\
Atom nVa avgppm stddev delta
1.H1' 2 5.73649 0.00104651803616 1.0952e-06
1.H2' 1 4.85438
1.H8 1 8.05367
10.H1' 3 5.33823 0.136655138213 0.0186746268
10.H2' 1 4.20449
10.H5 3 5.27571333333 0.231624986634 0.0536501344333
10.H6 5 7.49485 0.0285124165935 0.0008129579
"""

# if python2, do:
# other_data = u"""\
other_data = """\
5.H6 7.72158 0.3
6.H6 7.70272 0.3
7.H8 8.16859 0.3
1.H1' 7.65014 0.3
9.H8 8.1053 0.3
10.H6 7.5231 0.3
"""

# set up these strings so they can be read by pd.read_csv
# (not necessary if these are actual files on disk)
data_file = StringIO(data)
other_data_file = StringIO(other_data)

# don't say header=None because the first row has the column names
df = pd.read_csv(data_file, sep=' ')
# set the index to 'Atom'
df = df.set_index('Atom')

# header=None because the other_data doesn't have header info
other_df = pd.read_csv(other_data_file, sep=' ', header=None)
# set the column names since they're not specified in other_data
other_df.columns = ['Atom', 'predppm', 'some_other_field']
# set the index to 'Atom'
other_df = other_df.set_index('Atom')

# this will assign other_df['predppm'] to the correct rows,
# because pandas uses the index when assigning new columns
df['predppm'] = other_df['predppm']

print(df)
#         nVa    avgppm    stddev     delta  predppm
# Atom                                              
# 1.H1'     2  5.736490  0.001047  0.000001  7.65014
# 1.H2'     1  4.854380       NaN       NaN      NaN
# 1.H8      1  8.053670       NaN       NaN      NaN
# 10.H1'    3  5.338230  0.136655  0.018675      NaN
# 10.H2'    1  4.204490       NaN       NaN      NaN
# 10.H5     3  5.275713  0.231625  0.053650      NaN
# 10.H6     5  7.494850  0.028512  0.000813  7.52310

# if you want to return 'Atom' to being a column:
df = df.reset_index()
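
The code above appends predppm as the last column, while the question asks for it between nVa and avgppm. Here is a minimal sketch of one way to do that, assuming the Atom labels in file.txt are unique: DataFrame.insert places the column at a chosen position, and reindex lines the values up with the atoms in output.txt. The names output_txt, file_txt, pred and result are only illustrative, and the StringIO objects stand in for the real files.

```python
from io import StringIO

import pandas as pd

# Stand-ins for output.txt and file.txt, copied from the question.
output_txt = StringIO("""\
Atom nVa avgppm stddev delta
1.H1' 2 5.73649 0.00104651803616 1.0952e-06
1.H2' 1 4.85438
1.H8 1 8.05367
10.H1' 3 5.33823 0.136655138213 0.0186746268
10.H2' 1 4.20449
10.H5 3 5.27571333333 0.231624986634 0.0536501344333
10.H6 5 7.49485 0.0285124165935 0.0008129579
""")

file_txt = StringIO("""\
5.H6 7.72158 0.3
6.H6 7.70272 0.3
7.H8 8.16859 0.3
1.H1' 7.65014 0.3
9.H8 8.1053 0.3
10.H6 7.5231 0.3
""")

df = pd.read_csv(output_txt, sep=" ").set_index("Atom")
pred = pd.read_csv(
    file_txt,
    sep=" ",
    header=None,
    names=["Atom", "predppm", "some_other_field"],
).set_index("Atom")

# Position of avgppm; inserting at this position puts predppm right after nVa.
pos = df.columns.get_loc("avgppm")

# reindex keeps only the atoms present in df and gives NaN to atoms
# that have no entry in file.txt (assumes unique Atom labels in file.txt).
df.insert(pos, "predppm", pred["predppm"].reindex(df.index))

# Turn 'Atom' back into a regular column and show the space-separated result.
result = df.reset_index()
print(result.to_csv(sep=" ", index=False))
```

Passing a file name, for example result.to_csv("output.txt", sep=" ", index=False), writes the file to disk instead of returning a string. Atoms that do not appear in file.txt are written with an empty predppm field.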