Question

I have a dataframe like this:

>>> import pandas as pd

>>> pd.read_csv('csv/10_no_headers_with_com.csv')
                  //field  field2
0   //first field is time     NaN
1                 132605     1.0
2                 132750     2.0
3                 132772     3.0
4                 132773     4.0
5                 133065     5.0
6                 133150     6.0

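I want to add another field that says whether the first value of the first field starts with the comment characters //. So far I have something like this: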

# may not have a heading value, so use the index not the key
df[0].str.startswith('//')

What is the correct way to add a new column with this value, so that the result looks something like:

>>> pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
                       0       1       _starts_with_comment
0                 //field  field2       True
1  //first field is time     NaN       True
2                 132605       1       False
3                 132750       2       False
4                 132772       3       False

Answer 1

One approach is to use pd.to_numeric, assuming that any non-numeric data in the first column must indicate a comment:

df = pd.read_csv('csv/10_no_headers_with_com.csv', header=None)
df['_starts_with_comment'] = pd.to_numeric(df[0], errors='coerce').isnull()

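Note that mixing types like this within a series is strongly discouraged. Your first two columns will no longer support vectorized operations, because they will be stored as object dtype series. You lose some of the main benefits of pandas.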
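To make the dtype point concrete, here is a minimal sketch. The inline `raw` string is a hypothetical stand-in for the file in the question, assumed to be comma-separated. It flags the comment rows with pd.to_numeric, shows that the mixed columns are not numeric, and then recovers numeric columns by dropping the flagged rows.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'csv/10_no_headers_with_com.csv' (assumed comma-separated).
raw = "//field,field2\n//first field is time,\n132605,1\n132750,2\n132772,3\n"

df = pd.read_csv(io.StringIO(raw), header=None)

# Non-numeric values in column 0 become NaN, so isnull() marks the comment rows.
df["_starts_with_comment"] = pd.to_numeric(df[0], errors="coerce").isnull()

# Columns 0 and 1 hold text mixed with numbers, so neither has a numeric dtype.
print(df.dtypes)

# Dropping the flagged rows and converting gives numeric columns again.
data = df.loc[~df["_starts_with_comment"], [0, 1]].apply(pd.to_numeric)
print(data.dtypes)
```

The exact non-numeric dtype reported for columns 0 and 1 may depend on the pandas version.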

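A better idea is to use the csv module to extract those attributes from the top of the file and store them as separate variables. The original answer links to an example of how to achieve this.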
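As a rough sketch of that idea (this is not the linked example, which is not reproduced here), the csv module can separate the leading comment rows from the data rows. The inline `raw` string, the comma separator, and the names `comment_rows`, `columns`, and `notes` are assumptions made for illustration.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for the file from the question (assumed comma-separated).
raw = "//field,field2\n//first field is time,\n132605,1\n132750,2\n132772,3\n"

comment_rows = []
data_rows = []
for row in csv.reader(io.StringIO(raw)):
    if not row:
        continue  # skip blank lines
    # Treat a row as a comment when its first cell starts with "//".
    if row[0].startswith("//"):
        comment_rows.append(row)
    else:
        data_rows.append(row)

# Assumption: the first comment row carries the column names.
columns = [comment_rows[0][0][2:]] + comment_rows[0][1:]
# Any remaining comment rows are kept as free-text notes.
notes = [row[0][2:] for row in comment_rows[1:]]

# Only data rows reach the dataframe, so both columns convert to numbers.
df = pd.DataFrame(data_rows, columns=columns).apply(pd.to_numeric)
print(columns, notes)
print(df.dtypes)
```

Keeping the comments in plain Python variables means the dataframe itself only ever holds numeric data.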

Answer 2

What's wrong with simply assigning to a new column?

df['comment_flag'] = df[0].str.startswith('//')

Or do you really have mixed-type columns, as jpp mentioned?
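A self-contained version of that assignment might look like the sketch below. The inline `raw` string is a hypothetical stand-in for the file, with one row deliberately missing its first value. Passing `na=False` is an addition not in the answer: it makes missing values come out as False instead of NaN, so the new column stays boolean.

```python
import io

import pandas as pd

# Hypothetical data; the last row has an empty first field.
raw = "//field,field2\n//first field is time,\n132605,1\n,2\n"

df = pd.read_csv(io.StringIO(raw), header=None)

# na=False maps missing values to False rather than NaN.
df["comment_flag"] = df[0].str.startswith("//", na=False)
print(df)
```

Note that the `.str` accessor is only available on text columns, so this relies on column 0 containing at least one non-numeric value.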

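Edit:
I'm not quite sure, but from your comments I get the impression that you don't really need an extra comment flag column. In case you want to load the data without the comments into a dataframe, but still use the field names somewhat hidden in the comment header as column names, you might want to check this out:
So, based on this text file: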

//field  field2
//first field is time     NaN
132605     1.0
132750     2.0
132772     3.0
132773     4.0
133065     5.0
133150     6.0

You can do this:

cmt = '//'

header = []
with open(textfilename, 'r') as f:
    for line in f:
        if line.startswith(cmt):
            header.append(line)
        else:                      # leave that out if collecting all comments of entire file is ok/wanted
            break
print(header)
# ['//field  field2\n', '//first field is time     NaN\n']

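This way you can prepare whatever you want to use, e.g. for the column names.
Getting the names from the first header line and using them for the pandas import would look like this: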

nms = header[0][2:].split()
df = pd.read_csv(textfilename, comment=cmt, names=nms, sep=r'\s+', engine='python')

    field  field2                                           
0  132605     1.0                                         
1  132750     2.0                                       
2  132772     3.0                                      
3  132773     4.0                                       
4  133065     5.0                                       
5  133150     6.0
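The two steps above can be combined into one runnable sketch. It uses an in-memory string instead of a file, and it swaps in `skiprows` for the `comment` argument so that it does not depend on how pandas handles a two-character comment marker. Both of those choices are assumptions of this sketch, not part of the answer.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of the whitespace-separated text file.
text = (
    "//field  field2\n"
    "//first field is time     NaN\n"
    "132605     1.0\n"
    "132750     2.0\n"
)
cmt = "//"

# Collect the leading comment lines only.
header = []
for line in text.splitlines():
    if not line.startswith(cmt):
        break
    header.append(line)

# Column names come from the first comment line, minus the marker.
names = header[0][len(cmt):].split()

# Skip the comment lines by count and split the rest on runs of whitespace.
df = pd.read_csv(io.StringIO(text), skiprows=len(header), names=names, sep=r"\s+")
print(df)
```

Skipping by line count only works when all comment lines sit at the top of the file, which matches the early `break` in the loop.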

Answer 3

Try this:

import pandas as pd
import numpy as np

df.loc[:,'_starts_with_comment'] = np.where(df[0].str.startswith(r'//'), True, False)
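As a small check on this approach, the sketch below uses a hypothetical three-row frame. It suggests that when column 0 has no missing values, `np.where(mask, True, False)` produces the same values as the boolean mask itself, so assigning the mask directly would be equivalent.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with an integer column label, as with header=None.
df = pd.DataFrame({0: ["//field", "//first field is time", "132605"]})

mask = df[0].str.startswith("//")
via_where = np.where(mask, True, False)

# Both routes give the same boolean values here.
print((via_where == mask.to_numpy()).all())

df["_starts_with_comment"] = mask
```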

Add a new column to a dataframe based on the first value in the row
