Question

I have a large .csv file from which I need to extract information and add that information to additional columns. My csv looks like this:

    file_name,#,Date,Time,Temp (°C) ,Intensity
    trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,
    trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,
    trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,
    trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,
    trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,
    trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,
    trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,
    trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,
    trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,
    trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,

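I want to create two new columns from the data in the "file_name" column. I want to extract the one or two digits that follow the text "trap", and also the c or u, and use this data to create the new columns. After processing, the data would look like this: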

    file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
    trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
    trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
    trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
    trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
    trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
    trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
    trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12

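I suspect the way to do this is with awk and a regular expression, but I'm not sure how to write the regular expression. How can I extract part of one column and append it as additional columns?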

Answer 1

sed is enough for this:

sed -E '1s/.*/&,can_und,trap_no/; 2,$s/trap([0-9]+)([a-z]).*/&\2,\1/' file.csv

file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no    
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,u,11
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,u,11
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12
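
For readers who find the sed one-liner dense, here is a minimal Python sketch of the same two substitutions. It is not part of the original answer. The names `add_trap_columns`, `TRAP_RE` and `sample` are made up for illustration, and `\g<0>` stands in for sed's `&` (the whole match).

```python
import re

# Same pattern as the sed command: digits after "trap", then one lowercase letter,
# then the rest of the line.
TRAP_RE = re.compile(r"trap([0-9]+)([a-z]).*")

def add_trap_columns(lines):
    """Yield lines with can_und and trap_no appended, mirroring the sed command."""
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line_no == 1:
            # 1s/.*/&,can_und,trap_no/  (extend the header row)
            yield line + ",can_und,trap_no"
        else:
            # 2,$s/.../&\2,\1/  (whole match, then the letter, a comma, the number)
            yield TRAP_RE.sub(r"\g<0>\2,\1", line, count=1)

# A few rows copied from the question.
sample = [
    "file_name,#,Date,Time,Temp (°C) ,Intensity",
    "trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,",
    "trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,",
    "trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,",
]
result = list(add_trap_columns(sample))
for row in result:
    print(row)
```

Because each data row already ends in a comma, appending `\2,\1` directly after the match produces the `,,u,12` ending shown in the output above.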

Answer 2

With sed, this would be:

sed 's/trap\([[:digit:]]\+\)\(.\)\(.*\)$/trap\1\2\3\2,\1/' file

Use sed -i ... to modify the file in place.
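
If sed is not available, the in-place edit could be approximated in Python roughly as follows. This is only a sketch: `rewrite_in_place` and `append_trap_fields` are hypothetical helper names, and it assumes that replacing the file via a temporary copy in the same directory is acceptable. Like the sed command above, it leaves lines without "trap" (such as the header) unchanged.

```python
import os
import re
import tempfile

def rewrite_in_place(path, transform):
    """Apply transform to every line of path, then swap the result into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so that os.replace
    # does not have to move data across filesystems.
    with open(path, encoding="utf-8") as src, tempfile.NamedTemporaryFile(
        "w", encoding="utf-8", dir=directory, delete=False
    ) as dst:
        for line in src:
            dst.write(transform(line))
    os.replace(dst.name, path)

def append_trap_fields(line):
    # Python spelling of the sed substitution; $ matches just before the
    # trailing newline, so the new fields land at the end of the row.
    return re.sub(r"trap(\d+)(.)(.*)$", r"trap\1\2\3\2,\1", line, count=1)
```

It would be called as `rewrite_in_place("file.csv", append_trap_fields)`; keeping a backup of the original file first is a sensible precaution.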

Answer 3

A gawk approach:

awk -F, 'NR==1{ print $0,"can_und,trap_no" }
         NR>1{ match($1,/^trap([0-9]+)([a-z])/,a); print $0 a[2],a[1] }' OFS="," file

Output:

file_name,#,Date,Time,Temp (°C) ,Intensity,can_und,trap_no
trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,u,12
trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,u,12
trap12u_10733862_150809.txt,3,05/28/15,08:00:00.0,26.195,3,100.0,,u,12
trap11u_10733862_150809.txt,4,05/28/15,09:00:00.0,25.222,3,444.5,,u,11
trap11u_10733862_150809.txt,5,05/28/15,10:00:00.0,26.195,3,100.0,,u,11
trap11u_10733862_150809.txt,6,05/28/15,11:00:00.0,25.902,2,927.8,,u,11
trap11u_10733862_150809.txt,7,05/28/15,12:00:00.0,25.708,2,325.0,,u,11
trap12c_10733862_150809.txt,8,05/28/15,13:00:00.0,26.292,3,100.0,,c,12
trap12c_10733862_150809.txt,9,05/28/15,14:00:00.0,26.390,2,066.7,,c,12
trap12c_10733862_150809.txt,10,05/28/15,15:00:00.0,26.097,1,463.9,,c,12

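NR==1{ print $0,"can_und,trap_no" } - prints the header row with the two new column names appended
match($1,/^trap([0-9]+)([a-z])/,a) - matches the digits that follow the word trap and the suffix letter after them, storing them in a[1] and a[2] (the three-argument form of match is a gawk extension)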
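
To make the capture groups concrete, here is a small Python sketch of the same match. It is an illustration only: `split_trap_field` is a made-up name and the `trap7c` file name is a hypothetical single-digit case. `group(1)` and `group(2)` play the role of `a[1]` and `a[2]`.

```python
import re

# Same pattern as in the awk call, anchored at the start of the first field.
TRAP_FIELD = re.compile(r"^trap([0-9]+)([a-z])")

def split_trap_field(field):
    """Return (can_und, trap_no) for a file_name field, or None if it does not match."""
    m = TRAP_FIELD.match(field)
    if m is None:
        return None
    return m.group(2), int(m.group(1))

print(split_trap_field("trap12u_10733862_150809.txt"))    # ('u', 12)
print(split_trap_field("trap7c_10733862_150809.txt"))     # ('c', 7)
print(split_trap_field("  trap12u_10733862_150809.txt"))  # None
```

The last call shows one consequence of the `^` anchor: a field with leading whitespace does not match, so if the real file is indented the way the question's listing is, the field would need to be stripped first or the anchor dropped.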

Answer 4

Using the Python pandas reader, because Python is well suited to numerical analysis:

First: I had to modify the header row of the data, appending 3 commas so that the number of columns stays consistent:

file_name,#,Date,Time,Temp (°C) ,Intensity,,,

There may be a way to tell pandas to ignore the column mismatch, but I'm still a noob.
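
The manual header edit could also be done in code before the data reaches pandas. The following is a rough standard-library sketch, not something from the original answer: `pad_header` is a hypothetical helper, and it counts fields by commas, so it assumes there are no quoted fields containing commas.

```python
import io

def pad_header(text):
    """Return CSV text whose header row has as many fields as the widest data row."""
    lines = text.splitlines()
    header, rows = lines[0], lines[1:]
    # Field count = number of commas + 1 (assumes no quoted commas).
    widest = max((row.count(",") + 1 for row in rows), default=0)
    missing = widest - (header.count(",") + 1)
    if missing > 0:
        header += "," * missing
    return "\n".join([header] + rows) + "\n"

# Header and first two rows copied from the question.
raw = (
    "file_name,#,Date,Time,Temp (°C) ,Intensity\n"
    "trap12u_10733862_150809.txt,1,05/28/15,06:00:00.0,20.424,215.3,,\n"
    "trap12u_10733862_150809.txt,2,05/28/15,07:00:00.0,21.091,1,130.2,,\n"
)
padded = pad_header(raw)
print(padded.splitlines()[0])  # file_name,#,Date,Time,Temp (°C) ,Intensity,,,

# A file-like object; pandas.read_csv accepts one in place of a file name.
buffer = io.StringIO(padded)
```

For the sample rows this appends the same 3 commas that were added by hand.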

Python code that reads the data into columns and creates 2 new columns named "cu_int" and "cu_char", which contain the parsed elements of the file name:

import pandas

def main():
    df = pandas.read_csv("file.csv")

    df['cu_int'] = 0                                    # Add the new columns to the data frame.

    df['cu_char'] = ' '

    for index, df_row in df.iterrows():
        file_name = df['file_name'][index].strip()

        trap_string = file_name.split("_")[0]           # Get the file_name string prior to the underscore
        numeric_offset_beg = len("trap")                # Parse the number following the 'trap' string.
        numeric_offset_end = len(trap_string) - 1       # Leave off the 'c' or 'u' char.

        numeric_value = trap_string[numeric_offset_beg : numeric_offset_end]
        cu_value = trap_string[len(trap_string) - 1]

        df.at[index, 'cu_int'] = int(numeric_value)     # Assign to this row only, not to the whole column.
        df.at[index, 'cu_char'] = cu_value

    # The pandas dataframe is ready for number crunching.
    # For now just print it out:
    print(df)


if __name__ == "__main__":
    main()

Printed output (note the inconsistencies in the posted data set; see row 1 for an example):

    $ python read_csv.py
                         file_name  #      Date        Time  Temp (°C)     Intensity  Unnamed: 6  Unnamed: 7  Unnamed: 8  cu_int cu_char
0      trap12u_10733862_150809.txt   1  05/28/15  06:00:00.0      20.424        215.3         NaN         NaN         NaN      12       u
1      trap12u_10733862_150809.txt   2  05/28/15  07:00:00.0      21.091          1.0       130.2         NaN         NaN      12       u
2      trap12u_10733862_150809.txt   3  05/28/15  08:00:00.0      26.195          3.0       100.0         NaN         NaN      12       u
3      trap11u_10733862_150809.txt   4  05/28/15  09:00:00.0      25.222          3.0       444.5         NaN         NaN      11       u
4      trap11u_10733862_150809.txt   5  05/28/15  10:00:00.0      26.195          3.0       100.0         NaN         NaN      11       u
5      trap11u_10733862_150809.txt   6  05/28/15  11:00:00.0      25.902          2.0       927.8         NaN         NaN      11       u
6      trap11u_10733862_150809.txt   7  05/28/15  12:00:00.0      25.708          2.0       325.0         NaN         NaN      11       u
7      trap12c_10733862_150809.txt   8  05/28/15  13:00:00.0      26.292          3.0       100.0         NaN         NaN      12       c
8      trap12c_10733862_150809.txt   9  05/28/15  14:00:00.0      26.390          2.0        66.7         NaN         NaN      12       c
9      trap12c_10733862_150809.txt  10  05/28/15  15:00:00.0      26.097          1.0       463.9         NaN         NaN      12       c
