Question

I am trying to build subsets of a larger dataframe by searching for a string in the column headers.

df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)

wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for well in wells:
    wellname = well
    well = pd.DataFrame()
    well_cols = [col for col in cdf.columns if wellname in col]
    well = cdf[well_cols]

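I am trying to search the columns of the cdf dataframe for each well name and put the columns that contain that name into a new dataframe named after the well.

I am able to build the new sub-dataframes, but they have shape (0, 0) while cdf is (21973, 91).

well_cols is also populated correctly as a list.

These are some of the cdf column headers. Each column has about 20,000 rows of data.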

Index(['N1_Inj_Casing_Gas_Valve', 'N1_LT_Stm_Rate', 'N1_ST_Stm_Rate',
       'N1_Inj_Casing_Gas_Flow_Rate', 'N1_LT_Stm_Valve', 'N1_ST_Stm_Valve',
       'N1_LT_Stm_Pressure', 'N1_ST_Stm_Pressure', 'N1_Bubble_Tube_Pressure',
       'N1_Inj_Casing_Gas_Pressure', 'N2_Inj_Casing_Gas_Valve',
       'N2_LT_Stm_Rate', 'N2_ST_Stm_Rate', 'N2_Inj_Casing_Gas_Flow_Rate',
       'N2_LT_Stm_Valve', 'N2_ST_Stm_Valve', 'N2_LT_Stm_Pressure',
       'N2_ST_Stm_Pressure', 'N2_Bubble_Tube_Pressure',
       'N2_Inj_Casing_Gas_Pressure', 'N3_Inj_Casing_Gas_Valve',
       'N3_LT_Stm_Rate', 'N3_ST_Stm_Rate', 'N3_Inj_Casing_Gas_Flow_Rate',
       'N3_LT_Stm_Valve', 'N3_ST_Stm_Valve', 'N3_LT_Stm_Pressure',

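I want to create a new dataframe for each well, i.e. one dataframe holding all the columns (and their data) whose names contain N1, another for N2, and so on.

The new dataframes are populated correctly inside the loop but disappear once the loop exits... Partial output of print(well):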

[27884 rows x 10 columns]
       N9_Inj_Casing_Gas_Valve  ...  N9_Inj_Casing_Gas_Pressure
0                    74.375000  ...                 2485.602364
1                    74.520833  ...                 2485.346000
2                    74.437500  ...                 2485.341091

Answer 1

IIUC, this should be enough:

import pandas as pd

df = pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)

wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dict = {}  # maps well name -> sub-dataframe
for well in wells:
    # columns whose name contains this well's identifier
    well_cols = [col for col in cdf.columns if well in col]
    well_dict[well] = cdf[well_cols]

When you want to fill something up like this, a dictionary is the usual choice. In this case, well_dict['N1'] gives you the first dataframe, and so on.
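One caveat with a plain substring test is that 'N1' would also match a column such as 'N10_...' if such a well existed. The sketch below is one way to guard against that by matching the full 'N1_' prefix. The frame and the 'N10' column are made up for illustration and are not from the question's data.

```python
import pandas as pd

# Hypothetical stand-in for cdf: a few columns per well, two rows each.
cdf = pd.DataFrame({
    "N1_LT_Stm_Rate": [1.0, 2.0],
    "N1_ST_Stm_Rate": [3.0, 4.0],
    "N2_LT_Stm_Rate": [5.0, 6.0],
    "N10_LT_Stm_Rate": [7.0, 8.0],  # hypothetical well, not in the question
})

wells = ["N1", "N2"]

# Match on the prefix "N1_" so that "N1" does not also pick up "N10_...".
well_dict = {
    well: cdf.loc[:, cdf.columns.str.startswith(well + "_")]
    for well in wells
}

print(well_dict["N1"].columns.tolist())
```

With only N1 to N9 in the data, as in the question, the simpler `well in col` test gives the same result.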

Answer 2

Reassigning the loop variable while iterating over a list does not change the list's elements. Here is what your example actually does:

# 1st iteration
well = 'N1'  # assigned by the for loop directive
...
well = <empty DataFrame>  # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N1' in name>  # assigned by `well = cdf[well_cols]`
# 2nd iteration
well = 'N2'  # assigned by the for loop directive
...
well = <empty DataFrame>  # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N2' in name>  # assigned by `well = cdf[well_cols]`
...

But at no point do you change the list or store the new dataframes (although the last dataframe is still stored in well when the iteration finishes).

IMO, storing the dataframes in a dict would be easier to work with:

df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)

wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dfs = {}
for well in wells:
    well_cols = [col for col in cdf.columns if well in col]
    well_dfs[well] = cdf[well_cols]

However, if you really want them in a list, the same loop works: for example, append each cdf[well_cols] to a new list instead of assigning it to a dict key.
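To make the rebinding point concrete, here is a minimal sketch on a made-up frame (the column names and values are assumptions, not the question's data). The first loop mirrors the question and leaves the wells list untouched. The second loop collects the subsets in a separate list.

```python
import pandas as pd

# Hypothetical stand-in for cdf.
cdf = pd.DataFrame({
    "N1_a": [1, 2], "N1_b": [3, 4],
    "N2_a": [5, 6], "N2_b": [7, 8],
})

wells = ["N1", "N2"]

# Rebinding the loop variable only changes what the name `well` points to;
# the list itself still holds the original strings afterwards.
for well in wells:
    well = cdf[[col for col in cdf.columns if col.startswith(well + "_")]]

print(wells)  # the list of names is unchanged

# Appending to a separate list keeps every subset, not just the last one.
well_list = []
for name in wells:
    cols = [col for col in cdf.columns if col.startswith(name + "_")]
    well_list.append(cdf[cols])
```

After the first loop only the last subset survives, under the name well. After the second loop, well_list[0] and well_list[1] hold the N1 and N2 subsets in the same order as wells.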

Answer 3

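One way to approach the problem is to use pd.MultiIndex and groupby.

You can add a MultiIndex structure made up of the well identifier and the variable name. If you have df: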

   N1_a  N1_b  N2_a  N2_b
1     2     2     3     4
2     7     8     9    10

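You can use df.columns.str.split('_', expand=True) to split each column name into the well identifier and the corresponding variable name (i.e. a or b).

df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)

Which returns: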

  N1    N2    
   a  b  a   b
0  2  2  3   4
1  7  8  9  10

Then you can transpose the dataframe and group by MultiIndex level 0.

grouped = df.T.groupby(level=0)

To get a list of the un-transposed sub-dataframes, you can use:

wells = [group.T for _, group in grouped]

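where wells[0] holds the N1 columns and wells[1] holds the N2 columns.

The last step is not strictly necessary, because the data can be accessed from the groupby object grouped.

All together: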

import pandas as pd
from io import StringIO

data = """
N1_a,N1_b,N2_a,N2_b
1,2,2,3,4
2,7,8,9,10
"""

df = pd.read_csv(StringIO(data))

# Parse column names to add the well name as a MultiIndex level
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(axis=1)

# Group by well name
grouped = df.T.groupby(level=0)

# Build list of sub-dataframes
wells = [group.T for _, group in grouped]
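As a rough illustration of reading the data straight from the groupby object rather than building the list, the sketch below rebuilds the small example frame by hand (an assumption made to keep it self-contained) and pulls out one well with get_group. It also shows that, once the columns are a MultiIndex, selecting the top level directly is another option.

```python
import pandas as pd

# Hypothetical frame shaped like the small example above.
df = pd.DataFrame(
    [[2, 2, 3, 4], [7, 8, 9, 10]],
    columns=["N1_a", "N1_b", "N2_a", "N2_b"],
)
df.columns = df.columns.str.split("_", expand=True)

grouped = df.T.groupby(level=0)

# Pull one well from the groupby object and transpose it back.
n1 = grouped.get_group("N1").T

# With MultiIndex columns, selecting the top level works as well;
# the result keeps only the inner level (a, b) as column names.
n2 = df["N2"]
```

Here n1 keeps the two-level column labels, while n2 has plain a and b columns, so which form is more convenient depends on what you do with the frames afterwards.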

Answer 4

Use contains. The mask is over the column labels, so it has to be applied to the columns through .loc:

df.loc[:, df.columns.str.contains('|'.join(wells))]
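Note that a joined pattern selects the columns of all wells at once. If one frame per well is wanted, a sketch along these lines could work; the frame below is made up, and DataFrame.filter with an anchored regex is swapped in here as one alternative to str.contains.

```python
import pandas as pd

# Hypothetical stand-in for cdf, with one column that belongs to no well.
cdf = pd.DataFrame({
    "N1_a": [1, 2], "N1_b": [3, 4],
    "N2_a": [5, 6], "Other": [7, 8],
})
wells = ["N1", "N2"]

# Boolean mask over the column labels, applied along the columns via .loc.
mask = cdf.columns.str.contains("|".join(wells))
all_well_cols = cdf.loc[:, mask]

# One frame per well; the regex is anchored so "N1" cannot match "N10_...".
per_well = {well: cdf.filter(regex=f"^{well}_") for well in wells}
```

per_well['N1'] then holds only the N1 columns, and all_well_cols holds every column that belongs to any listed well.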

How do I populate a pandas dataframe in a loop?
