Using pandas with CSV

Date: 2016-11-07 19:10:21

Tags: python csv pandas

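I am new to Python. I am using Pandas to edit a csv file, and I found a function that does the job (ty Daniel Himmelstein). I was wondering if someone could tell me how to modify the function so that it does not update the last 2 columns in the spreadsheet, which are named 'Start_X' and 'Start_Y'. I need it to leave those cells blank, since they will be filled in with new data later. Thanks,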

Example of the starting spreadsheet:

AK      MINE    VET                                     X       Y
1016649 0       90;59,180;26,270;39,0;9,270;20,0;17,    482547  1710874

Example of how it needs to be formatted:

AK      MINE    VET   VET_2     X       Y
1016649 0       90    59        482547  1710874
1016649 0      180    26
1016649 0      270    39 
1016649 0        0     9 
1016649 0      270    20
1016649 0        0    17

Here is the code:

def tidy_split(df, column, sep='|', keep=False):
    """
    Split the values of a column and expand so the new DataFrame has one split
    value per row. Filters rows where the column is missing.

    Params
    ------
    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column's values
    keep : bool
        whether to retain the presplit value as its own row

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df

2 Answers:

Answer 0: (score: 0)

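This kind of thing is better handled with the csv module. Pandas is great for analyzing and manipulating data, but in this case I would format the file properly before loading it into a DataFrame.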

You could do:

import csv

# Where the new data will be stored
data = []

# Open up the csv file
with open('file.csv', 'r', newline='') as f:
    # Go through each row
    for i, row in enumerate(csv.reader(f)):
        if i == 0:
            continue
        # Break up the row based on the columns
        ak, mine, *vet, x, y = row

        # Get VET and VET_2
        v12 = [v.split(';') for v in vet]

        # Create new rows with split values of `vet`
        for j, (v1, v2) in enumerate(v12):
            if j == 0:
                new = [ak, mine, v1, v2, x, y]
            else:
                new = [ak, mine, v1, v2, None, None] # Leave blank spaces after first value

            data.append(new)

# Write out to a new csv file
with open('new_file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # Write header
    writer.writerow(['AK', 'MINE', 'VET', 'VET_2', 'X', 'Y'])
    # Write data
    writer.writerows(data)

With this input file.csv:

AK,MINE,VET,,,,,,X,Y
1016649,0,90;59,180;26,270;39,0;9,270;20,0;17,482547,1710874

I get the following output new_file.csv:

AK,MINE,VET,VET_2,X,Y
1016649,0,90,59,482547,1710874
1016649,0,180,26,,
1016649,0,270,39,,
1016649,0,0,9,,
1016649,0,270,20,,
1016649,0,0,17,,

Once the csv file is formatted "properly", loading it into pandas is much easier.
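As a quick check of that last step, here is a minimal sketch that reads the output shown above with pandas.read_csv. The file contents are inlined through io.StringIO (an assumption made only so the snippet runs on its own); in practice you would pass the path 'new_file.csv' instead. The empty X and Y fields should come back as missing values.

```python
import io

import pandas as pd

# Contents of new_file.csv from above, inlined so the example is self-contained.
csv_text = """AK,MINE,VET,VET_2,X,Y
1016649,0,90,59,482547,1710874
1016649,0,180,26,,
1016649,0,270,39,,
1016649,0,0,9,,
1016649,0,270,20,,
1016649,0,0,17,,
"""

df = pd.read_csv(io.StringIO(csv_text))

# Empty fields are read as NaN, so X and Y end up as float columns.
print(df)
print(df[['X', 'Y']].isna().sum())
```

Because the blanks are read as NaN, you can fill them in later with ordinary assignment once the new data is available.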

Answer 1: (score: 0)

The method you are looking for is DataFrame.stack(), which reshapes the DataFrame from having all the "vets" in one row to having each "vet" in its own row.
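To show what stack() does in isolation, here is a minimal sketch on a made-up frame. The name `wide` and its layout (one column per "vet" entry, indexed by AK) are assumptions for illustration, not the spreadsheet as pandas would load it.

```python
import pandas as pd

# Hypothetical wide frame: one row per AK, one column per "vet" entry.
wide = pd.DataFrame(
    {0: ['90;59'], 1: ['180;26'], 2: ['270;39']},
    index=pd.Index([1016649], name='AK'),
)

# stack() moves the column labels into an inner index level,
# giving one entry per (AK, original column) pair.
stacked = wide.stack()
print(stacked)
```

The result is a Series with a two-level index, which is why the inner level is usually dropped afterwards.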

Once the shape is right, you can go on to split the data further. This should get you started:

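import pandas as pd

# df is assumed to already hold the spreadsheet, with VET as one comma-separated string
s = df.VET.str.split(",").apply(pd.Series).stack()
s.index = s.index.droplevel(-1)  # drop the inner index level added by stack()
s = s.apply(lambda x: pd.Series(x.split(";"))).dropna()  # one column per ';' part
result = df.drop("VET", axis=1).join(s)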
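The snippet above repeats X and Y on every row. One possible way to leave them blank after the first row is sketched below. It swaps in Series.explode() (pandas 0.25 or later) for the apply/stack step, and the sample frame, the Int64 casts, and the names `pairs`, `vets` and `is_first` are assumptions for illustration only.

```python
import pandas as pd

# Sample row from the question; VET is assumed to be one comma-separated string.
df = pd.DataFrame({
    'AK': [1016649],
    'MINE': [0],
    'VET': ['90;59,180;26,270;39,0;9,270;20,0;17,'],
    'X': [482547],
    'Y': [1710874],
})

# One entry per "a;b" pair; the trailing comma yields an empty string, dropped here.
pairs = df['VET'].str.split(',').explode()
pairs = pairs[pairs != '']

# Split each pair into the two output columns.
vets = pairs.str.split(';', expand=True)
vets.columns = ['VET', 'VET_2']

# Join back on the original row index, then restore the desired column order.
result = df.drop(columns='VET').join(vets)
result = result[['AK', 'MINE', 'VET', 'VET_2', 'X', 'Y']]

# Rows that repeat an original index are the 2nd, 3rd, ... rows of a record.
is_first = ~result.index.duplicated(keep='first')
result = result.reset_index(drop=True)

# Nullable integers keep X and Y as whole numbers while allowing blanks.
result = result.astype({'X': 'Int64', 'Y': 'Int64'})
result.loc[~is_first, ['X', 'Y']] = pd.NA

print(result.to_csv(index=False))
```

Missing values are written as empty fields by to_csv, so the blanks can be filled in later.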