Question

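I have a large Excel sheet in which all the data about different people is stored in a single cell. I have prepared the data so that it can be split on delimiters: a hyphen (-) separates the individuals, and a semicolon (;) separates the pieces of information about each individual. I want to use these delimiters to split the data into separate columns, but not every cell contains information on the same number of people, so I can't use a fixed number of columns. I need to create a DataFrame from the data I have.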

Here is a sample of my data:

As you can see, each cell lists a different number of people. I would like the final output to look like this:

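A person's name always follows the (-), and I only care about the first three pieces of data for each person, which correspond to name, title, and email; the rest is superfluous. I tried Text to Columns in Excel, and it deleted most of the rows. I also tried splitting on the delimiters with a regular expression, but because the number of columns has to be fixed in advance, I couldn't spread the result across multiple columns.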

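So I need code that iterates over all the rows, splits the information on (-), and puts the first string after the (-) in the first column, the second string (after the first ;) in the second column, the third string (after the next ;) in the third column, and so on. Since some cells hold only one person and others hold several, this has to repeat along the row as many times as needed.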

Thanks

Answer 1

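One caveat: if you split on '-', note that this character also appears elsewhere, for example in 'Co-Founder'. One approach is to handle these instances first, so that '-' only appears before a name. Since you mentioned wanting a pandas DataFrame, you can use a single apply statement to format the information in each row: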

import itertools

import pandas as pd


def format_records(row):
    """Split records to construct DataFrame"""

    # Replace 'Co-Founder' with 'CoFounder'. The '-' will cause the split command to think Founder is someone's name
    row = row[0].replace('Co-Founder', 'CoFounder').replace('Co-founder', 'CoFounder')

    # Split each record (one per person) using '-' as the delimiter
    records = row.split('-')[1:]

    # Split data constituting each record by ';' and return the first three elements
    elements = [r.split(';')[:3] for r in records]

    # Construct new row by joining the first three elements of each record
    new_row = list(itertools.chain.from_iterable(elements))

    # Correct for the previous co-founder conversion
    new_row = [r.replace('CoFounder', 'Co-Founder') for r in new_row]

    # Convert to series
    new_series = pd.Series(new_row)

    return new_series


if __name__ == '__main__':
    # Read in data
    df = pd.read_excel('data.xlsx', header=None)

    # Re-organise data
    new_df = df.apply(format_records, axis=1)

    # Number of times the ['Name', 'Title', 'Email'] sequence should repeat (based on number of columns of new_df)
    repetitions = new_df.shape[1] // 3

    # Add column names
    new_df.columns = ['Name', 'Title', 'Email'] * repetitions
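
Because the number of people varies from cell to cell, a "long" layout with one row per person may be easier to work with than repeated Name/Title/Email column groups. The following is only a sketch of that alternative, not the layout the question asks for. The sample cells, the `parse_people` helper, and the `source_row` column are hypothetical names introduced for illustration, and it assumes every record starts with '-' as described in the question. It also pads records that have fewer than three fields, so a missing email cannot shift later values into the wrong column.

```python
import pandas as pd

FIELDS = ["Name", "Title", "Email"]


def parse_people(cell):
    """Return one [name, title, email] list per person found in a cell."""
    # Protect the hyphen in 'Co-Founder' so it is not treated as a record delimiter
    text = str(cell).replace("Co-Founder", "CoFounder").replace("Co-founder", "CoFounder")
    people = []
    # Everything before the first '-' is discarded, as in the code above
    for record in text.split("-")[1:]:
        # Keep only the first three fields and restore the protected hyphen
        parts = [p.strip().replace("CoFounder", "Co-Founder") for p in record.split(";")[:3]]
        # Pad short records so every person has exactly three fields
        parts += [None] * (len(FIELDS) - len(parts))
        people.append(parts)
    return people


# Hypothetical sample cells; the real data would come from pd.read_excel
cells = pd.Series([
    "-Ann Lee;CEO;ann@example.com;extra-Bob Ray;Co-Founder;bob@example.com",
    "-Cy Moe;CTO",
])

# One output row per person, tagged with the index of the cell it came from
rows = [
    [source_row, *person]
    for source_row, cell in cells.items()
    for person in parse_people(cell)
]
long_df = pd.DataFrame(rows, columns=["source_row", *FIELDS])
print(long_df)
```

As with the code above, any other hyphen in the data (in a double-barrelled name or an email address, for instance) would still be treated as a record delimiter, so such cases would need the same kind of protection as 'Co-Founder'.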
