Python - pandas - organizing a poorly structured csv file

Time: 2019-12-06 17:38:50

Tags: python pandas csv

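I have a huge csv file in which information that should be in separate columns is all lumped together, in no particular order. I know how to split columns using a delimiter, but when I do, a single column doesn't contain consistent information. An example looks like this: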

Person  Information
Mary    Married: Yes
John    Number of children: three, Married: No
Susan
Betty   Do you like icecream?: Yes, Married: Yes, Number of chidren: four
Daniel  Do you like icecream?: Sometimes, Number of chidren: two
Conrad  Married: No, Do you like icecream?: No
Ofelia  Married: No, Do you read?: Yes, Do you like icecream?: Some flavors

When I use str.split to split it into columns, I end up with a column that contains:

Yes
three
(empty space)
Yes (but this is the answer to another question)
Sometimes

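and so on. What I want is one column with all the information about whether the person is married, another column with the number of children, another with whether they like ice cream, and so on.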

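2 answers:

Answer 0 (score: 0)

I suggest reading the file line by line and processing each line as you go. I recreated the example you gave us and used the following snippet to parse a string with multiple unordered keys: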

import pandas as pd

# Example recreation (names chosen so the built-in str is not shadowed)
s0 = "Number of children: three, Married: No"
s1 = "Do you like icecream?: Yes, Married: Yes, Number of chidren: four"
s2 = "Married: No, Do you read?: Yes, Do you like icecream?: Some flavors"

strings = [s1, s2, s0]

rows = []
for s in strings:
    # split into "key: value" elements, then split each element at its first colon
    row = dict((x.strip(), y.strip()) for x, y in (element.split(':', 1) for element in s.split(', ')))
    rows.append(row)

# DataFrame.append was removed in pandas 2.0, so build the frame from the list of dicts
result = pd.DataFrame(rows)

It gives you the following result:

  Do you like icecream? Married Number of chidren Do you read? Number of children
0                   Yes     Yes              four          NaN                NaN
1          Some flavors      No               NaN          Yes                NaN
2                   NaN      No               NaN          NaN              three

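As you can see, the word "children" is misspelled in your example ("chidren"), which is why it ends up in two separate columns.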
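
If the misspelling is in your real data as well, one option is to rename the keys while parsing. Below is a minimal sketch, assuming the rows are already available as (person, information) pairs. `KEY_FIXES` and `parse_info` are names I made up for illustration, and the mapping only covers the one typo visible in your example:

```python
import pandas as pd

# Hypothetical mapping from misspelled keys to the intended column name
KEY_FIXES = {"Number of chidren": "Number of children"}


def parse_info(info):
    """Turn 'key: value, key: value' into a dict; empty input gives {}."""
    row = {}
    for element in info.split(", "):
        if not element.strip():
            continue  # e.g. Susan has no information at all
        key, value = element.split(":", 1)
        key = key.strip()
        row[KEY_FIXES.get(key, key)] = value.strip()
    return row


# A few rows from the question, as (person, information) pairs
rows = [
    ("John", "Number of children: three, Married: No"),
    ("Susan", ""),
    ("Betty", "Do you like icecream?: Yes, Married: Yes, Number of chidren: four"),
]

result = pd.DataFrame([{"Person": person, **parse_info(info)} for person, info in rows])
print(result)
```

Both spellings now land in a single "Number of children" column, and people without any information simply get NaN everywhere except "Person".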

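Answer 1 (score: 0)

This makes a lot of assumptions about the data format. But if the data follows the pattern NAME COLUMN_NAME: COLUMN_DATA, COLUMN_NAME: COLUMN_DATA, then you need str.split() to get the name, str.split(', ') to get the other fields, and str.split(': ') to get each column's name and value.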
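
For a single line of the sample, those three splits look roughly like this. It is only a sketch on a hard-coded string: `line` is an illustrative variable, and I pass a maxsplit of 1 so the rest of the line stays in one piece after the name is peeled off:

```python
line = "John    Number of children: three, Married: No"

# 1. split on whitespace once to separate the name from everything else
name, rest = line.split(None, 1)

# 2. split the remainder into "COLUMN_NAME: COLUMN_DATA" chunks
chunks = rest.split(', ')

# 3. split each chunk into its column name and its value
pairs = [chunk.split(': ', 1) for chunk in chunks]

print(name)    # John
print(chunks)  # ['Number of children: three', 'Married: No']
print(pairs)   # [['Number of children', 'three'], ['Married', 'No']]
```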

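import pandas as pd

# read the csv lines into this list, one string per data line (header excluded)
records = []


def process_text(text):
    """
    text format: "NAME COLUMN_NAME: COLUMN_DATA, COLUMN_NAME: COLUMN_DATA"
    """
    # separate NAME from the other columns at the first run of whitespace
    parts = text.strip().split(None, 1)
    name = parts[0]
    rest = parts[1] if len(parts) > 1 else ''
    # create a dict for all the COLUMN_NAME: COLUMN_DATA values
    # (a line with only a name, like "Susan", yields an empty dict)
    fields = dict(field.split(': ', 1) for field in rest.split(', ') if field)
    # add the NAME to the dict; this must be an assignment ('=', not ':')
    fields['name'] = name

    return fields

# process line by line and make a dataframe
df = pd.DataFrame([process_text(record) for record in records])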
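
If the data is already loaded into a DataFrame, the same splits can also be done with pandas string methods instead of a Python loop. This is a sketch under assumptions rather than a drop-in solution: it assumes two columns named Person and Information as in the question, and that every field contains exactly the ': ' separator:

```python
import pandas as pd

df = pd.DataFrame({
    "Person": ["Mary", "John", "Susan", "Conrad"],
    "Information": [
        "Married: Yes",
        "Number of children: three, Married: No",
        None,  # Susan has no information
        "Married: No, Do you like icecream?: No",
    ],
})

# One row per "COLUMN_NAME: COLUMN_DATA" chunk, keeping the original row index
chunks = df["Information"].str.split(", ").explode().dropna()

# Split each chunk into column name and value at the first ': '
pairs = chunks.str.split(": ", n=1, expand=True)
pairs.columns = ["key", "value"]

# Move the keys into columns; reindex brings back people with no information
wide = pairs.set_index("key", append=True)["value"].unstack("key").reindex(df.index)

result = df[["Person"]].join(wide)
print(result)
```

Each distinct key becomes its own column regardless of the order in which it appeared, and missing answers show up as NaN.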