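I have a huge CSV file in which the information should be spread across columns, but the fields are lumped together in a single column, in no particular order. I know how to split a column on a delimiter, but when I do that, the resulting columns do not hold consistent information. An example looks like this: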
Person Information
Mary Married: Yes
John Number of children: three, Married: No
Susan
Betty Do you like icecream?: Yes, Married: Yes, Number of chidren: four
Daniel Do you like icecream?: Sometimes, Number of chidren: two
Conrad Married: No, Do you like icecream?: No
Ofelia Married: No, Do you read?: Yes, Do you like icecream?: Some flavors
When I split into columns with str.split, I end up with a column that contains:
Yes
three
(empty space)
Yes (but this is the answer to another question)
Sometimes
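and so on. What I want is one column containing whether each person is married, another with the number of children, another for whether they like ice cream, and so on.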
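Answer 0 (score: 0)
I suggest reading the file line by line and processing each line. I tried to recreate the example you gave us, and used the following snippet to parse strings that contain several keys in no fixed order: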
import pandas as pd

# Example recreation
s0 = "Number of children: three, Married: No"
s1 = "Do you like icecream?: Yes, Married: Yes, Number of chidren: four"
s2 = "Married: No, Do you read?: Yes, Do you like icecream?: Some flavors"
strings = [s1, s2, s0]

rows = []
for s in strings:
    # split into "key: value" elements, then split each one on the first colon only
    pairs = (element.split(':', 1) for element in s.split(', '))
    rows.append({key.strip(): value.strip() for key, value in pairs})

# DataFrame.append was removed in pandas 2.0, so build the frame from the list of dicts
result = pd.DataFrame(rows)
This gives you the following result:
   Do you like icecream?  Married  Number of chidren  Do you read?  Number of children
0                    Yes      Yes               four           NaN                 NaN
1           Some flavors       No                NaN           Yes                 NaN
2                    NaN       No                NaN           NaN               three
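As you can see, the word "children" is misspelled as "chidren" in some rows of your example, which is why the number of children ends up in two separate columns.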
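One way to merge the misspelled column back into the correct one is to normalize keys while parsing. The following is only a minimal sketch: KEY_ALIASES and parse_fields are hypothetical names introduced here, and the alias table is assumed to be maintained by hand for whatever misspellings show up in the real file.

```python
# Hypothetical, hand-maintained table mapping misspelled keys to their canonical form.
KEY_ALIASES = {"Number of chidren": "Number of children"}


def parse_fields(info):
    """Parse 'Key: value, Key: value' into a dict that uses canonical keys."""
    fields = {}
    for element in info.split(", "):
        if ":" not in element:
            continue  # skip anything that is not a key/value pair
        key, value = element.split(":", 1)  # split on the first colon only
        key = key.strip()
        fields[KEY_ALIASES.get(key, key)] = value.strip()
    return fields


rows = [
    parse_fields(s)
    for s in [
        "Do you like icecream?: Yes, Married: Yes, Number of chidren: four",
        "Married: No, Do you read?: Yes, Do you like icecream?: Some flavors",
        "Number of children: three, Married: No",
    ]
]
```

With the aliases applied, both "four" and "three" are stored under "Number of children", so a DataFrame built from rows would have a single column for that question.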
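Answer 1 (score: 0)
This makes a lot of assumptions about the data format. But if the data follows the pattern NAME COLUMN_NAME: COLUMN_DATA, COLUMN_NAME: COLUMN_DATA, then you need str.split() to get the name, str.split(', ') to get the other fields, and str.split(': ') to get each column's name and value.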
import pandas as pd

# read the csv lines
records = []

def process_text(text):
    """
    text format: "NAME COLUMN_NAME: COLUMN_DATA, COLUMN_NAME: COLUMN_DATA"
    """
    # separate NAME from other columns (assumes NAME is a single word)
    data = text.split()
    # create a dict for all the COLUMN_NAME: COLUMN_DATA values,
    # skipping elements without ': ' (e.g. a line that holds only a name)
    pairs = [field.split(': ', 1) for field in ' '.join(data[1:]).split(', ')]
    fields = {pair[0]: pair[1] for pair in pairs if len(pair) == 2}
    # add the NAME to the dict
    fields['name'] = data[0]
    return fields

# process line by line and make a dataframe
df = pd.DataFrame([process_text(record) for record in records])
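To check this kind of parsing against the sample from the question, including a row like Susan that has no information at all, here is a stdlib-only sketch. SAMPLE and parse_line are illustrative names, not part of the answer above, and the sketch assumes that each name is a single word and that no value contains ", ".

```python
# Sample rows copied from the question (header line omitted).
SAMPLE = """Mary Married: Yes
John Number of children: three, Married: No
Susan
Betty Do you like icecream?: Yes, Married: Yes, Number of chidren: four
Daniel Do you like icecream?: Sometimes, Number of chidren: two
Conrad Married: No, Do you like icecream?: No
Ofelia Married: No, Do you read?: Yes, Do you like icecream?: Some flavors"""


def parse_line(line):
    """Turn 'NAME Key: value, Key: value' into a dict keyed by column name."""
    # Everything before the first space is the name; the rest is the field list.
    name, _, info = line.strip().partition(" ")
    row = {"name": name}
    for element in info.split(", "):
        key, sep, value = element.partition(": ")
        if sep:  # ignore empty or malformed elements (e.g. a name-only line)
            row[key.strip()] = value.strip()
    return row


rows = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]

# One "column": marital status per person, None where the answer is missing.
married = {row["name"]: row.get("Married") for row in rows}
```

The resulting list of dicts can be passed to pd.DataFrame(rows) in the same way as in the answer; keys missing from a row become NaN in the corresponding column.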