Question

I am trying to import data from the file at https://drive.google.com/file/d/1leOUk4Z5xp9tTiFLpxgk_7KBv3xwn5eW/view into a pandas DataFrame. I tried

    data = pd.read_csv('data_engineering_assignment.txt',sep="|")

but I get this error: "ParserError: Error tokenizing data. C error: Expected 9 fields in line 231, saw 10". I don't want to use 'error_bad_lines = False' and skip rows of data.

Please help.

Answer 1

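The problem is in your dataset: the description_text field sometimes contains a | character. For example, the row with ID 5d0c7c4c312ff75188d84954 has a | in "of A|X design", so pandas treats the second part as a new column. That is why you get the message "Expected 9 fields, but saw 10". I hope this helps you understand the problem.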
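
To see which rows are affected before deciding how to fix them, one option is to count the delimiters on each line. This is only a minimal sketch and not part of the original answer: `find_bad_lines` is a hypothetical helper, and the small in-memory sample stands in for the real file (it uses 3 columns instead of 9 for brevity).

```python
import io

def find_bad_lines(handle, sep="|", expected_fields=9):
    """Return (line_number, field_count) for lines with an unexpected field count."""
    bad = []
    for lineno, line in enumerate(handle, start=1):
        # A line with k separators splits into k + 1 fields.
        n_fields = line.rstrip("\n").count(sep) + 1
        if n_fields != expected_fields:
            bad.append((lineno, n_fields))
    return bad

# Tiny stand-in for the real file (assumption: 3 fields expected here).
sample = io.StringIO(
    "_id|name|description_text\n"
    "1|foo|plain text\n"
    "2|bar|of A|X design\n"
)
print(find_bad_lines(sample, expected_fields=3))  # [(3, 4)]
```

On the real data you would pass an open file handle for data_engineering_assignment.txt and keep the default of 9 expected fields.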

Answer 2

You can specify the column names yourself and declare that there are 10 of them:

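    import pandas as pd

    # 9 real columns plus 'other' to catch the overflow when description_text contains '|'.
    cols = ['_id','name','price','website_id','sku','url','brand','media','description_text','other']
    # If the file's first line is a header row, also pass header=0 so it is not read as data.
    dataframe = pd.read_csv('./data_engineering_assignment.txt', names=cols, sep='|')
    # 'other' is NaN on well-formed rows; fill it with '' first, otherwise the
    # concatenation would turn description_text into NaN on those rows.
    dataframe['description_text'] = dataframe['description_text'].map(str) + dataframe['other'].fillna('').map(str)
    dataframe.to_csv('./data_engineering_assignment_v2.txt', index=False, sep=',')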

Because pandas has to guess the column data types, you will get a warning about memory usage, but that is fine.
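
If some rows contain more than one stray | in the description, a single extra column will not be enough. An alternative, sketched below under a few assumptions, is to split each line yourself with a maximum number of splits so that everything after the eighth separator stays in the last column. The assumptions are: the first line is a header, description_text is the last column (as in the column list above), and fields are not quoted. `read_pipe_file` is a hypothetical helper, and the sample again uses 3 columns instead of 9.

```python
import io
import pandas as pd

def read_pipe_file(handle, n_cols=9, sep="|"):
    """Parse sep-delimited text, keeping extra separators inside the last column."""
    # split(sep, n_cols - 1) yields at most n_cols parts; blank lines are skipped.
    rows = [line.rstrip("\n").split(sep, n_cols - 1) for line in handle if line.strip()]
    header, body = rows[0], rows[1:]
    return pd.DataFrame(body, columns=header)

# Tiny stand-in for the real file (assumption: 3 columns, header on the first line).
sample = io.StringIO(
    "_id|name|description_text\n"
    "1|foo|plain text\n"
    "2|bar|of A|X design\n"
)
df = read_pipe_file(sample, n_cols=3)
print(df.loc[1, "description_text"])  # of A|X design
```

Every column comes back as a string with this approach, so numeric columns such as price would still need an explicit conversion afterwards.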

How do I import .txt data into a pandas DataFrame?
