Unable to parse string-quoted CSV data using pandas

Time: 2019-09-14 17:58:40

Tags: python pandas csv dataframe

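I am trying to parse this CSV data, which has quotation marks at the end of every row, irregularly formatted quotes within the rows, and semicolons mixed in.

I cannot get pandas to parse this file correctly.

Here is a link to the data (the pastebin did not recognize it as text/csv for some reason, so I picked an arbitrary format; please ignore that):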

https://paste.gnome.org/pr1pmw4w2

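I tried calling read_csv with only the file name as an argument, and also with "," as the separator, constructing the pandas DataFrame in the usual way.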

import pandas as pd

header = ["Organization_Name","Organization_Name_URL","Categories","Headquarters_Location","Description","Estimated_Revenue_Range","Operating_Status","Founded_Date","Founded_Date_Precision","Contact_Email","Phone_Number","Full_Description","Investor_Type","Investment_Stage","Number_of_Investments","Number_of_Portfolio_Organizations","Accelerator_Program_Type","Number_of_Founders_(Alumni)","Number_of_Alumni","Number_of_Funding_Rounds","Funding_Status","Total_Funding_Amount","Total_Funding_Amount_Currency","Total_Funding_Amount_Currency_(in_USD)","Total_Equity_Funding_Amount","Total_Equity_Funding_Amount_Currency","Total_Equity_Funding_Amount_Currency_(in_USD)","Number_of_Lead_Investors","Number_of_Investors","Number_of_Acquisitions","Transaction_Name","Transaction_Name_URL","Acquired_by","Acquired_by_URL","Announced_Date","Announced_Date_Precision","Price","Price_Currency","Price_Currency_(in_USD)","Acquisition_Type","IPO_Status","Number_of_Events","SimilarWeb_-_Monthly_Visits","Number_of_Founders","Founders","Number_of_Employees"]

pd.read_csv("data.csv", sep=",", encoding="utf-8", names=header)

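1 answer:

Answer 0: (score: 0)

First, read the data in the normal way. All of the data will end up in the first column. You can then use the pyparsing module to split each value on ',' and assign the result back to the row. The code below does this for the first row; you need to repeat it for every row. I hope this answers your question.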

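import pyparsing as pp
import pandas as pd

# Read the file as-is; per the description above, each line lands in the first column
df = pd.read_csv('input.csv')
# Split the first row's string on commas (quoted strings stay together) and assign it back
df.loc[0] = pp.commaSeparatedList.parseString(df['Organization Name'][0]).asList()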

Output

df #(since there are 42 columns, pasting just a snippet)

[Image: data after assigning]