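I'm trying to parse this CSV data, which has irregularly placed quotes inside the rows and a semicolon at the end of each row.
I can't get pandas to parse this file correctly.
Here is a link to the data (for some reason the pastebin did not recognize it as text/csv, so I picked an arbitrary format; please ignore that):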
https://paste.gnome.org/pr1pmw4w2
I tried using "," as the delimiter, and I also tried passing only the file name and constructing the pandas DataFrame in the usual way.
import pandas as pd
header = ["Organization_Name","Organization_Name_URL","Categories","Headquarters_Location","Description","Estimated_Revenue_Range","Operating_Status","Founded_Date","Founded_Date_Precision","Contact_Email","Phone_Number","Full_Description","Investor_Type","Investment_Stage","Number_of_Investments","Number_of_Portfolio_Organizations","Accelerator_Program_Type","Number_of_Founders_(Alumni)","Number_of_Alumni","Number_of_Funding_Rounds","Funding_Status","Total_Funding_Amount","Total_Funding_Amount_Currency","Total_Funding_Amount_Currency_(in_USD)","Total_Equity_Funding_Amount","Total_Equity_Funding_Amount_Currency","Total_Equity_Funding_Amount_Currency_(in_USD)","Number_of_Lead_Investors","Number_of_Investors","Number_of_Acquisitions","Transaction_Name","Transaction_Name_URL","Acquired_by","Acquired_by_URL","Announced_Date","Announced_Date_Precision","Price","Price_Currency","Price_Currency_(in_USD)","Acquisition_Type","IPO_Status","Number_of_Events","SimilarWeb_-_Monthly_Visits","Number_of_Founders","Founders","Number_of_Employees"]
pd.read_csv("data.csv", sep=",", encoding="utf-8", names=header)
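Answer 0 (score: 0)
First, read the data as usual. All of the data will end up in the first column. You can then use the pyparsing module to split that value on ',' and assign the result back to the row. You just need to do this for every row. I hope this answers your question.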
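import pandas as pd
import pyparsing as pp
# Read the file as-is; every row ends up as one string in the first column.
df = pd.read_csv('input.csv')
# Split the first row's string on commas (commas inside quoted strings are not treated as separators) and assign it back.
df.loc[0] = pp.commaSeparatedList.parseString(df['Organization Name'][0]).asList()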
Output
df  # (since there are 42 columns, pasting just a snippet)
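The snippet above converts only the first row. As a rough sketch of the "do this for every row" step, the same two-pass idea can be written with the standard-library csv module instead of pyparsing. This assumes each line is a single quoted field followed by a trailing semicolon, with inner quotes doubled; that shape is inferred from the description, not confirmed against the linked file. SAMPLE and split_wrapped_rows are hypothetical names used only for illustration.

```python
import csv

# Hypothetical sample in the ASSUMED shape: each line is one quoted field
# followed by a trailing semicolon, with inner quotes doubled.
SAMPLE = (
    '"Organization Name,Categories,Headquarters Location";\n'
    '"Acme,""Software, Analytics"",""Austin, Texas""";\n'
    '"Globex,Hardware,""Berlin, Germany""";\n'
)


def split_wrapped_rows(text):
    """Unwrap each line, then split it on commas while honouring quotes."""
    rows = []
    for line in text.splitlines():
        # Drop surrounding whitespace and the trailing semicolon.
        line = line.strip().rstrip(";")
        if not line:
            continue
        # First pass: the outer quotes make the whole line a single field.
        outer = next(csv.reader([line]))
        # Second pass: split that field on commas, respecting inner quotes.
        rows.append(next(csv.reader([outer[0]])))
    return rows


rows = split_wrapped_rows(SAMPLE)
header, records = rows[0], rows[1:]
```

If the real file matches this shape, the result could then be loaded with pd.DataFrame(records, columns=header). If the rows are wrapped differently, the first pass would need to be adjusted.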