Question

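I am trying to read a large file (~8 GB) using pandas read_csv. In one of the columns of the data there is sometimes a list that includes commas, but it is enclosed in curly braces, e.g.

"label1","label2","label3","label4","label5"

"{A1}","2","","False","{"apple" : false, "pear" : false, "banana" : null}"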

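Therefore, when these particular rows are read in, I get the error "Error tokenizing data. C error: Expected 37 fields in line 35, saw 42". I found this solution, which says to add sep=",(?![^{]*})" to the read_csv arguments, and that splits the data correctly. However, the data now includes quotation marks around every entry (this did not happen before I added the sep argument).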

The data now looks like this:

"label1"   "label2"   "label3"   "label4"   "label5"

"{A1}"   "2"   ""   "False"   "{"apple" : false, "pear" : false, "banana" : null}"

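This means I can't use .describe() etc. on the numeric data, because the values are still strings.

Does anyone know how to read the file in without the quotation marks, but still split the data in the right places?

I am very new to this, so apologies if there is an obvious solution.

serialdev found a solution for removing the "s, but the data columns are objects rather than what I would expect/want, e.g. integer values are not seen as integers.

The data needs to be split explicitly on "," (including the "s). Is there a way to state that in the read_csv arguments?

Thanks!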

Answer 1

Would this work, given that you already have all the data you need:

.map(lambda x: x.lstrip('\"').rstrip('\"'))

So just clean up all occurrences of " afterwards.

Edit with an example:

mydata = [{'"first_name"' : '"bill', 'age': '"75"'},
          {'"first_name"' : '"bob', 'age': '"7"'},
          {'"first_name"' : '"ben', 'age': '"77"'}]
IN: df = pd.DataFrame(mydata)
OUT:
  "first_name"   age
0        "bill  "75"
1         "bob   "7"
2         "ben  "77"

IN: df['"first_name"'] = df['"first_name"'].map(lambda x: x.lstrip('\"').rstrip('\"'))
OUT:
0    bill
1     bob
2     ben
Name: "first_name", dtype: object

Use this call after selecting the column. It is not ideal, but it will get the job done:

.map(lambda x: x.lstrip('\"').rstrip('\"'))

After applying this pattern, you can change the dtypes:

df['col'].apply(lambda x: pd.to_numeric(x, errors='ignore'))

Or simply:

df[['col2','col3']] = df[['col2','col3']].apply(pd.to_numeric)
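The two steps above (strip the quotes, then convert the dtype) can be sketched end to end as follows. The sample frame and its column names are hypothetical, and errors='coerce' is an assumption about the desired behaviour: it turns values that are not numbers into NaN instead of leaving the column as strings.

```python
import pandas as pd

# Hypothetical sample: values that kept their surrounding quotes after parsing.
df = pd.DataFrame({"name": ['"bill"', '"bob"', '"ben"'],
                   "age": ['"75"', '"7"', '"77"']})

# Step 1: strip the leading and trailing quote from every column.
for col in df.columns:
    df[col] = df[col].str.strip('"')

# Step 2: convert the numeric column explicitly.
# Anything that cannot be parsed as a number becomes NaN.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Numeric methods now work on the converted column.
mean_age = df["age"].mean()
print(df.dtypes)
print(mean_age)
```

Once the column is numeric, df.describe() reports statistics for it.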

Answer 2

If you need to remove " from a column, use the vectorized function str.strip:

import pandas as pd

mydata = [{'"first_name"': '"Bill"', '"age"': '"7"'},
          {'"first_name"': '"Bob"', '"age"': '"8"'},
          {'"first_name"': '"Ben"', '"age"': '"9"'}]
df = pd.DataFrame(mydata)
print (df)
  "age" "first_name"
0   "7"       "Bill"
1   "8"        "Bob"
2   "9"        "Ben"

df['"first_name"'] = df['"first_name"'].str.strip('"')
print (df)
  "age" "first_name"
0   "7"         Bill
1   "8"          Bob
2   "9"          Ben

If you need to apply str.strip() to all columns, use:

df = pd.concat([df[col].str.strip('"') for col in df], axis=1)
df.columns = df.columns.str.strip('"')
print (df)
  age first_name
0   7       Bill
1   8        Bob
2   9        Ben
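As a possible alternative to the concat approach above, DataFrame.apply can run the same vectorized strip on each column. This is only a sketch on the same sample data; the names raw, cleaned and via_concat are made up for the illustration.

```python
import pandas as pd

raw = pd.DataFrame({'"first_name"': ['"Bill"', '"Bob"', '"Ben"'],
                    '"age"': ['"7"', '"8"', '"9"']})

# apply() calls the function once per column; each call is a vectorized str.strip.
cleaned = raw.apply(lambda col: col.str.strip('"'))
cleaned.columns = cleaned.columns.str.strip('"')

# The concat variant from the answer, for comparison.
via_concat = pd.concat([raw[col].str.strip('"') for col in raw], axis=1)
via_concat.columns = via_concat.columns.str.strip('"')

print(cleaned)
print(cleaned.equals(via_concat))
```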

Timings:

mydata = [{'"first_name"': '"Bill"', '"age"': '"7"'},
          {'"first_name"': '"Bob"', '"age"': '"8"'},
          {'"first_name"': '"Ben"', '"age"': '"9"'}]
df = pd.DataFrame(mydata)
df = pd.concat([df]*3, axis=1)
df.columns = ['"first_name1"','"age1"','"first_name2"','"age2"','"first_name3"','"age3"']
#create sample [300000 rows x 6 columns]
df = pd.concat([df]*100000).reset_index(drop=True)
df1,df2 = df.copy(),df.copy()

def a(df):
    df.columns = df.columns.str.strip('"')
    df['age1'] = df['age1'].str.strip('"')
    df['first_name1'] = df['first_name1'].str.strip('"')
    df['age2'] = df['age2'].str.strip('"')
    df['first_name2'] = df['first_name2'].str.strip('"')
    df['age3'] = df['age3'].str.strip('"')
    df['first_name3'] = df['first_name3'].str.strip('"')
    return df

def b(df):
    #apply  str function to all columns in dataframe
    df = pd.concat([df[col].str.strip('"') for col in df], axis=1)
    df.columns = df.columns.str.strip('"')
    return df

def c(df):
    #apply  str function to all columns in dataframe
    df = df.applymap(lambda x: x.lstrip('\"').rstrip('\"')) 
    df.columns = df.columns.str.strip('"')
    return df

print (a(df))
print (b(df1))
print (c(df2))

In [135]: %timeit (a(df))
1 loop, best of 3: 635 ms per loop

In [136]: %timeit (b(df1))
1 loop, best of 3: 728 ms per loop

In [137]: %timeit (c(df2))
1 loop, best of 3: 1.21 s per loop

Answer 3

Reading in the data structure you specified, where the last element is of unknown length.

"{A1}","2","","False","{ "apple" : false, "pear" : false, "banana" : null}"

"{A1}","2","","False","{ "apple" : false, "pear" : false, "banana" : null, "orange": "true"}"

Change the separator to a regular expression that uses a negative lookahead assertion. This lets you split on ',' only when it is not immediately followed by a space.

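# header=None because the sample rows have no header line (the output below shows default integer column labels)
df = pd.read_csv('my_file.csv', sep=r'[,](?!\s)', engine='python', thousands='"', header=None)

print(df)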

        0  1   2        3                                                  4
0  "{A1}"  2 NaN  "False"  "{ "apple" : false, "pear" : false, "banana" :...
1  "{A1}"  2 NaN  "False"  "{ "apple" : false, "pear" : false, "banana" :...
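To see what the negative lookahead does to a single row, the same pattern can be tried with Python's re module. This only illustrates the regular expression on the sample row and says nothing about how pandas parses internally. It also shows why the quotes stay in the fields: a plain regex split does not treat them specially.

```python
import re

line = '"{A1}","2","","False","{ "apple" : false, "pear" : false, "banana" : null}"'

# Split on a comma only when it is NOT followed by whitespace.
# The commas inside the braces are followed by a space, so they are left alone.
fields = re.split(r',(?!\s)', line)

for field in fields:
    print(field)
```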

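Specifying the quote character as the thousands separator is a hacky way to parse fields that contain a quoted integer into the correct data type. You can get the same result with converters, which can also strip the quotes from strings and convert "True" or "False" to booleans if needed.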
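A minimal sketch of the converters route, under some assumptions: the file has no header row (hence header=None and integer column keys), the two sample rows are read from an in-memory string rather than a real file, and the helper names unquote, to_int and to_bool are made up for the illustration.

```python
import io
import pandas as pd

sample = (
    '"{A1}","2","","False","{ "apple" : false, "pear" : false, "banana" : null}"\n'
    '"{A1}","2","","False","{ "apple" : false, "pear" : false, "banana" : null, "orange": "true"}"\n'
)

def unquote(value):
    # Remove the surrounding double quotes left in the field.
    return value.strip('"')

def to_int(value):
    # Quoted integer such as "2" -> 2
    return int(unquote(value))

def to_bool(value):
    # Quoted "True"/"False" -> Python bool
    return unquote(value) == "True"

df = pd.read_csv(
    io.StringIO(sample),
    sep=r',(?!\s)',       # comma not followed by whitespace
    engine='python',      # regex separators need the python engine
    header=None,
    converters={0: unquote, 1: to_int, 3: to_bool, 4: unquote},
)

print(df.dtypes)
print(df)
```

Column 2 is left without a converter here because it is empty in the sample; how empty fields should be represented is a separate decision.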

Answer 4

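It depends on your file. Have you checked whether the data has commas inside a cell? If you have something like Banana: fruit, tropical, edible, etc. in the same cell, you will get this kind of bug. One basic solution is to remove all the commas inside the cells of the file. Or, if you are able to read the file, you can remove the special characters afterwards: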

 >>>df
                 Banana
 0  Hello, Salut, Salom
 1              Bonjour


 >>>df['Banana'] = df['Banana'].str.replace(',','')
 >>>df
               Banana
 0  Hello Salut Salom
 1            Bonjour
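One way to check for stray commas before loading is to count the fields per row with the standard csv module. This is a sketch on a hypothetical in-memory sample; for a real file the reader would be built from an open file object instead.

```python
import csv
import io

# Hypothetical sample: row 2 quotes its commas correctly, row 3 does not.
sample = 'Banana,Count\n"Hello, Salut, Salom",3\nBonjour, Hola,2\n'

reader = csv.reader(io.StringIO(sample))
header = next(reader)

# Collect (line number, row) for every row whose field count differs from the header.
bad_rows = [(reader.line_num, row) for row in reader if len(row) != len(header)]

for line_no, row in bad_rows:
    print(f"line {line_no}: expected {len(header)} fields, saw {len(row)}")
```

Rows reported this way are the ones that would trigger the "Expected N fields" tokenizing error.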

Additional commas when using read_csv result in too many "s in the data frame
