Question

The following code:

import pandas as pd

from io import StringIO  # Python 3; the Python 2 StringIO module is gone
data = StringIO("""a,b,c
               1,2,3
               4,5,6
               6,7,8,9
               1,2,5
               3,4,5""")
# pandas >= 1.3: on_bad_lines="warn" replaces warn_bad_lines=True, error_bad_lines=False
pd.read_csv(data, on_bad_lines="warn")

produces this output:

Skipping line 4: expected 3 fields, saw 4

   a  b  c
0  1  2  3
1  4  5  6
2  1  2  5
3  3  4  5

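That is, the third data row (line 4 of the input, counting the header) is rejected because it contains four values instead of the expected three. This CSV data is considered malformed.

What if I want a different behavior: not skipping rows that have more fields than expected, but keeping their values by using a wider DataFrame?

In the given example, this would be the desired result ('UNK' is just an example; it could be any other string):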

   a  b  c UNK
0  1  2  3 nan
1  4  5  6 nan
2  6  7  8  9
3  1  2  5 nan 
4  3  4  5 nan

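This is just an example with a single additional value. What about an arbitrary (and a priori unknown) number of fields? Can this be achieved somehow with pandas read_csv?

Please note: I can do this with csv.reader; I am just trying to switch to pandas.

Any help or hints are appreciated.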

Answer 1

It seems you need the names parameter when reading the csv:

import pandas as pd

from io import StringIO  # Python 3; the Python 2 StringIO module is gone
data = StringIO("""a,b,c
               1,2,3
               4,5,6
               6,7,8,9
               1,2,5
               3,4,5""")
# pandas >= 1.3: on_bad_lines="warn" replaces warn_bad_lines=True, error_bad_lines=False
df = pd.read_csv(data, on_bad_lines="warn", names=["a", "b", "c", "UNK"])

print(df)

Output:

                  a  b  c  UNK
0                 a  b  c  NaN
1                 1  2  3  NaN
2                 4  5  6  NaN
3                 6  7  8  9.0
4                 1  2  5  NaN
5                 3  4  5  NaN
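
Note that the header line appears as row 0 in the output above, because passing names makes pandas treat every line as data. For an arbitrary and a priori unknown number of fields, one possible sketch (not part of the original answer) is to scan the text once with csv.reader to find the widest row, extend the header with generated names, and then let read_csv parse only the data lines. The helper read_ragged_csv and the UNK1, UNK2, ... naming scheme are assumptions made for illustration, and the sketch assumes the header is the first physical line of the input.

```python
import csv
from io import StringIO

import pandas as pd


def read_ragged_csv(text, extra_prefix="UNK"):
    """Hypothetical helper: parse CSV text whose rows may be wider than the header."""
    # First pass: collect the non-empty rows to find the header and the widest row.
    rows = [row for row in csv.reader(StringIO(text), skipinitialspace=True) if row]
    header = rows[0]
    width = max(len(row) for row in rows)

    # Extend the header with generated names for the surplus fields.
    extra = [f"{extra_prefix}{i}" for i in range(1, width - len(header) + 1)]
    names = header + extra

    # Second pass: consume the header line, then parse the remaining lines
    # with the full list of names so that no row counts as "bad".
    buf = StringIO(text)
    buf.readline()
    return pd.read_csv(buf, names=names, header=None, skipinitialspace=True)


text = """a,b,c
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5"""
df = read_ragged_csv(text)
print(df)
```

Rows shorter than the widest one are filled with NaN in the generated columns, so those columns come back as floats rather than integers.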

Answer 2

Suppose Afile.csv contains:

a,b,c#Incomplete Header
1,2,3
4,5,6
6,7,8,9
1,2,5
3,4,5,,8

The following function produces a DataFrame containing all the fields:

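import pandas as pd

def readRawValuesFromCSV(file1, separator=',', commentMark='#'):
    df = pd.DataFrame()

    with open(file1, 'r') as f:
        for line in f:
            # Drop everything after the comment mark
            b = line.strip().split(commentMark)
            if len(b[0]) > 0:
                # One row per line, as wide as that line's field count
                lineList = tuple(b[0].strip().split(separator))
                # concat widens the frame as needed and fills the gaps with NaN
                df = pd.concat([df, pd.DataFrame([lineList])], ignore_index=True)
    return df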

You can test it with the following code:

file1 = 'Afile.csv'
# Read all values of a (maybe malformed) CSV file
df = readRawValuesFromCSV(file1, ',', '#')

This yields:

df
   0  1  2    3    4
0  a  b  c  NaN  NaN
1  1  2  3  NaN  NaN
2  4  5  6  NaN  NaN
3  6  7  8    9  NaN
4  1  2  5  NaN  NaN
5  3  4  5         8

I am grateful for herrfz's answer to "Handling Variable Number of Columns with Pandas - Python". The present question may be a generalization of that one.
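
Calling pd.concat once per line copies the growing frame every time, which can get slow on large files. A possible variant, sketched here under the same comment and separator conventions, gathers the rows in a plain list and builds the DataFrame once at the end. The name read_raw_rows is hypothetical, and the function takes any iterable of lines (an open file handle works) so that it can be tried without a file on disk.

```python
import pandas as pd


def read_raw_rows(lines, separator=",", comment_mark="#"):
    """Hypothetical variant: gather the rows first, build the DataFrame once."""
    rows = []
    for line in lines:
        # Keep only the part before the comment mark.
        content = line.split(comment_mark)[0].strip()
        if content:
            rows.append(content.split(separator))
    # Rows of unequal length are padded with missing values.
    return pd.DataFrame(rows)


sample = [
    "a,b,c#Incomplete Header",
    "1,2,3",
    "4,5,6",
    "6,7,8,9",
    "1,2,5",
    "3,4,5,,8",
]
df = read_raw_rows(sample)
print(df.shape)
```

As in the original function, every value stays a string, so numeric columns would still need converting afterwards (for example with pd.to_numeric).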

Overriding the "default" number of fields when reading csv