Question

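I have \x02\n as the line terminator in a csv file I am trying to parse. However, pandas only accepts a single character as the line terminator, not two. For example: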

>>> data = pd.read_csv(file, sep="\x01", lineterminator="\x02")
>>> data.loc[100].tolist()
['\n1475226000146', '1464606', 'Juvenile', '1', 'http://itunes.apple.com/artist/juvenile/id1464606?uo=5', '1']

Or:

>>> data = pd.read_csv(file, sep="\x01", lineterminator="\n")
>>> data.loc[100].tolist()
['1475226000146', '1464606', 'Juvenile', '1', 'http://itunes.apple.com/artist/juvenile/id1464606?uo=5', '1\x02']

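In both cases one character of the terminator is left in the data: \n at the start of the first field, or \x02 at the end of the last one. What is the best way to read a csv file with this line terminator into pandas?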

Answer 1

As of v0.23, pandas does not support multi-character line terminators. Your code currently returns:

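import io
import pandas as pd

s = "this\x01is\x01test\x02\nthis\x01is\x01test2\x02"
# pd.compat.StringIO was removed in later pandas releases; io.StringIO is the equivalent
df = pd.read_csv(
    io.StringIO(s), sep="\x01", lineterminator="\x02", header=None)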

df
        0   1      2
0    this  is   test
1  \nthis  is  test2

(For now) your only option is to strip the leading whitespace from the first column, which you can do with str.lstrip.

df.iloc[:, 0] = df.iloc[:, 0].str.lstrip()
# Alternatively,
# df.iloc[:, 0] = [s.lstrip() for s in df.iloc[:, 0]]

df

      0   1      2
0  this  is   test
1  this  is  test2
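
The same idea covers the second case from the question, where the default newline terminator is used and \x02 is left at the end of the last field. A minimal sketch, assuming the last column is read as strings (the sample string here is made up for illustration):

```python
import io
import pandas as pd

s = "this\x01is\x01test\x02\nthis\x01is\x01test2\x02\n"
# Let pandas split rows on the default newline; \x02 stays glued to the last field.
df = pd.read_csv(io.StringIO(s), sep="\x01", header=None)

# Strip the leftover \x02 from the end of the last column.
last = df.columns[-1]
df[last] = df[last].str.rstrip("\x02")
```

Which variant to prefer probably depends on whether fields can contain embedded newlines; if they can, splitting on \x02 and stripping the leading \n is the safer of the two.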

If you have to strip several other kinds of line terminators (not just the newline), you can pass a string containing all of them:

line_terminators = ['\n', ...]
df.iloc[:, 0] = df.iloc[:, 0].str.lstrip(''.join(line_terminators))
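
A different route, not part of the answer above, is to normalize the terminator before pandas parses the text. The sketch below assumes the input fits in memory and that the sequence \x02\n never occurs inside a field. The helper name read_two_char_terminated is hypothetical and chosen only for illustration.

```python
import io
import pandas as pd


def read_two_char_terminated(text, sep="\x01", terminator="\x02\n"):
    """Hypothetical helper: parse text whose rows end in a two-character terminator."""
    # Collapse the terminator to its first character so that pandas'
    # single-character lineterminator option can handle it.
    collapsed = text.replace(terminator, terminator[0])
    return pd.read_csv(
        io.StringIO(collapsed), sep=sep,
        lineterminator=terminator[0], header=None)


s = "this\x01is\x01test\x02\nthis\x01is\x01test2\x02\n"
df = read_two_char_terminated(s)
```

For a file on disk, the text could first be read with something like open(path, newline="").read() and then passed to the helper. This costs one extra pass over the data, but no column needs cleaning afterwards.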
