Question

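I have a CSV file that I'm trying to load into pandas. It has three columns separated by pipes. The first two columns are integers and the third is a string. The data is irregular: some strings start with a space and some don't. I have to preserve that leading whitespace for a later processing step, but pandas seems to strip it. Any help would be greatly appreciated!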

Sample data:

1|2|Dogs are better than cats!
1|4| Cats are superior to dogs.    
2|3|Birds Rule. More than you think! #birdsrule
2|10|Birds birds birds

I have tried the read_csv function and also built my own parser, both to no avail. Here are my attempts:

read_csv:

my_df=pd.read_csv("foo.txt", sep="|", dtype=str, names=['num1','num2','some_Text'], encoding = 'utf8', skipinitialspace=False)

My own parser:

my_df = []

with open("foo.txt", "r") as data:
    for row in data:
        num1, num2, some_text = row.split("|")
        some_text = some_text.strip("\n")
        my_df.append(
            pd.DataFrame({
                "num1": [num1],
                "num2": [num2],
                "some_text": [some_text]
            })
        )
my_df = pd.concat(my_df)

Answer 1

Your code should work fine as written.

In [17]: df = pd.read_csv("foo.txt", sep="|", dtype=str, names=['num1','num2','some_Text'], encoding = 'utf8', skipinitialspace=False)

In [18]: df
Out[18]: 
  num1 num2                                    some_Text
0    1    2                   Dogs are better than cats!
1    1    4                   Cats are superior to dogs.
2    2    3  Birds Rule. More than you think! #birdsrule
3    2   10                            Birds birds birds

In [19]: df.values
Out[19]: 
array([['1', '2', 'Dogs are better than cats!'],
       ['1', '4', ' Cats are superior to dogs.'],
       ['2', '3', 'Birds Rule. More than you think! #birdsrule'],
       ['2', '10', 'Birds birds birds']], dtype=object)

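Note that the space before Cats is preserved. The printed DataFrame can fool you into thinking otherwise, because string columns are right-aligned in the display, which hides leading whitespace.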

In [24]: df["some_Text"][1]
Out[24]: ' Cats are superior to dogs.'
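
Since the right-aligned display hides leading spaces, one way to check a whole column at once is to look at the repr of each value, or to count the leading spaces directly. This is only a minimal sketch: it assumes two of the sample rows are inlined through io.StringIO instead of being read from foo.txt, and the names quoted and leading are made up for illustration.

```python
import pandas as pd
from io import StringIO

# Two sample rows from the question, inlined so the sketch does not need foo.txt.
txt = "1|2|Dogs are better than cats!\n1|4| Cats are superior to dogs.\n"

df = pd.read_csv(StringIO(txt), sep="|", dtype=str,
                 names=["num1", "num2", "some_Text"])

# repr() wraps each value in quotes, so a leading space becomes visible.
quoted = df["some_Text"].map(repr)
print(quoted.tolist())

# Number of leading whitespace characters: full length minus length after lstrip().
leading = df["some_Text"].str.len() - df["some_Text"].str.lstrip().str.len()
print(leading.tolist())
```

A nonzero entry in leading marks a row whose string starts with whitespace.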

A simpler call should also work, and it handles the types appropriately (num1 and num2 become integers): pd.read_csv("foo.txt", sep="|", names=['num1','num2','some_Text']).
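
To see what the simpler call does to the column types, here is a small sketch. It assumes a few of the sample rows are inlined through io.StringIO in place of foo.txt; everything else follows the call above.

```python
import pandas as pd
from io import StringIO

# A few sample rows from the question, inlined in place of foo.txt.
txt = """1|2|Dogs are better than cats!
1|4| Cats are superior to dogs.
2|10|Birds birds birds
"""

# No dtype=str here, so pandas infers a type for each column.
df = pd.read_csv(StringIO(txt), sep="|", names=["num1", "num2", "some_Text"])

# Inspect the inferred column types.
print(df.dtypes)

# The first two columns are numeric, so arithmetic works directly.
print((df["num1"] + df["num2"]).tolist())

# The text column still keeps its leading space.
print(repr(df["some_Text"][1]))
```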

Answer 2

Here I show that it works. Furthermore, skipinitialspace=False is the default, so you don't need to pass it.

import pandas as pd
from io import StringIO

txt = """1|2|Dogs are better than cats!
1|4| Cats are superior to dogs.    
2|3|Birds Rule. More than you think! #birdsrule
2|10|Birds birds birds
"""

df = pd.read_csv(StringIO(txt), sep='|', header=None)

# get first character of third column
df.iloc[:, 2].str[0]

0    D
1     
2    B
3    B
Name: 2, dtype: object
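
For contrast, the sketch below reads the same text twice, once with the defaults and once with skipinitialspace=True, which tells the parser to skip spaces after the delimiter. It assumes two inlined sample rows, and the names kept and stripped are illustrative only.

```python
import pandas as pd
from io import StringIO

# Two sample rows; the first has a leading space in the third field.
txt = "1|4| Cats are superior to dogs.\n2|10|Birds birds birds\n"

# Default behaviour (skipinitialspace=False): the leading space is kept.
kept = pd.read_csv(StringIO(txt), sep="|", header=None)

# skipinitialspace=True: spaces right after the delimiter are dropped.
stripped = pd.read_csv(StringIO(txt), sep="|", header=None,
                       skipinitialspace=True)

print(repr(kept.iloc[0, 2]))
print(repr(stripped.iloc[0, 2]))
```

If the leading space is missing in your own DataFrame, it is worth checking whether skipinitialspace=True, or a later .str.strip() call, is applied somewhere in the pipeline.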

Preserving leading whitespace when reading strings into a pandas DataFrame?