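I have a CSV file that I am trying to load into pandas. It has three pipe-delimited columns: the first two are integers and the third is a string. The data is irregular, so some strings begin with a space and others do not. I have to preserve that leading whitespace for a later processing step, but pandas appears to strip it. Any help would be greatly appreciated!
Sample data: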
1|2|Dogs are better than cats!
1|4| Cats are superior to dogs.
2|3|Birds Rule. More than you think! #birdsrule
2|10|Birds birds birds
I have tried the read_csv function and also built my own parser, both to no avail. Here are my attempts:
read_csv:
my_df=pd.read_csv("foo.txt", sep="|", dtype=str, names=['num1','num2','some_Text'], encoding = 'utf8', skipinitialspace=False)
My own parser:
import pandas as pd
my_df = []
with open("foo.txt", "r") as data:
    for row in data:
        num1, num2, some_text = row.split("|")
        # Strip only the trailing newline so any leading spaces survive.
        some_text = some_text.strip("\n")
        my_df.append(
            pd.DataFrame({
                "num1": [num1],
                "num2": [num2],
                "some_text": [some_text]
            })
        )
my_df = pd.concat(my_df)
Answer 0: (score: 2)
Your code should work as is.
In [17]: df = pd.read_csv("foo.txt", sep="|", dtype=str, names=['num1','num2','some_Text'], encoding = 'utf8', skipinitialspace=False)
In [18]: df
Out[18]:
   num1  num2                                    some_Text
0     1     2                   Dogs are better than cats!
1     1     4                   Cats are superior to dogs.
2     2     3  Birds Rule. More than you think! #birdsrule
3     2    10                            Birds birds birds
In [19]: df.values
Out[19]:
array([['1', '2', 'Dogs are better than cats!'],
       ['1', '4', ' Cats are superior to dogs.'],
       ['2', '3', 'Birds Rule. More than you think! #birdsrule'],
       ['2', '10', 'Birds birds birds']], dtype=object)
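Note that the space before Cats is preserved, although you could be fooled into thinking otherwise because the string column is right-aligned in the DataFrame display.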
In [24]: df["some_Text"][1]
Out[24]: ' Cats are superior to dogs.'
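If you would rather not rely on the printed table at all, here is a minimal sketch of two ways to check for the leading space directly. It assumes the sample rows from the question; the StringIO buffer stands in for foo.txt, and the name has_leading_space is only illustrative.

```python
import pandas as pd
from io import StringIO

# Sample rows from the question; StringIO stands in for foo.txt (an assumption).
txt = (
    "1|2|Dogs are better than cats!\n"
    "1|4| Cats are superior to dogs.\n"
    "2|3|Birds Rule. More than you think! #birdsrule\n"
    "2|10|Birds birds birds\n"
)

df = pd.read_csv(StringIO(txt), sep="|", dtype=str,
                 names=["num1", "num2", "some_Text"])

# Boolean mask: True for rows whose text begins with a space.
has_leading_space = df["some_Text"].str.startswith(" ")
print(has_leading_space.tolist())

# repr() prints the surrounding quotes, so leading whitespace is visible.
for value in df["some_Text"]:
    print(repr(value))
```

Either check sidesteps the right-aligned display, which is what makes the space hard to spot.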
The read should also work with a simpler call, i.e. pd.read_csv("foo.txt", sep="|", names=['num1','num2','some_Text']), which handles the types appropriately (making num1 and num2 integers).
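As a rough sketch of that simpler call, assuming the same sample rows and using a StringIO buffer in place of foo.txt, you can let pandas infer the types and then inspect them. The name typed_df is just for illustration.

```python
import pandas as pd
from io import StringIO

# Sample rows from the question; StringIO stands in for foo.txt (an assumption).
txt = (
    "1|2|Dogs are better than cats!\n"
    "1|4| Cats are superior to dogs.\n"
    "2|3|Birds Rule. More than you think! #birdsrule\n"
    "2|10|Birds birds birds\n"
)

# No dtype=str here, so pandas infers a type for each column.
typed_df = pd.read_csv(StringIO(txt), sep="|",
                       names=["num1", "num2", "some_Text"])

# num1 and num2 should come back as integer columns.
print(typed_df.dtypes)

# The text column still keeps its leading space.
print(repr(typed_df["some_Text"][1]))
```

Dropping dtype=str only changes how the numeric columns are typed; the text column is read the same way.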
Answer 1: (score: 2)
I'll show that it works too... what's more, skipinitialspace=False is the default.
import pandas as pd
from io import StringIO
txt = """1|2|Dogs are better than cats!
1|4| Cats are superior to dogs.
2|3|Birds Rule. More than you think! #birdsrule
2|10|Birds birds birds
"""
df = pd.read_csv(StringIO(txt), sep='|', header=None)
# get first character of third column
df.iloc[:, 2].str[0]
0 D
1
2 B
3 B
Name: 2, dtype: object
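For contrast, here is a small sketch of what happens when skipinitialspace=True is passed explicitly, which is the setting that would drop the space. It reuses the same sample text; the names kept and stripped are only illustrative.

```python
import pandas as pd
from io import StringIO

# Same sample rows as above.
txt = """1|2|Dogs are better than cats!
1|4| Cats are superior to dogs.
2|3|Birds Rule. More than you think! #birdsrule
2|10|Birds birds birds
"""

# Default behaviour: spaces after the delimiter are kept.
kept = pd.read_csv(StringIO(txt), sep="|", header=None)

# Explicit opt-in: spaces right after the delimiter are skipped.
stripped = pd.read_csv(StringIO(txt), sep="|", header=None,
                       skipinitialspace=True)

print(repr(kept.iloc[1, 2]))      # ' Cats are superior to dogs.'
print(repr(stripped.iloc[1, 2]))  # 'Cats are superior to dogs.'
```

If your real data loses the space, it may be worth checking whether something upstream sets skipinitialspace=True or strips the strings after loading.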