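I am trying to read a .txt file into a pandas DataFrame, but I get multiple errors and the data does not load. The problem I found is related to the structure of the data.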
file.txt:
"Mark","Company","Country","Value","1","abcdef","ecu","1000","","","usa","30","","","col","200"....
file.txt displays the following information:
Mark  Company  Country  Value  ...
1     abcdef   ecu      1000   ...
               usa      30     ...
               col      200    ...
2     ghijk    jap      10     ...
               eur      900    ...
               lki             ...
3     lmnop    wer      21     ...
               uye             ...
               urg      123    ...
.     .        .        .      .
.     .        .        .      .
What I need is a DataFrame with a structure similar to this:
Mark Company Country Value ...
1 abcdef ecu 1000 ...
1 abcdef usa 30 ...
1 abcdef col 200 ...
2 ghijk jap 10 ...
2 ghijk eur 900 ...
2 ghijk lki 0 ...
3 lmnop wer 21 ...
3 lmnop uye 0 ...
3 lmnop urg 123 ...
. . . . .
. . . . .
Answer 0 (score: 1)
UPDATE:
import pandas as pd

df = pd.read_csv(fn,
                 encoding='utf-16',
                 na_values=['NA','NaN','nan','n.a.'],
                 low_memory=False)
# list here ALL columns that must be filled, using `ffill()` method:
cols = ['Mark','Company name','Cons. code','City']
df[cols] = df[cols].ffill()
# assuming that we have `ffilled` all required columns, we can simply `fillna(0)` for the rest of the columns
df = df.fillna(0)
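As a self-contained illustration of the same idea, here is a minimal sketch that uses an in-memory sample instead of the real file. The sample text is an assumption modelled on the quoted CSV line in the question (one record per line, `Mark` and `Company` as the columns to forward-fill); the real file's encoding and column names may differ.

```python
import io

import pandas as pd

# Hypothetical sample modelled on the question's file.txt;
# one record per line is an assumption.
raw = '''"Mark","Company","Country","Value"
"1","abcdef","ecu","1000"
"","","usa","30"
"","","col","200"
"2","ghijk","jap","10"
"","","eur","900"
"","","lki",""
'''

# Empty fields are read as NaN by default.
df = pd.read_csv(io.StringIO(raw))

# Forward-fill only the columns that identify a group.
key_cols = ["Mark", "Company"]
df[key_cols] = df[key_cols].ffill()

# The remaining gaps are genuinely missing values, so replace them with 0.
df["Value"] = df["Value"].fillna(0).astype(int)
df["Mark"] = df["Mark"].astype(int)

print(df)
```

Restricting `ffill()` to the key columns matters: forward-filling `Value` as well would copy 900 into the `lki` row instead of leaving it as 0.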
OLD answer:
Your file looks like a fixed-width file, so try using pd.read_fwf together with DataFrame.ffill().
Assume we have the following TXT file:
Mark Company Country Value1 Value2
1    abcdef  ecu     1000
             usa     30     10
             col     200    20
2    ghijk   jap     10
             eur     900    30
             lki            40
3    lmnop   wer     21
             uye            50
             urg     123
Solution:
In [102]: fn = r'D:\temp\.data\002.txt'
In [103]: df = pd.read_fwf(fn)
In [123]: df.loc[:, df.columns.str.contains(r'^Value')] = df.filter(regex=r'^Value').fillna(0)
In [124]: df = df.ffill()
In [125]: df
Out[125]:
   Mark Company Country  Value1  Value2
0   1.0  abcdef     ecu  1000.0     0.0
1   1.0  abcdef     usa    30.0    10.0
2   1.0  abcdef     col   200.0    20.0
3   2.0   ghijk     jap    10.0     0.0
4   2.0   ghijk     eur   900.0    30.0
5   2.0   ghijk     lki     0.0    40.0
6   3.0   lmnop     wer    21.0     0.0
7   3.0   lmnop     uye     0.0    50.0
8   3.0   lmnop     urg   123.0     0.0
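For anyone who wants to try this without a file on disk, the following is a rough sketch of the same two steps on an in-memory fixed-width sample. The column widths and the helper names `rows` and `widths` are assumptions made for illustration; the sample is built programmatically so that the alignment is exact.

```python
import io

import pandas as pd

# Hypothetical rows mirroring the first two groups of the sample file.
rows = [
    ("Mark", "Company", "Country", "Value1", "Value2"),
    ("1", "abcdef", "ecu", "1000", ""),
    ("", "", "usa", "30", "10"),
    ("", "", "col", "200", "20"),
    ("2", "ghijk", "jap", "10", ""),
    ("", "", "eur", "900", "30"),
    ("", "", "lki", "", "40"),
]
widths = (5, 8, 8, 7, 7)  # assumed column widths, padded with spaces

# Build the fixed-width text: every cell is left-justified to its column width.
text = "\n".join(
    "".join(cell.ljust(w) for cell, w in zip(row, widths)) for row in rows
)

# read_fwf infers the column boundaries from the whitespace layout.
df = pd.read_fwf(io.StringIO(text))

# Step 1: zero-fill the Value* columns first ...
value_cols = df.columns[df.columns.str.startswith("Value")]
df[value_cols] = df[value_cols].fillna(0)

# Step 2: ... then forward-fill what is left (Mark and Company).
df = df.ffill()

print(df)
```

The order of the two steps is the point of the sketch: if `ffill()` ran first, the empty `Value1` cell in the `lki` row would inherit 900 from the row above instead of becoming 0.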