Question

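I am trying to read a .txt file as a pandas DataFrame, but I get multiple errors and the data does not load. The problem I found is related to the structure of the data.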

file.txt:
    "Mark","Company","Country","Value","1","abcdef","ecu","1000","","","usa","30","","","col","200"....

file.txt displays the following information:

Mark          Company     Country     Value   ...
   1           abcdef         ecu      1000   ...
                              usa        30   ...
                              col       200   ...
   2           ghijk          jap        10   ...
                              eur       900   ...
                              lki             ...
   3           lmnop          wer        21   ...
                              uye             ...
                              urg       123   ...
   .               .            .         .     .     
   .               .            .         .     .

What I need is a DataFrame with a structure like this:

Mark          Company     Country     Value   ...
   1           abcdef         ecu      1000   ...
   1           abcdef         usa        30   ...
   1           abcdef         col       200   ...
   2           ghijk          jap        10   ...
   2           ghijk          eur       900   ...
   2           ghijk          lki         0   ...
   3           lmnop          wer        21   ...
   3           lmnop          uye         0   ...
   3           lmnop          urg       123   ...
   .               .            .         .     .     
   .               .            .         .     .

Answer 1

UPDATE:

import pandas as pd

# fn is the path to the input file
df = pd.read_csv(fn,
                 encoding='utf-16',
                 na_values=['NA','NaN','nan','n.a.'],
                 low_memory=False)

# list here ALL columns that must be filled, using `ffill()` method:
cols = ['Mark','Company name','Cons. code','City']
df[cols] = df[cols].ffill()

# assuming that we have `ffilled` all required columns, we can simply `fillna(0)` for the rest of the columns
df = df.fillna(0)
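
To try this approach without the original file, here is a minimal sketch that parses an in-memory sample. It assumes the raw file holds one quoted, comma-separated record per line (the excerpt in the question is shown on a single line, so this layout is a guess) and that empty fields mark the missing cells. The sample text and the names `raw`, `key_cols` and `value_cols` are illustrative and not part of the original post. It fills only the value columns with 0 rather than the whole frame, which is a slightly narrower variant of the snippet above.

```python
import io

import pandas as pd

# Hypothetical sample, assuming one record per line in the raw file.
raw = '''"Mark","Company","Country","Value"
"1","abcdef","ecu","1000"
"","","usa","30"
"","","col","200"
"2","ghijk","jap","10"
"","","eur","900"
"","","lki",""
'''

# Empty fields are read as NaN by default.
df = pd.read_csv(io.StringIO(raw))

# Forward-fill only the key columns whose value repeats within a group.
key_cols = ["Mark", "Company"]
df[key_cols] = df[key_cols].ffill()

# Remaining gaps in the value columns are treated as 0.
value_cols = ["Value"]
df[value_cols] = df[value_cols].fillna(0)

print(df)
```

Listing the key columns explicitly keeps a genuinely missing measurement from being overwritten with the value of the row above it.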

OLD answer:

Your file looks like a fixed-width file, so try using pd.read_fwf in combination with DataFrame.ffill().

Assume we have the following TXT file:

Mark          Company     Country     Value1  Value2 
   1           abcdef         ecu      1000      
                              usa        30      10  
                              col       200      20  
   2           ghijk          jap        10        
                              eur       900      30  
                              lki                40    
   3           lmnop          wer        21        
                              uye               50     
                              urg       123

Solution:

In [102]: fn = r'D:\temp\.data\002.txt'

In [103]: df = pd.read_fwf(fn)

In [123]: df.loc[:, df.columns.str.contains(r'^Value')] = df.filter(regex=r'^Value').fillna(0)

In [124]: df = df.ffill()

In [125]: df
Out[125]:
   Mark Company Country  Value1  Value2
0   1.0  abcdef     ecu  1000.0     0.0
1   1.0  abcdef     usa    30.0    10.0
2   1.0  abcdef     col   200.0    20.0
3   2.0   ghijk     jap    10.0     0.0
4   2.0   ghijk     eur   900.0    30.0
5   2.0   ghijk     lki     0.0    40.0
6   3.0   lmnop     wer    21.0     0.0
7   3.0   lmnop     uye     0.0    50.0
8   3.0   lmnop     urg   123.0     0.0
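
Note that Mark and the Value columns come back as floats, because a column that held NaN cannot keep an integer dtype under pandas' default dtypes. If integers are wanted, as in the desired output of the question, they can be cast back once the gaps are filled. A minimal sketch, using a small hand-built frame as a stand-in for what read_fwf returns (the literal data and the names `value_cols` and `int_cols` are illustrative):

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the frame parsed from the sample file.
df = pd.DataFrame({
    "Mark": [1, np.nan, np.nan, 2, np.nan],
    "Company": ["abcdef", None, None, "ghijk", None],
    "Country": ["ecu", "usa", "col", "jap", "eur"],
    "Value1": [1000, 30, 200, 10, 900],
    "Value2": [np.nan, 10, 20, np.nan, 30],
})

# Same two steps as above: zero-fill the value columns, then forward-fill.
value_cols = list(df.filter(regex=r"^Value").columns)
df[value_cols] = df[value_cols].fillna(0)
df = df.ffill()

# With no NaN left, the numeric columns can safely be cast to int.
int_cols = ["Mark"] + value_cols
df = df.astype({col: int for col in int_cols})

print(df.dtypes)
```

The cast has to come after the filling steps, since converting a column that still contains NaN to int raises an error.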

Reading a .txt file with missing information as a pandas DataFrame
