Question

我的excel文件中包含以下格式的年度数据：

Country \ Years   1980   1981   ...   2010
Abkhazia           234    334   ...    456
Afghanistan        466    789   ...    732
...

这是图片

我希望我的数据转换为3个不同的表并将其加载到postgres数据库。

表应该看起来像那样

第一张表 - 国家：

id | name
1  | Abkhazia
2  | Afghanistan

第二个表日期：

id | date
1  | 1980
2  | 1981

第三个是根据国家和日期存储所有数据的表格：

country_id    date_id   data
         1          1    234
         1          2    334
         2          1    466
         2          2    789
       ...        ...    ...

任何想法如何实现我的目标？

Answer 1

假设源excel结构如下所示（我已自定义构建）：

你的问题基本上有三个部分。为了更好地理解，我将转变分解为一部分：

<强> 1。载入表 - 国家

根据excel中给出的数据，这非常简单。只需要

Excel Input >> Add a sequence step. Give the Sequence name as Country ID >> Select only the Country Name and Country ID >> Load into the Country Table using Table Output。

<强> 2。装货表 - 年份：

这里的想法是以行方式显示年份ID，而不是给出excel源数据的列。 PDI版本5及更高版本为您提供了一个非常有用的步骤Metadata Structure。此步骤允许您获取表的结构。在这种情况下，我们需要拉出年份列，忽略国家/地区列。

按照以下步骤操作：

Read the Excel Data >> Get the Metadata structure of your source >> Filter Out the Country Column (which is available in row at position=1) >> Add a Sequence Number. Name it YearID >> Finally Load the Year Table.

第3。加载决赛桌 - 国家和年份以及数据：

在PDI中将所有列数据值显示为行级别的方法是使用Row Normalizer步骤。使用此步骤显示标准化输出。现在按照以下步骤操作：

Read the Excel source data >> use Row Normalizer Step to normalize the rows based on the Years >> Do a Stream Lookup with the Above Country and Year tables to fetch the CountryID and YearID respectively >> Finally Load the necessary column data into Table Output

希望有所帮助：）

我已将代码放在github repo中以及我使用的数据文件中。它的here。

另外，只是意识到我根据你的问题给出了错误的命名约定。将date_id视为YearID而不是id我已经给出了countryid和yearid。

从excel文件转换Pentaho勺子

1 个答案: