Contents |
Introduction to PDI Kettle
Kettle
The Pentaho Data Integration Suite is an Open source ETL (Extract-Transform-Load) application software which enables to migrate and/or convert data and schedule these transformations without any custom code. This Data Integration Suite is a part of Pentaho BI Project for enterprise reporting, analysis, data mining, work flow and obviously ETL jobs. For more informations about the Pentaho BI Project: Pentaho BI Project
This suite is divided in 3 sub-applications:
- Spoon:
Spoon is a graphical tool dedicated to performing jobs and apply transformations to a specified data set (text file, DB data, .xml file, etc...).
- Pan:
Pan is a command line tool dedicated to executing Spoon created transformations. Scheduling transformations is available with Pan and the use of a scheduling jobs tool (ex: Cron on Unix systems).
- Kitchen:
Kitchen is the same as Pan but for jobs and not transformations. Thus, Kitchen may execute Spoon created jobs. However, Kitchen cannot perform scheduling.
Jobs and transformations
With Talend, there is no difference between jobs and transformations. They are both manipulated under the job name. In Kettle, the difference between jobs and transformations is easy to understand. A transformation is a special job like extracting from a database, reading a file, or feeding a database and writing in a file. To sum up, a transformation is dedicated to data migration. A job is a cluster of transformations and simple tasks. For example, a simple task could be creating a file or an archive, deleting a file or verifying the existence of a file. Job is more on a logical schema. You will be given an example further which over shows the difference.
Install and launch
Install Kettle
First, you have to download the Kettle application software on the Pentaho website (Download). Choose the latest stable release. In this article, we used Data Integration v3.0.2. Second, unzip the archive in the directory of your choice. Kettle has been installed. We assume that you unzip the archive under a Kettle directory. Therefore, the file system is like: <root dir>/Kettle/Kettle-version_number/
The whole file system is composed like following:
- <root dir>/Kettle/Kettle-version_number/docs
- <root dir>/Kettle/Kettle-version_number/extra
- <root dir>/Kettle/Kettle-version_number/lib
- <root dir>/Kettle/Kettle-version_number/libext
- <root dir>/Kettle/Kettle-version_number/libswt
- <root dir>/Kettle/Kettle-version_number/META-INF
- <root dir>/Kettle/Kettle-version_number/plugins
- <root dir>/Kettle/Kettle-version_number/pwd
- <root dir>/Kettle/Kettle-version_number/samples
- <root dir>/Kettle/Kettle-version_number/simple-jndi
- <root dir>/Kettle/Kettle-version_number/ui
Launch Kettle's components
There are a .sh (Unix systems) and a .bat (Windows systems) for each component of the PDI Suite. The .sh and .bat executables are at the root of Kettle-version_number.
Permissions
Before launching any component, make sure that you have the permissions to execute the program by opening a command console and typing the following command:
~/Kettle/Kettle-3.0.2$ chmod +x *.sh
Launch Spoon
Thus, you may execute the different PDI components. Now you have the required permissions, you may launch any component. To launch Spoon, simply type in in the command line console:
~/Kettle/Kettle-3.0.2$ ./spoon.sh
This will open the Spoon graphical tool. You will be shown how it works further.
Launch Pan
To launch Pan, you should have first created a spoon transformation to be able to launch it via Pan. Then, you can execute the transformation specifying the user, the password, the repository and the transformation name. In the console, type in:
~/Kettle/Kettle-3.0.2$ ./pan.sh -user=user_name -pass=password -rep=repository_name -trans=transformation_name
This will perform the transformation directly without using Spoon.
Launch Kitchen
To launch Kitchen, proceed as for Pan but not for a transformation but for a job. In the console, type in:
~/Kettle/Kettle-3.0.2$ ./kitchen.sh -user=user_name -pass=password -rep=repository_name -job=job_name
This will execute the specified job directly without opening Spoon.
Execution
Method
Creating a new repository or using an existing one
This part is dedicated to showing how the Spoon component works. Once you launched Spoon (fig.1), you will be asked to specify a new repository (or use an existing one).
fig.1
To create a new repository, click on 'New' on the right hand side of the repository field. On the repository informations window (fig.2), click on 'New connection' on the upper right hand side and fill in all the required informations concerning your database settings.
fig.2
Once the fields filled in, you can test if the connection is working OK. Then, click 'OK'. Give a name to this new repository and click 'OK' again. Finally, use the Kettle login and password (by default: admin, admin). This will open the Spoon window (fig.3).
fig.3
To directly use an existing repository, just fill in the Kettle login and password.
Creating a new job or a new transformation
To create a new job, click on 'New' -> 'Job'; a new job window will be opened and all the different tasks will be available on the left hand side (fig.4).
fig.4
To create a new transformation, click on 'New' -> 'Transformation'; a new transformation window will appear with all the different cases on the left hand side (fig.5).
fig.5
If you want to open an existing job or transformation, just click on 'Open' and select the required type (fig.6).
fig.6
You might want to import existing jobs and transformations or want to export your work. Just click on 'File' on the left top corner and select what you want to do. You can export to xml files or import from xml files (.kjb for jobs, .ktr for transformations).
Pan can launch those transformations directly. Just type in:
~/Kettle/Kettle-3.0.2$ ./pan.sh -user=user_name -pass=password -rep=repository_name -file=ktr_file_name
Kitchen can launch those jobs too but this requires to specify the job and the .kjb file. Just type in:
~/Kettle/Kettle-3.0.2$ ./kitchen.sh -user=user_name -pass=password -rep=repository_name -job=job_name -file=kjb_file_name
Note: Be aware that if you want to perform an exported job with Kitchen, you have to know the exact name of the job. Not just the name of the exported file.
Example
In order to understand the meaning of job and transformation, you are going to be shown a simple example of job using a single transformation. In this example, the purpose is to feed a text file with data coming from a MySQL stored table.
First, a new job window has to be opened as described in the above part (3.a.2). This created job is called 'JOB'. Drag and drop the start icon from the left hand side menu to the job window. Without this step, the job cannot be executed.
Then, a new text file called JOB.txt has to be created in order to write into. Drag and drop the create file icon on the main window. Just to avoid faulty jobs, a checking is added. If this file really exists, the job will run. In the other case, it will stop at this step. Drag and drop the check file existence icon as for creating a new file. To make definitely sure the connection is OK, a data table checking has to be inserted by dragging and dropping the check table existence icon to the job window. In order to edit the different steps, just double-click on the task icon on the job window and adjust the settings (fig.7).
fig.7
Thus, we just created a start step, a new text file and two checking out for the created file and the data table (fig.8).
fig.8
To make sure all the steps are run one by one, you have to link them. First, press the shift button and select an icon on the main window by clicking it. Keep shift and the mouse button pressed till the second icon. It will create a hop between those two tasks (fig.9).
fig.9
Then, it is the time to create the transformation. Click on 'New'-> 'Transformation'. Expand the extraction sub menu, drag and drop the 'extraction from a table' icon to the main window (fig.10).
fig.10
Proceed the same way with feeding. Just drag and drop the 'file feeding' icon (fig.11). Then, link the two icons as for the job.
fig.11
Moreover, edit the transformation by double-clicking the two icons and setting up the extraction and feeding (fig.12). For your convenience, you may create a SQL query while extracting.
fig.12
Now, this new transformation and the main job have to be associated. Go back on the job window. Drag and drop the transformation execution icon to the job main window. Make the hops and add a final step to finish and validate the job (successful job icon).
Finally, our example may be executed by clicking the 'execute' icon on the top of the window. It will open an execution pop up window (fig.13).
fig.13
Click on 'Start', the job and its included transformation will be executed. If the job succeeded, in the execution log window, you would be able to see the last step result was true (fig.14).
fig.14
If the job failed, this result would be false (fig.15). On this screen shot, the file creation was deleted before execution in order to fail the checking.
fig.15
In the end, it will create a new text file with the content of your SQL query. It just depends on your settings (creation directory, file name, database, selected table).
nb: When adding a new icon (task or transformation) that you want to insert between two linked tasks, just drag and drop the new icon on the hop. You may see the hop widening. Release the mouse button and you will be asked to confirm the insertion (fig.16). Click 'yes'.
fig.16
nb2: When a scheduling is required, just double-click on the start icon of your job and set up the scheduling interval (fig.17). The automation will be active as long as you do not shut down the KETTLE application. Even if you manually stopped the scheduling, it would be active.
fig.17
nb3: You may want to add notes in order to make clear all the different steps. Just right click on the main window where you want to add the note and click on 'New note' (fig.18).
fig.18
Edit the note text and click OK. You can still edit or delete the note by right-clicking the note and select the required action (fig.19).
fig.19
nb4: In order to prevent from execution problems, you would better execute a job or transformation in a step by step manner. Each time a new hop has been added, execute the job (or transformation) to check out if you can go further.
Finally, a job and a transformation ('JOB' and 'TRANSFORMATION') have been created. Thus, you may use the other tools within Pentaho Data Integration Suite to perform command line console jobs or transformations. So for this example, if you want to perform the JOB job, open a command line console and type in:
~/Kettle/Kettle-3.0.2$ ./kitchen.sh -user=user_name -pass=password -rep=repository_name -job=JOB
If you want to perform the transformation, just type in:
~/Kettle/Kettle-3.0.2$ ./pan.sh -user=user_name -pass=password -rep=repository_name -trans=TRANSFORMATION
Example: Excel file data integration
In this example, the main described part is not dedicated to giving you the step by step procedure but an overview of differences between .xls extraction and database extraction.
The purpose is to extract data from a .xls file and feed in a .xml file. Then, this .xml file will be validated by .xsd, archived and finally deleted. The main matter is to get the different fields from the .xls file and then create the .xml file according to these fields. Also, you have to create a .xsd file in order to check out the validity of your .xml file according to the settings you made.
This example is composed of one job and one transformation ('job_name' (fig.20) and 'transformation_name' (fig.21)).
fig.20
fig.21
First, the existence of the .xls file and the newly created .xml file are checked out. Then, is proceeded the transformation. Extract data from the .xls file by using 'extracting from an excel file'. When extracting, choose the target file, choose the selected sheet(s), get the fields name (fig.22) and use the preview to test the effectiveness of your settings (fig.23).
fig.22
fig.23
Then, feed in the new .xml file by using '.xml file feeding'. To edit the feeding, double-click on the '.xml feeding', choose the target file (be aware that this new file might not exist yet), edit the content (.xml elements name) and get the fields name from the .xls file (fig.24).
fig.24
Choose a new element name for each field, they will be the .xml children elements name. If you do not specify any name, the .xml file will be invalid in the end. Furthermore, you have to validate the .xml file with .xsd. Thus, you may check out the existence of this .xsd file and have to add the validation step. Of course, this step depends on your .xsd file. If this file has not been built correctly, even if your .xml file is valid, the job will not run to the end.
The last part is not necessary, it depends on your requirements but is interesting when discovering KETTLE. Indeed, it enables to see how file retrieving is processed.
An archive is finally created from the .xml file. You just have to add the 'create zip archive' step and eventually delete the former file. Be aware that only directories are subject to archiving, not single files. That is the reason you have to create a temporary directory to store the .xml file into and archive.
The execution of this job (fig.25) will create an archive with the .xml file including the content of the .xls file. Finally, execute the job and check out the result of your job and transformation.
fig.25
nb: In order to achieve complex jobs (or transformations), consider testing (step by step execution) as the best solution. Also, the execution log window will give you clues on the faulty steps.
nb2: The use of checking points is required if you want to execute complex jobs (NOT transformations) as advised in the simple example.
nb3: At first, favour the job part, test it and only once tested OK, work on the transformation part. Then, you will be aware of what might cause any problem.
Finally, a job and a transformation ('job_name' and 'transformation_name') are created. Thus, you may use the other tools within Pentaho Data Integration Suite. So for this example, if you want to perform the job_name job, open a command line console and type in:
~/Kettle/Kettle-3.0.2$ ./kitchen.sh -user=user_name -pass=password -rep=repository_name -job=job_name
If you want to perform the transformation, just type in:
~/Kettle/Kettle-3.0.2$ ./pan.sh -user=user_name -pass=password -rep=repository_name -trans=transformation_name
