Rules of thumb to split components
It can be difficult to decide how to split your pipeline into components. As a general rule, each component should be a self-contained unit of business logic with minimal inputs and outputs.
The components in vertex/pipelines/my_first_pipeline.py are very small. In practice, a single component would be better for such a simple pipeline, but the goal here is to illustrate how to pass parameters within a pipeline.
When to split a component
A big monolithic pipeline made of a single component is not ideal. Consider splitting a component in the following cases.
To iterate faster: if you find yourself waiting for a lot of code to execute before the execution flow reaches your actual changes, that is probably a good reason to split your component. For example, if you changed some model training parameters and have to wait 10 minutes for a data preprocessing step before your model is retrained, split the two steps and save the training dataset as an artifact.
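The idea of persisting the preprocessed dataset can be sketched in plain Python, without any pipeline SDK. This is only an illustration under stated assumptions: the functions `preprocess` and `train`, the JSON file standing in for an artifact, and the toy data are all hypothetical, not part of this project or of any library API.

```python
import json
import tempfile
from pathlib import Path


def preprocess(raw_rows: list[dict], artifact_path: Path) -> None:
    """Slow step (hypothetical): clean the raw rows and persist the result."""
    cleaned = [row for row in raw_rows if row.get("label") is not None]
    artifact_path.write_text(json.dumps(cleaned))


def train(artifact_path: Path, learning_rate: float) -> dict:
    """Step you iterate on: reads the saved dataset instead of recomputing it."""
    rows = json.loads(artifact_path.read_text())
    # Placeholder "model": a real component would fit an actual model here.
    return {"n_samples": len(rows), "learning_rate": learning_rate}


# A temporary file stands in for the artifact passed between two components.
artifact = Path(tempfile.mkdtemp()) / "training_data.json"
raw = [{"x": 1, "label": 0}, {"x": 2, "label": None}, {"x": 3, "label": 1}]

preprocess(raw, artifact)        # run once
model_a = train(artifact, 0.1)   # re-run freely with new parameters
model_b = train(artifact, 0.01)  # preprocessing is not repeated
```

In a real pipeline the orchestrator would handle the artifact storage, and the preprocessing component would only need to run again when its own code or inputs change.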
To leverage orchestration: parallelization is fairly easy to achieve in pipelines. If some processing can easily be split, don't hesitate to use this to your advantage for a faster pipeline. For example, if you are searching for the right hyper-parameters for a training, you could have one component per hyper-parameter set:
Splitting a grid search between components
@kfp.dsl.pipeline(name="find-best-hyper-parameters")
def pipeline(project_id: str, input_table: str):
    load_data_task = load_training_data(
        project_id=project_id, gcp_region="europe-west1", input_table=input_table
    )
    hyper_parameters_to_test = range(10)
    grid_search_results = []
    # Each iteration adds an independent task, so the orchestrator can run them in parallel.
    for hyper_parameter in hyper_parameters_to_test:
        result = train_and_evaluate(hyper_parameter, load_data_task.outputs["training_data"])
        grid_search_results.append(str(result.outputs["result"]))
    # Fan-in: pick the best model among all the collected results.
    save_best_model(grid_search_results)
In the UI, the pipeline graph shows one train_and_evaluate task per hyper-parameter set.
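The example above uses `range(10)` as a stand-in for the hyper-parameter sets. As a minimal sketch of how a real grid could be generated before looping over it, the snippet below builds one dictionary per combination with the standard library. The `search_space` values and the `build_grid` helper are hypothetical and only serve the illustration.

```python
from itertools import product

# Hypothetical search space; replace with your model's real parameters.
search_space = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5, 7],
}


def build_grid(space: dict) -> list[dict]:
    """Return one dict per combination of hyper-parameter values."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


hyper_parameter_sets = build_grid(search_space)
print(len(hyper_parameter_sets))  # 6 combinations (2 learning rates x 3 depths)
```

Each element of `hyper_parameter_sets` would then be handed to its own training task in the pipeline loop. How the values are passed depends on the signature of your component, so that part is left out here.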
When to merge two components
On the other hand, micro-components that barely do anything are not ideal either. You will run into performance issues because every component has an initialization overhead, and you will have to deal with a lot of tedious artifact management to pass data around your pipeline.
If you find yourself changing 4+ components to add a single feature, you should probably merge some of them.
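To get a feeling for the initialization overhead, here is a rough back-of-the-envelope sketch. It assumes components run one after another, and the figures (60 seconds of start-up per component, 300 seconds of actual work in total) are made up for illustration; measure your own pipeline to get real numbers.

```python
def overhead_share(n_components: int, startup_seconds: float, total_work_seconds: float) -> float:
    """Fraction of a sequential pipeline's run time spent on component start-up."""
    overhead = n_components * startup_seconds
    return overhead / (overhead + total_work_seconds)


# Assumed figures: 60 s start-up per component, 300 s of actual work overall.
for n in (1, 5, 10):
    print(n, round(overhead_share(n, 60, 300), 2))
```

With these assumed figures, start-up accounts for about 17% of the run time with one component, 50% with five, and about 67% with ten, which is why very small components can end up costing more than they bring.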