Optimization#
The objective of optimization is to remove as many tasks from the graph as possible, as efficiently as possible. This helps reduce costs and deliver results as quickly as possible. For example, if a push only modifies a single test, then ideally we only run the tasks necessary to run that test.
A task is said to be “optimized” when it is either replaced with an equivalent, already existing task, or dropped from the graph entirely.
Optimization Strategies#
Each task has a single named optimization strategy, and can provide an argument
to that strategy. Each strategy is defined as an
OptimizationStrategy
.
Each task has a task.optimization
property describing the optimization
strategy that applies, specified as a dictionary mapping strategy to argument. For
example:
task.optimization = {'skip-unless-changed': ['js/**', 'tests/**']}
Strategy implementations are shared across all tasks, so they may cache commonly-used information as instance variables.
See Optimization Strategies for an overview of the strategies that are included with Taskgraph.
Optimization Process#
Optimization proceeds in three phases: removing tasks, replacing tasks, and finally generating a subgraph containing only the remaining tasks.
Assume the following task graph as context for these examples:
Removing Tasks#
This phase begins with tasks on which nothing depends (leaf nodes) and follows
the dependency graph backward from there – right to left in the diagram above.
If a task is not removed, then nothing it depends on will be removed either.
Thus if G
and H
are both removed, C
and D
may be removed as well.
We then evaluate the new leaf nodes, so if D
is removed, then A
and B
may be removed as well. And so on until there are no tasks left to evaluate.
If G
is removed but H
is not removed, then the entire graph will be kept
as all tasks (other than G
) are in H
’s dependency chain.
For each task with no remaining dependencies, the decision whether to remove is
made by calling the optimization strategy’s should_remove_task
method. If
this method returns True, the task is removed.
The optimization process takes a do_not_optimize
argument containing a list
of tasks that cannot be removed under any circumstances. This can be used to
“force” running specific tasks in various contexts.
Replacing Tasks#
This phase begins with tasks without dependencies (other than the decision
task) and follows the reversed dependency graph from there – left to right in
the diagram above. If a task is not replaced, then anything depending on that
task cannot be replaced. Replacement is generally done on the basis of some
hash of the inputs to the task. In the diagram above, if both A
and B
are
replaced with existing tasks, then D
is a candidate for replacement. But if
B
has no replacement, then replacement of D
will not be considered.
It is possible to replace a task with nothing. This is similar to optimizing
away, but is useful for utility tasks (possibly like G
). If such a task is
considered for replacement, then all of its dependencies (here, D
) have
already been replaced and there is no utility in running the task and no need
for a replacement task. It is an error for a task on which others depend to be
replaced with nothing.
The do_not_optimize
set applies to task replacement, as does an additional
existing_tasks
dictionary which allows the caller to supply a set of
known, pre-existing tasks. This is used for action tasks, for example, where it
contains the entire task-graph generated by the original decision task.
Subgraph Generation#
The first two phases annotate each task in the existing taskgraph with their
fate: removed, replaced, or kept. The tasks that are replaced also have a
replacement taskId
.
The last phase constructs a subgraph containing the retained tasks, and
simultaneously rewrites all dependencies to refer to taskIds
instead of labels.
To do so, it assigns a taskId
to each retained task and uses the replacement
taskId for all replaced tasks.
The soft-dependencies are then solved for each task, by adding all the remaining tasks in the subgraph from that list to its dependencies.
The result is an optimized taskgraph with tasks named by taskId
instead of
label. At this phase, the edges in the task graph diverge from the
task.dependencies
attributes, as the latter may contain dependencies
outside of the taskgraph (for replacement tasks).
As a side-effect, this phase also expands all {"task-reference": ".."}
and
{"artifact-reference": ".."}
objects within the task definitions.