ROX-33431: Bump Konflux tasks memory to avoid OOMs #20655
🚀 Build Images Ready
Images are ready for commit 5b69e0d. To use with deploy scripts: `export MAIN_IMAGE_TAG=4.11.x-1104-g5b69e0dc85`
4Gi is confirmed insufficient: OOMKills were observed on our own PR (roxctl-on-push-9thcq, main-on-push-847dt) with the 4Gi override active. Also confirmed by msugakov on PR #20655.
The profiler captured peaks up to 3938Mi, but the real spikes exceed 4Gi faster than the 10s sampling interval can capture (the pod gets OOMKilled before the next sample).
6Gi gives 2Gi headroom above the highest observed spike and matches the sast-snyk-check step's request (avoids Tekton resource conflicts).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
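The sampling argument above can be illustrated with a small sketch. The per-second trace below is hypothetical (only the 3938Mi sampled peak, the 10s interval, and the 4Gi and 6Gi figures come from the commit message); it shows how a spike that fits between two samples never appears in the profile, and how much room a 6Gi limit leaves above the highest sampled value.

```python
def sampled_peak(usage_by_second, interval):
    """Highest value a profiler sees when it samples every `interval` seconds."""
    return max(usage_by_second[::interval])


MI_PER_GI = 1024

# Hypothetical per-second memory usage in Mi (30 seconds). The 4300Mi spike
# lasts 3 seconds and falls between the samples taken at t=10s and t=20s.
trace = [3000] * 10 + [3938] + [3500] * 5 + [4300] * 3 + [3200] * 11

seen = sampled_peak(trace, 10)      # what a 10s profiler records
actual = max(trace)                 # what the kernel OOM killer reacts to
headroom = 6 * MI_PER_GI - seen     # room left by a 6Gi limit, in Mi

print(seen, actual, headroom)       # 3938 4300 2206
print(actual > 4 * MI_PER_GI)       # True: the unseen spike exceeds 4Gi
```

In this made-up trace the profiler reports 3938Mi while the process briefly needs more than 4Gi, which is the situation the commit message describes.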
The former CPU setting isn't needed because one is now set on the task definition: https://github.com/konflux-ci/build-definitions/blob/fc137d3bfba7b1670619dace7e320a017baad2ab/task/prefetch-dependencies-oci-ta/0.3/prefetch-dependencies-oci-ta.yaml#L225-L231
The default is 256Mi: https://github.com/konflux-ci/tekton-tools/blob/6b95ba8e7d381ed2b31a779fc07586c0f12f64b5/tasks/rpms-signature-scan/0.2/rpms-signature-scan.yaml#L52-L58 I also confirmed this by checking on the cluster.
There were only 14 instances of `create-trusted-artifact` failures, and I think the missing CPU limit is more to blame there than the low memory limit. I bump the memory limit as well, but only slightly. Ref https://github.com/konflux-ci/build-definitions/blob/fc137d3bfba7b1670619dace7e320a017baad2ab/task/prefetch-dependencies-oci-ta/0.3/prefetch-dependencies-oci-ta.yaml#L243-L248
379932a to 9dbaebe
The default values aren't set in the definition, they get computed
based on Tekton's logic by dividing the Task resources between all
containers.
I see this on the cluster:
```
name: step-use-trusted-artifact
resources:
  limits:
    memory: 2Gi
  requests:
    cpu: 33m
    memory: "89478485"
```
It probably would have been sufficient to set limits == requests == 2Gi,
but I am making it a bit higher just in case.
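The odd-looking request values in the cluster snippet above are consistent with an even split. As a minimal sketch, assuming a 256Mi memory default and a 100m CPU default shared by three containers (these totals and the container count are assumptions chosen because they reproduce the observed numbers, not values read from Tekton's source), integer division gives exactly the figures seen on the cluster:

```python
MI = 1024 ** 2  # bytes in one mebibyte


def split_evenly(total: int, containers: int) -> int:
    """Per-container share of `total`, with the fractional part dropped."""
    return total // containers


# Assumed inputs: 256Mi memory, 100m CPU, three containers in the pod.
memory_request_bytes = split_evenly(256 * MI, 3)
cpu_request_millicores = split_evenly(100, 3)

print(memory_request_bytes)    # 89478485
print(cpu_request_millicores)  # 33
```

Whether Tekton truncates or rounds is not confirmed here; truncation simply matches the `"89478485"` and `33m` values shown.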
The defaults I saw on the cluster:
```
name: step-use-trusted-artifact
resources:
  limits:
    memory: 4Gi
  requests:
    cpu: "1"
    memory: 4Gi
```
I think the lack of a CPU limit could be the reason rather than
memory.
by analogy with #20625
It seems to be injected without surprises: no other environment variables are lost. On the screenshot: left is from pod before
Few nuances:
Kudos to @davdhacs for researching and suggesting it in github.com//pull/20625.
in order to try to speed them up. They take 2-5 minutes across the board, and I hope bumping that gives us a noticeable benefit.
msugakov
left a comment
Self-approving, because I can.
to trigger retests, as Konflux seems not to be receptive to /test or /retest at the moment.
I think this is okay for testing, but won't it limit the schedulability of the tasks if the request is kept high?
Yes, increasing memory requests creates more demand on the cluster's resources, so our pods/tasks/pipelines may get stuck waiting to be scheduled. Given that, I don't see any way around it. Do you?
I think with GOMEMLIMIT we'll begin to see the tasks' memory use flatten out at or near that limit. So maybe it could be a TODO to reduce the request (and possibly the limit) once we measure memory use to be stable around GOMEMLIMIT? That's why I'm thinking we can set a lower request (for 95% of running time) and then a limit above where we measure it can burst to. That does open a risk: if many tasks burst simultaneously, they will exhaust the memory. But in my monitoring I only saw these bursts last a maximum of 10s (with GOMEMLIMIT below the limit), and not at the same time for the set of jobs for one commit. So I think the risk is very low, and certainly less likely than the OOMs.
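The trade-off discussed in this thread can be put in numbers. This is only a rough sketch: the task count of 10 and the 3Gi lower request are hypothetical, and only the 6Gi figure comes from the PR. It relies on the fact that the Kubernetes scheduler reserves capacity by request, while the limit caps how far a container can burst.

```python
def cluster_memory_demand(tasks, request_gi, limit_gi):
    """Return (Gi reserved by the scheduler, Gi used if every task bursts at once)."""
    return tasks * request_gi, tasks * limit_gi


# Hypothetical load: 10 tasks running concurrently.
guaranteed = cluster_memory_demand(10, 6, 6)  # request == limit == 6Gi
burstable = cluster_memory_demand(10, 3, 6)   # lower request, same 6Gi limit

print(guaranteed)  # (60, 60)
print(burstable)   # (30, 60)
```

The lower request halves what the scheduler has to find up front, at the cost of a worst case (all tasks bursting together) that is no longer reserved in advance.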
to trigger CI again.
/konflux-retest retag-scanner-db-slim
for another run
…refetch-dependencies-mem
Feeling eager to merge today and begin backporting.
@tommartensen please feel free to share your comments after this gets merged. I'll keep an eye on it and will address them in a follow-up.

Description
Per title and ticket.
User-facing documentation
Testing and quality
Automated testing
No change.
How I validated my change
A few re-runs on this PR showed no failures.