ROX-33431: Bump Konflux tasks memory to avoid OOMs by kurlov · Pull Request #20655 · stackrox/stackrox

kurlov · 2026-05-18T14:57:12Z

Description

Per title and ticket.

User-facing documentation

CHANGELOG.md is updated OR update is not needed
documentation PR is created and is linked above OR is not needed

Testing and quality

the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
CI results are inspected

Automated testing

No change.

How I validated my change

A few re-runs on this PR showed no failures.

github-actions · 2026-05-18T15:10:27Z

🚀 Build Images Ready

Images are ready for commit 5b69e0d. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.11.x-1104-g5b69e0dc85

davdhacs

+1

4Gi is confirmed insufficient: OOMKills were observed on our own PR (roxctl-on-push-9thcq, main-on-push-847dt) with the 4Gi override active. This was also confirmed by msugakov on PR #20655. The profiler captured peaks of up to 3938Mi, but the real spikes exceed 4Gi faster than the 10s sampling interval can capture (the pod gets OOMKilled before the next sample). 6Gi gives 2Gi of headroom above the highest observed spike and matches the sast-snyk-check step's request (which avoids Tekton resource conflicts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
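To illustrate why a 10s sampling interval can under-report the real peak, here is a minimal sketch. The trace is hypothetical and made up for illustration; only the 3938Mi sampled peak, the 4Gi limit, and the 10s interval come from the message above. The helper name `sampled_peak` is also an assumption, not part of any profiler.

```python
MI = 1024 * 1024
GI = 1024 * MI


def sampled_peak(trace, interval):
    """Return the peak seen by a sampler that reads every `interval`-th point."""
    return max(trace[::interval])


# Hypothetical memory trace at 1-second resolution, in bytes (30 seconds long).
trace = [3000 * MI] * 30
trace[10] = 3938 * MI          # a high value that happens to land on a sample point
for second in (23, 24, 25):    # a short spike that falls between sample points 20 and 30
    trace[second] = 5000 * MI

true_peak = max(trace)                 # what the kernel's OOM killer reacts to
seen_peak = sampled_peak(trace, 10)    # what a 10s profiler would report

print(f"sampled peak: {seen_peak // MI}Mi, true peak: {true_peak // MI}Mi")
```

In this made-up trace the sampler reports a peak below 4Gi even though the process briefly exceeded it, which is the situation the message describes.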

The former CPU setting is no longer needed because it is now set in the task definition: https://github.com/konflux-ci/build-definitions/blob/fc137d3bfba7b1670619dace7e320a017baad2ab/task/prefetch-dependencies-oci-ta/0.3/prefetch-dependencies-oci-ta.yaml#L225-L231

The default is 256Mi (see https://github.com/konflux-ci/tekton-tools/blob/6b95ba8e7d381ed2b31a779fc07586c0f12f64b5/tasks/rpms-signature-scan/0.2/rpms-signature-scan.yaml#L52-L58). I also confirmed this by checking on the cluster.

There were only 14 instances of `create-trusted-artifact` failures, and I think the missing CPU limit is more to blame there than the low memory limit. I bump the memory limit as well, but only slightly. Ref: https://github.com/konflux-ci/build-definitions/blob/fc137d3bfba7b1670619dace7e320a017baad2ab/task/prefetch-dependencies-oci-ta/0.3/prefetch-dependencies-oci-ta.yaml#L243-L248

The default values aren't set in the definition; they get computed by Tekton's logic, which divides the Task resources between all containers. I see this on the cluster:

```
name: step-use-trusted-artifact
resources:
  limits:
    memory: 2Gi
  requests:
    cpu: 33m
    memory: "89478485"
```

It probably would have been sufficient to set limits == requests == 2Gi, but I am making it a bit higher just in case.
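As a quick check of the "divided between all containers" explanation, the arithmetic below reproduces the observed request values. The totals (256Mi of memory, 100m of CPU) and the container count of three are assumptions picked because they match the numbers seen on the cluster; the PR does not state them for this task, and `split_evenly` is a hypothetical helper, not Tekton code.

```python
MI = 1024 * 1024


def split_evenly(total, containers):
    """Divide a task-level quantity across containers, rounding down."""
    return total // containers


# Assumed task-level totals and container count (not stated in the PR).
memory_request_bytes = split_evenly(256 * MI, 3)
cpu_request_millicores = split_evenly(100, 3)

print(memory_request_bytes)      # compare with memory: "89478485"
print(cpu_request_millicores)    # compare with cpu: 33m
```

If those assumptions hold, the odd-looking byte count is simply one third of 256Mi rather than a value anyone configured by hand.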

The defaults I saw on the cluster:

```
name: step-use-trusted-artifact
resources:
  limits:
    memory: 4Gi
  requests:
    cpu: "1"
    memory: 4Gi
```

I think the lack of a CPU limit could be the cause rather than memory.

msugakov · 2026-05-20T10:40:55Z

Note on GOMEMLIMIT

It seems to be injected without surprises: no other environment variables are lost. In the screenshot, the left side is from a pod before GOMEMLIMIT was added and the right side is from a pod after.

A few nuances:

It gets set on all containers of the pod because it is in podTemplate: both the ones where we override memory and the ones where we don't.
It should only be recognized by Go programs.
It is an advisory setting, so the actual memory usage can go higher. The Go GC will just try harder (slowing the program down by up to 2x) to free memory when usage approaches GOMEMLIMIT.
Therefore there should be some buffer between GOMEMLIMIT and the memory limit, although it should be small.
Yes, in this revision I have some tasks where the buffer is large, but otherwise I would not be able to find a setting that fits all (because it is applied to all containers).
- Worst cases: we reserve some idle memory, or GC overhead occurs earlier than it should. Neither is critical, IMO.
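The buffer idea from the nuances above can be sketched as a small helper that derives a GOMEMLIMIT value from a container memory limit. The 10% buffer and the function name `gomemlimit_for` are assumptions for illustration; the PR does not say which values were actually used. The `MiB` suffix is one of the unit suffixes the Go runtime accepts for GOMEMLIMIT.

```python
def gomemlimit_for(limit_mib, buffer_percent=10):
    """Return a GOMEMLIMIT string that sits `buffer_percent` below the hard limit.

    `limit_mib` is the container memory limit in MiB. Integer arithmetic is used
    so the result is always rounded down, keeping the soft limit under the hard one.
    """
    if not 0 <= buffer_percent < 100:
        raise ValueError("buffer_percent must be in [0, 100)")
    soft_limit_mib = limit_mib * (100 - buffer_percent) // 100
    return f"{soft_limit_mib}MiB"


# Example: a 6Gi container limit with an assumed 10% buffer.
value_for_6gi = gomemlimit_for(6 * 1024)
print(f"GOMEMLIMIT={value_for_6gi}")
```

Because a single value in podTemplate applies to every container, a real setup would have to pick one result that is acceptable for the smallest relevant limit, which is the trade-off described above.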

Kudos to @davdhacs for researching and suggesting it in #20625.

msugakov

Self-approving, because I can.

davdhacs · 2026-05-20T13:13:56Z

> Yes, in this revision I have some tasks where the buffer is large, but otherwise I would not be able to find a setting that fits all (due to being applied to all containers).
> Worst cases: we reserve some idle memory, or GC overhead occurs earlier than it should. Neither is critical, IMO.

I think this is okay for testing, but won't it limit the schedule-ability of the tasks if the request is kept high?

msugakov · 2026-05-20T13:32:34Z

> I think this is okay for testing, but won't it limit the schedule-ability of the tasks if the request is kept high?

Yes, increasing memory requests creates more demand on the cluster's resources, so our pods/tasks/pipelines may get stuck waiting to be scheduled.

Given that GOMEMLIMIT is only a soft hint and is by itself not capable of preventing OOMKills (experience from Sensor), adjusting the memory requests is the only working way to prevent OOMs.

I don't see any way around it. Do you?

davdhacs · 2026-05-20T15:08:01Z

> > I think this is okay for testing, but won't it limit the schedule-ability of the tasks if the request is kept high?
>
> Yes, increasing memory requests creates more demand on the cluster's resources and so it may happen that our pods/tasks/pipelines can be getting stuck waiting.
>
> Given that GOMEMLIMIT is only a soft hint and by itself not capable of preventing OOMKills (experience from Sensor), adjusting the memory requests is the only working way to prevent OOMs.
>
> I don't see any way around it. Do you?

I think that with GOMEMLIMIT we'll begin to see the tasks' memory use flatten out at or near that limit. So maybe it could be a TODO to reduce the request (and possibly the limit) once we measure the memory use to be stable around GOMEMLIMIT?

That's why I'm thinking we can set a lower request (covering 95% of the running time) and a limit above the level we measure it can burst to. That does open a risk that, if many tasks burst simultaneously, they will exhaust the node's memory. But in my monitoring I only saw these bursts last for a maximum of 10s (with GOMEMLIMIT below the limit), and not at the same time across the set of jobs for one commit. So I think the risk is very low and certainly less likely than the OOMs.
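The trade-off between schedulability and burst risk can be made concrete with a rough sketch. All numbers here (a 32Gi node, a 2Gi request, a 6Gi limit) are hypothetical and chosen only for illustration, and the model ignores everything a real scheduler considers besides memory requests.

```python
def tasks_that_fit(node_gi, request_gi):
    """How many tasks the scheduler would place, judging only by memory requests."""
    return node_gi // request_gi


def worst_case_usage(task_count, limit_gi):
    """Memory needed if every placed task bursts to its limit at the same time."""
    return task_count * limit_gi


NODE_GI = 32  # hypothetical node size

# Option A: request == limit == 6Gi (no overcommit, fewer tasks per node).
fit_a = tasks_that_fit(NODE_GI, 6)
worst_a = worst_case_usage(fit_a, 6)

# Option B: request 2Gi, limit 6Gi (more tasks per node, bursts can overcommit).
fit_b = tasks_that_fit(NODE_GI, 2)
worst_b = worst_case_usage(fit_b, 6)

print(f"A: {fit_a} tasks, worst case {worst_a}Gi of {NODE_GI}Gi")
print(f"B: {fit_b} tasks, worst case {worst_b}Gi of {NODE_GI}Gi")
```

Under these assumptions the lower request lets many more tasks share a node, but the node only stays healthy as long as the bursts remain short and do not coincide, which is the risk weighed above.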

msugakov · 2026-05-20T16:04:56Z

https://konflux-ui.apps.stone-prd-rh01.pg1f.p1.openshiftapps.com/ns/rh-acs-tenant/applications/acs/pipelineruns/retag-scanner-db-slim-jn6sk -> Quay issue.

msugakov · 2026-05-20T16:05:04Z

/konflux-retest retag-scanner-db-slim

Feeling eager to merge today and begin backporting.

msugakov · 2026-05-21T12:04:47Z

@tommartensen please feel free to share your comments after this gets merged. I'll keep an eye out and will address them in a follow-up.

openshift-ci · 2026-05-21T13:29:44Z

@kurlov: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/gke-qa-e2e-tests | `05e8a73` | link | false | `/test gke-qa-e2e-tests` |

Bump prefetch-dependencies memory

190f111

kurlov requested review from a team and rhacs-bot as code owners May 18, 2026 14:57

rhacs-bot requested a review from a team May 18, 2026 14:57

github-actions Bot added the konflux-build Run Konflux in PR. Push commit to trigger it. label May 18, 2026

lvalerom approved these changes May 18, 2026

Fix step name

39a961a

tommartensen previously requested changes May 18, 2026

Comment thread .tekton/operator-bundle-build.yaml Outdated

Bump memory

981b6ec

kurlov requested a review from tommartensen May 18, 2026 15:16

kurlov changed the title ~~fix(ci): Bump Konflux operator-bundle-build prefetch-dependencies task memory~~ fix(ci): Bump Konflux operator-bundle-build step memory May 18, 2026

davdhacs approved these changes May 18, 2026

msugakov reviewed May 18, 2026

Comment thread .tekton/operator-bundle-build.yaml

msugakov changed the title ~~fix(ci): Bump Konflux operator-bundle-build step memory~~ ROX-33431: Bump Konflux operator-bundle-build step memory May 19, 2026

msugakov changed the title ~~ROX-33431: Bump Konflux operator-bundle-build step memory~~ ROX-33431: Bump Konflux tasks memory to avoid OOMs May 19, 2026

msugakov assigned msugakov and unassigned msugakov May 19, 2026

msugakov added 2 commits May 19, 2026 19:44

Reorder task overrides and bump memory to 6Gi, for operator bundle

3bce732

msugakov marked this pull request as draft May 19, 2026 17:47

openshift-ci Bot added the do-not-merge/work-in-progress label May 19, 2026

msugakov added 2 commits May 19, 2026 19:52

Bump rpms-signature-scan memory

1be8f08

msugakov force-pushed the akurlov/fix-bump-operator-bundle-on-push-prefetch-dependencies-mem branch from 379932a to 9dbaebe Compare May 19, 2026 18:03

msugakov added 3 commits May 19, 2026 20:11

Override use-ta resources for build-images/-container

d3a35b7

Reuse ta task resources via anchors and aliases, add for sast tasks

dfea712

Try inject GOMEMLIMIT

ec59be9

by analogy with #20625

msugakov self-assigned this May 20, 2026

Bump cpu from 1 to 2 for TA tasks

a645d32

in order to try to speed them up. They take 2-5 minutes across the board, and I hope bumping that will give us a noticeable benefit.

msugakov approved these changes May 20, 2026

stackrox deleted a comment from openshift-ci Bot May 20, 2026

Empty commit

a420a9b

to trigger retests, as Konflux seems not to be receptive to /test or /retest at the moment.

davdhacs reviewed May 20, 2026

Comment thread .tekton/main-build.yaml

Comment thread .tekton/main-build.yaml

Empty commit

1db9c01

to trigger CI again.

davdhacs mentioned this pull request May 20, 2026

fix(ci): remove prefetch cpu:2 throttle — GOMEMLIMIT is faster #20703

Closed

5 tasks

Empty commit

11cb741

for another run

msugakov marked this pull request as ready for review May 21, 2026 07:58

openshift-ci Bot removed the do-not-merge/work-in-progress label May 21, 2026

Merge branch 'master' into akurlov/fix-bump-operator-bundle-on-push-prefetch-dependencies-mem

05e8a73

msugakov enabled auto-merge (squash) May 21, 2026 12:03

msugakov merged commit 5b69e0d into master May 21, 2026
107 of 110 checks passed

msugakov deleted the akurlov/fix-bump-operator-bundle-on-push-prefetch-dependencies-mem branch May 21, 2026 14:57


5 participants
