Skip to content

ROX-33431: Bump Konflux tasks memory to avoid OOMs#20655

Merged
msugakov merged 18 commits into
masterfrom
akurlov/fix-bump-operator-bundle-on-push-prefetch-dependencies-mem
May 21, 2026
Merged

ROX-33431: Bump Konflux tasks memory to avoid OOMs#20655
msugakov merged 18 commits into
masterfrom
akurlov/fix-bump-operator-bundle-on-push-prefetch-dependencies-mem

Conversation

@kurlov

@kurlov kurlov commented May 18, 2026

Copy link
Copy Markdown
Member

Description

Per title and ticket.

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

No change.

How I validated my change

Few re-runs on this PR show no failures.

@kurlov kurlov requested review from a team and rhacs-bot as code owners May 18, 2026 14:57
@rhacs-bot rhacs-bot requested a review from a team May 18, 2026 14:57
@github-actions github-actions Bot added the konflux-build Run Konflux in PR. Push commit to trigger it. label May 18, 2026
Comment thread .tekton/operator-bundle-build.yaml Outdated
@github-actions

github-actions Bot commented May 18, 2026

Copy link
Copy Markdown
Contributor

🚀 Build Images Ready

Images are ready for commit 5b69e0d. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.11.x-1104-g5b69e0dc85

@kurlov kurlov requested a review from tommartensen May 18, 2026 15:16
@kurlov kurlov changed the title fix(ci): Bump Konflux operator-bundle-build prefetch-dependencies task memory fix(ci): Bump Konflux operator-bundle-build step memory May 18, 2026

@davdhacs davdhacs left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Comment thread .tekton/operator-bundle-build.yaml
davdhacs added a commit that referenced this pull request May 18, 2026
4Gi is confirmed insufficient — OOMKills observed on our own PR
(roxctl-on-push-9thcq, main-on-push-847dt) with the 4Gi override
active. Also confirmed by msugakov on PR #20655.

The profiler captured peaks up to 3938Mi but the real spikes exceed
4Gi faster than the 10s sampling interval can capture (pod gets
OOMKilled before the next sample).

6Gi gives 2Gi headroom above the highest observed spike and matches
the sast-snyk-check step's request (avoids Tekton resource conflicts).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@msugakov msugakov changed the title fix(ci): Bump Konflux operator-bundle-build step memory ROX-33431: Bump Konflux operator-bundle-build step memory May 19, 2026
@msugakov msugakov changed the title ROX-33431: Bump Konflux operator-bundle-build step memory ROX-33431: Bump Konflux tasks memory to avoid OOMs May 19, 2026
@msugakov msugakov assigned msugakov and unassigned msugakov May 19, 2026
@msugakov msugakov marked this pull request as draft May 19, 2026 17:47
msugakov added 2 commits May 19, 2026 19:52
There were only 14 instances of `create-trusted-artifact` failures,
and I think no CPU limit is more guilty there than low mem limit.
Although I bump mem limit as well, but just a little bit.

Ref
https://github.com/konflux-ci/build-definitions/blob/fc137d3bfba7b1670619dace7e320a017baad2ab/task/prefetch-dependencies-oci-ta/0.3/prefetch-dependencies-oci-ta.yaml#L243-L248
@msugakov msugakov force-pushed the akurlov/fix-bump-operator-bundle-on-push-prefetch-dependencies-mem branch from 379932a to 9dbaebe Compare May 19, 2026 18:03
msugakov added 3 commits May 19, 2026 20:11
The default values aren't set in the definition, they get computed
based on Tekton's logic by dividing the Task resources between all
containers.

I see this on the cluster:
```
      name: step-use-trusted-artifact
      resources:
        limits:
          memory: 2Gi
        requests:
          cpu: 33m
          memory: "89478485"
```

Probably, would have been sufficient to set limits==requests==2Gi,
but making it a bit higher just in case.
The defaults I saw on the cluster:
```
      name: step-use-trusted-artifact
      resources:
        limits:
          memory: 4Gi
        requests:
          cpu: "1"
          memory: 4Gi
```

I think, the lack of CPU limit could be the reason rather than
memory.
@msugakov msugakov self-assigned this May 20, 2026
@msugakov

msugakov commented May 20, 2026

Copy link
Copy Markdown
Contributor

Note on GOMEMLIMIT
image

It seems to be injected without surprises: no other environment variables are lost. On the screenshot: left is from pod before GOMEMLIMIT was added, right - after.

Few nuances:

  • It does get set on all containers of the pod due to being in podTemplate. The ones where we override memory and the ones where we don't.
  • It should only be recognized by Go programs.
  • It is an advisory setting, the actual memory usage can go higher. Just Go GC will try harder (slowing up the program up to 2x) to clean memory when its use is approaching GOMEMLIMIT.
  • Therefore there should be some buffer between GOMEMLIMIT and memory limit, although it has to be small.
  • Yes, in this revision I have some tasks where the buffer is large, but otherwise I would not be able to find a setting that fits all (due to being applied to all containers).
    • Worst cases: we reserve some idle memory, or GC overhead occurs earlier than it should. Neither is critical, IMO.

Kudos to @davdhacs for researching and suggesting it in github.com//pull/20625.

in order to try speed them up. They take 2-5 minutes across the
board and I hope bumping that would give us noticeable benefit.

@msugakov msugakov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self-approving, because I can.

@stackrox stackrox deleted a comment from openshift-ci Bot May 20, 2026
to trigger retests as Konflux seems not receptible to /test or
/retest at them moment.
Comment thread .tekton/main-build.yaml
Comment thread .tekton/main-build.yaml
@davdhacs

Copy link
Copy Markdown
Contributor

Yes, in this revision I have some tasks where the buffer is large, but otherwise I would not be able to find a setting that fits all (due to being applied to all containers).
Worst cases: we reserve some idle memory, or GC overhead occurs earlier than it should. Neither is critical, IMO.

I think this is okay for testing, but won't it limit the schedule-ability of the tasks if the request is kept high?

@msugakov

Copy link
Copy Markdown
Contributor

I think this is okay for testing, but won't it limit the schedule-ability of the tasks if the request is kept high?

Yes, increasing memory requests creates more demand on the cluster's resources and so it may happen that our pods/tasks/pipelines can be getting stuck waiting.

Given that GOMEMLIMIT is only a soft hint and by itself not capable of preventing OOMKills (experience from Sensor), adjusting the memory requests is the only working way to prevent OOMs.

I don't see any way around it. Do you?

@davdhacs

Copy link
Copy Markdown
Contributor

I think this is okay for testing, but won't it limit the schedule-ability of the tasks if the request is kept high?

Yes, increasing memory requests creates more demand on the cluster's resources and so it may happen that our pods/tasks/pipelines can be getting stuck waiting.

Given that GOMEMLIMIT is only a soft hint and by itself not capable of preventing OOMKills (experience from Sensor), adjusting the memory requests is the only working way to prevent OOMs.

I don't see any way around it. Do you?

I think with the GOMEMLIMIT, we'll begin to see the tasks memory use to flatten out at/near that memlimit. So maybe it could be a TODO to reduce the request(and possibly limit) once we measure the memory use to be stable around GOMEMLIMIT?

That's why I'm thinking we can set a lower request (for 95% of running time) and then a limit above where we measure it can burst to. That does open a risk of, if there are many tasks that burst simultaneously they will exhaust the memory. But in my monitoring I only saw these burst happen for a max of 10s (with GOMEMLIMIT below the limit) and not at the same time for the set of jobs for one commit. So I think the risk is very low and less likely than the OOMs for certain.

to trigger CI again.
@msugakov

Copy link
Copy Markdown
Contributor

@msugakov

Copy link
Copy Markdown
Contributor

/konflux-retest retag-scanner-db-slim

for another run
@msugakov msugakov marked this pull request as ready for review May 21, 2026 07:58
@msugakov msugakov dismissed tommartensen’s stale review May 21, 2026 12:03

Feeling eagerness to merge today and begin backporting.

@msugakov msugakov enabled auto-merge (squash) May 21, 2026 12:03
@msugakov

Copy link
Copy Markdown
Contributor

@tommartensen please feel free to share your comments after this gets merged. I'll keep an eye and will address in a follow-up.

@openshift-ci

openshift-ci Bot commented May 21, 2026

Copy link
Copy Markdown

@kurlov: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/gke-qa-e2e-tests 05e8a73 link false /test gke-qa-e2e-tests

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

disable-konflux-auto-retest Disable automatic Konflux pipeline re-runs konflux-build Run Konflux in PR. Push commit to trigger it.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants