ROX-33431: Bump resources again to prevent OOM kills #20902
There were only 3 failures in the last 7 days, and I think they are
more likely caused by unbounded CPU than by insufficient memory.
Limiting the CPU is the first step.
Here's what I found for OOM-killed containers on the cluster:
```
name: step-build
resources:
  limits:
    memory: 3Gi
  requests:
    cpu: 250m
    memory: 3Gi
```
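The step above sets a memory limit but no CPU limit, which is presumably what "unbounded CPU" refers to. As a minimal sketch, assuming the container spec has been loaded into a plain Python dict, such containers could be flagged as follows. The helper name `missing_limits` is hypothetical and not part of any tooling in this repository.

```python
def missing_limits(container: dict) -> list[str]:
    """Return the resource names (cpu, memory) that have no limit set."""
    limits = (container.get("resources") or {}).get("limits") or {}
    return [name for name in ("cpu", "memory") if name not in limits]


# The step-build spec quoted above, written as a dict.
step_build = {
    "name": "step-build",
    "resources": {
        "limits": {"memory": "3Gi"},
        "requests": {"cpu": "250m", "memory": "3Gi"},
    },
}

print(missing_limits(step_build))  # prints ['cpu']
```

A container that sets both limits, like the `step-prefetch-dependencies` spec later in this thread, would yield an empty list.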
Because I still saw an OOM kill on master.
In `main-on-push-x5lxh-prefetch-dependencies-pod`:
```
name: step-prefetch-dependencies
resources:
  limits:
    cpu: "1"
    memory: 5Gi
  requests:
    cpu: "1"
    memory: 5Gi
```
```
name: step-prefetch-dependencies
ready: false
restartCount: 0
started: false
state:
  terminated:
    containerID: cri-o://3a3d5195556288c2d164f1fe7d03aa2da4df6f25ba79ba55f911ac0e2c5cde40
    exitCode: 1
    finishedAt: "2026-05-29T16:28:49Z"
    message: '[{"key":"StartedAt","value":"2026-05-29T16:26:49.307Z","type":3}]'
    reason: OOMKilled
    startedAt: "2026-05-29T16:26:39Z"
```
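The status above is what identifies the container as OOM-killed: `state.terminated.reason` is `OOMKilled`. A minimal sketch of picking such containers out of a pod object, assuming the pod has been parsed into a Python dict (for example from `kubectl get pod -o json`), might look like this. The function name `oom_killed_containers` is hypothetical, and this is not necessarily how the list in this commit was produced.

```python
def oom_killed_containers(pod: dict) -> list[str]:
    """Return names of containers whose terminated state reports OOMKilled."""
    status = pod.get("status") or {}
    names = []
    # Check both init containers and regular containers.
    for key in ("initContainerStatuses", "containerStatuses"):
        for cs in status.get(key) or []:
            terminated = (cs.get("state") or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(cs["name"])
    return names


# Abbreviated pod object built from the status quoted above.
pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "step-prefetch-dependencies",
                "state": {"terminated": {"exitCode": 1, "reason": "OOMKilled"}},
            }
        ]
    }
}

print(oom_killed_containers(pod))  # prints ['step-prefetch-dependencies']
```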
I don't know if just 1 GB will be enough, but I found only 3 examples
in the last 7 days, so I hope it will.
- main-on-push-x5lxh-prefetch-dependencies-pod
- operator-on-push-9gpwt-prefetch-dependencies-pod
- roxctl-on-push-jpppc-prefetch-dependencies-pod
Still saw failures on master with the 5G setting.
For example, in `main-on-push-qtwtc-sast-unicode-check-pod`:
```
name: step-use-trusted-artifact
resources:
  limits:
    cpu: "2"
    memory: 5Gi
  requests:
    cpu: "2"
    memory: 5Gi
```
```
name: step-use-trusted-artifact
ready: false
restartCount: 0
started: false
state:
  terminated:
    containerID: cri-o://c70b7cd68bb9498fc70b65477dbb96da3af650775bad0a9632f833000e951620
    exitCode: 2
    finishedAt: "2026-05-27T11:37:10Z"
    message: '[{"key":"StartedAt","value":"2026-05-27T11:17:39.524Z","type":3}]'
    reason: OOMKilled
    startedAt: "2026-05-27T11:17:32Z"
```
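The `startedAt` and `finishedAt` fields also show how long a container ran before it was killed, which may help when comparing cases. A small sketch of that arithmetic follows; the helper name `run_seconds` is hypothetical, and it assumes timestamps in the second-resolution UTC format shown in these statuses.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # format used by startedAt / finishedAt above


def run_seconds(started_at: str, finished_at: str) -> float:
    """Return the seconds elapsed between two status timestamps."""
    start = datetime.strptime(started_at, TIMESTAMP_FORMAT)
    finish = datetime.strptime(finished_at, TIMESTAMP_FORMAT)
    return (finish - start).total_seconds()


# step-use-trusted-artifact from the status above.
print(run_seconds("2026-05-27T11:17:32Z", "2026-05-27T11:37:10Z"))  # prints 1178.0
# step-prefetch-dependencies from the status quoted earlier in this thread.
print(run_seconds("2026-05-29T16:26:39Z", "2026-05-29T16:28:49Z"))  # prints 130.0
```

By this measure, `step-use-trusted-artifact` ran for about 19.6 minutes before the kill, while `step-prefetch-dependencies` ran for a little over 2 minutes.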
These, for example, were affected in the last 7 days:
- main-on-push-qtwtc-sast-unicode-check-pod
- operator-on-push-9wlxl-sast-snyk-check-pod
- operator-on-push-vbn5p-build-source-image-pod
- roxctl-on-push-9477x-build-images-3-pod
There are about a dozen others, so I bump memory by 2G.
81ed172 to a313848
/konflux-retest operator-on-push
/konflux-retest operator-bundle-on-push
/konflux-retest roxctl-on-push
/konflux-retest central-db-on-push
/konflux-retest main-on-push
/konflux-retest scanner-v4-db-on-push
/konflux-retest operator-index-on-push
/konflux-retest scanner-v4-on-push
/konflux-retest central-db-on-push
/konflux-retest roxctl-on-push
/konflux-retest scanner-v4-db-on-push
/konflux-retest scanner-v4-on-push
/konflux-retest checks
/retest
🚀 Build Images Ready
Images are ready for commit 40f6647. To use with deploy scripts: export MAIN_IMAGE_TAG=4.12.x-96-g40f6647d5f
/retest
/konflux-retest central-db-on-push
/konflux-retest main-on-push
/konflux-retest operator-bundle-on-push
/konflux-retest operator-index-on-push
/konflux-retest operator-on-push
/konflux-retest scanner-v4-db-on-push
/konflux-retest scanner-v4-on-push
/konflux-retest roxctl-on-push
/konflux-retest retag-collector
/konflux-retest retag-fact
/konflux-retest retag-scanner
/konflux-retest retag-scanner-db
/konflux-retest retag-scanner-db-slim
/konflux-retest retag-scanner-slim
/konflux-retest operator-bundle-on-push
/konflux-retest create-custom-snapshot
/konflux-retest operator-index-on-push
Description
Follows up on #20655 with findings from the OOM dashboard. See https://redhat-internal.slack.com/archives/C081BAM590Q/p1780296820165749
Note that I had to manually pull some info from the pods to determine which branch they ran for. That's how I narrowed down the set of failures to look at.
It's best to review this PR commit by commit because I shared details in each commit message.
User-facing documentation
Testing and quality
Automated testing
No change.
How I validated my change
Only CI. Will run a few times.