Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

馃挜 engine: Remove --oci-max-parallelism #7395

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

gerhard
Copy link
Member

@gerhard gerhard commented May 16, 2024

Fixes #6894

This is not an option that we should support. Justin puts it best:

I think with (Dagger) Modules, this effectively means that
`max-parallelism` is pretty much just completely broken - it just fits
very badly with the idea of nested containers.

... but to me this kind of highlights that we need a better way of
constraining resources - max-parallelism is just really not fit for
the shape that pipelines tend to take these days. Maybe we should
provide some cgroup config options instead (if the end desire is to
constrain CPU usage/etc).

While this can still be configured if a custom Engine toml config is provided, it means that users went out of their way to have this. It's hard on purpose and we don't recommend using this setting anymore.

This also removes it from the Helm chart.

WDYT @sipsma @jedevc @aluzzardi @vito ?


@matipan if everyone is in agreement with this change, we should remove this from all our infra. We can do that before the release goes out.

@gerhard gerhard added this to the v0.12.0 milestone May 16, 2024
@gerhard gerhard changed the title Remove oci-max-parallelism entirely 馃挜 BREAKING CHANGE 馃挜 Remove oci-max-parallelism May 16, 2024
@gerhard gerhard changed the title 馃挜 BREAKING CHANGE 馃挜 Remove oci-max-parallelism 馃挜 engine: Remove oci-max-parallelism May 16, 2024
@gerhard gerhard changed the title 馃挜 engine: Remove oci-max-parallelism 馃挜 engine: Remove --oci-max-parallelism May 16, 2024
@gerhard gerhard force-pushed the remove-oci-max-parallelism branch 2 times, most recently from c533cee to 3e145f0 Compare May 16, 2024 17:28
This is not an option that we should support. Justin puts it best:

    I think with (Dagger) Modules, this effectively means that
    `max-parallelism` is pretty much just completely broken - it just fits
    very badly with the idea of nested containers.

    ... but to me this kind of highlights that we need a better way of
    constraining resources - max-parallelism is just really not fit for
    the shape that pipelines tend to take these days. Maybe we should
    provide some cgroup config options instead (if the end desire is to
    constrain CPU usage/etc).

While this can still be configured if a custom Engine toml config is
provided, it means that users went out of their way to have this. It's
hard on purpose and we don't recommend using this setting anymore.

This also removes it from the Helm chart.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
@gerhard gerhard force-pushed the remove-oci-max-parallelism branch from 3e145f0 to 5bc265f Compare May 21, 2024 16:48
@gerhard
Copy link
Member Author

gerhard commented May 21, 2024

Speaking with @aluzzardi, these checks may have failed due to a ClickHouse migration which killed some in-flight traces. Rebasing & force pushing so that we force re-run all failed checks.

@jedevc
Copy link
Member

jedevc commented May 22, 2024

If possible, I'd like to hold off on merging this. If I have the time, I think keeping this option would be best for people using it, but we'd need the improved semantics previously described. But we can keep this open at least as a tracking issue.

@matipan if everyone is in agreement with this change, we should remove this from all our infra. We can do that before the release goes out.

IMO, we should do this as soon as possible - it's definitely a potential candidate for some of the weird ephemeral CI failures we've seen. With this option it's possible to get every running test is a stuck state (if the right combination of tests all start at the right time) - which prevents any progress being made (which sounds like a symptom we've seen before).

@sipsma
Copy link
Contributor

sipsma commented May 23, 2024

Agree with @jedevc, this option is used already by multiple users and while it's totally buggy once modules are used with too low a value, that feels more like a bug than enough justification to remove this entirely.

  • BTW, @gerhard you asked about the num-cpu option; that was added because it was requested by a user a while back. It makes it easier for users to scale this setting across many different hosts with many different CPU counts without having to jump through hoops to update the engine config on each host.

I'd totally be in favor of adding a scary warning message to the help for this though and tracking a long term fix to the problem.

  • Actually now that I think about it, all the session related refactoring work should make fixing this a lot more tenable (what would have been a large project to medium at most).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

馃悶 Setting --oci-max-parallelism 2 causes deadlocks
3 participants