[Core|RayOnSpark] finding ports fails when launching Ray on Spark and verbose logs #45409
Comments
@jjyao can you help triage this? Thanks
cc @WeichenXu123 can you take a look at this one?
@WeichenXu123 gentle ping here.
checking
Ah, you are creating 70 Ray worker nodes, but how many Spark worker nodes are there? We recommend making one Ray worker node occupy all the CPUs/GPUs of a Spark worker node, i.e. one Spark worker node launches at most one Ray worker node. This reduces the risk of port conflicts. We have a mechanism to prevent port conflicts: ray/python/ray/util/spark/cluster_init.py Line 287 in 0be0639
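The arithmetic behind this recommendation can be sketched as follows. This is only an illustration: the function name and both parameters are hypothetical and are not part of Ray's API. It assumes Spark packs as many Ray worker nodes onto a Spark worker as its CPU count allows.

```python
def max_ray_workers_per_spark_node(spark_node_cpus: int, cpus_per_ray_worker: int) -> int:
    """Upper bound on Ray worker nodes that fit on one Spark worker node.

    Hypothetical helper for illustration only; it is not a Ray API.
    """
    if spark_node_cpus <= 0 or cpus_per_ray_worker <= 0:
        raise ValueError("CPU counts must be positive")
    # Integer division: each Ray worker node reserves a fixed number of CPUs.
    return spark_node_cpus // cpus_per_ray_worker


if __name__ == "__main__":
    # A 16-core Spark worker with 4-CPU Ray workers can host up to 4 of them,
    # so 4 Ray worker nodes may compete for ports on the same machine.
    print(max_ray_workers_per_spark_node(16, 4))
    # Giving each Ray worker all 16 CPUs caps it at one per Spark worker.
    print(max_ray_workers_per_spark_node(16, 16))
```

Under this assumption, any result greater than 1 means several Ray worker nodes can share a machine, which is the situation the recommendation is trying to avoid.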
According to your error message,
ports > 30000 might be used by other Ray components. Reducing the number of Ray worker nodes per Spark worker node should address the issue. @jjyao Does the Ray system service use the port range 10000 ~ 20000?
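To check whether ports in a given range are actually free on a machine, a bind probe with the standard library can help while debugging. This is a minimal sketch and not Ray's own port allocation logic. The function name `find_free_port` and the range used in the example are assumptions for illustration.

```python
import socket
from typing import Optional


def find_free_port(start: int, end: int, host: str = "127.0.0.1") -> Optional[int]:
    """Return the first port in [start, end) that can be bound on host, else None.

    Illustrative helper only; Ray on Spark uses its own mechanism.
    """
    for port in range(start, end):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                # Port is in use (or not bindable); try the next one.
                continue
            return port
    return None


if __name__ == "__main__":
    # Probe a small, arbitrary range above 30000.
    print(find_free_port(30000, 30010))
```

Note that such a probe is inherently racy: another process can take the port between the check and its later use, which is one reason several Ray worker nodes starting at the same time on one machine can still collide.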
Thanks for having a look, @WeichenXu123. I will answer your questions below and add some thoughts:
We have 100+ nodes, so it should be fine to have 70 Ray workers. But yes, I do see Ray often putting more than one worker on the same node. How can I tell Ray not to do that without telling it to use all the resources of the machine?
I think there might be a few issues with this: the nodes in the cluster might differ from one another, so it will be hard. Also, it might take longer for my Ray task to launch if it has to wait for a full machine to be available instead of just using whatever is available.
What happened + What you expected to happen
When trying to use Ray on Spark (docs here: https://docs.ray.io/en/latest/cluster/vms/user-guides/community/spark.html) I often see very spammy logs about not being able to bind to ports. Sometimes it works anyway and sometimes it fails. Is there a way to make this more robust and less verbose?
and these too:
Versions / Dependencies
versions:
on Ubuntu
Reproduction script
Following the docs for Ray on Spark (https://docs.ray.io/en/latest/cluster/vms/user-guides/community/spark.html) I created a context manager that is called with these settings:
Issue Severity
High: It blocks me from completing my task.