[Core|RayOnSpark] finding ports fails when launching Ray on Spark and verbose logs #45409

Open
fersarr opened this issue May 17, 2024 · 7 comments
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
P1: Issue that should be fixed within a few weeks

Comments


fersarr commented May 17, 2024

What happened + What you expected to happen

When trying to use Ray on Spark (docs here: https://docs.ray.io/en/latest/cluster/vms/user-guides/community/spark.html), I often see very spammy logs about failing to bind to ports. Sometimes the launch works anyway and sometimes it fails. Is there a way to make this more robust and less verbose?

(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,644 E 64622 64622] (raylet) logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): bind: Address already in use [system:98 at external/boost/boost/asio/detail/reactive_socket_service.hpp:161 in function 'bind']
(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,706 E 64622 64622] (raylet) logging.cc:104: Stack trace:
(raylet, ip=xx.xx.xx.xx)  /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xb8391a) [0x5638ddfc391a] ray::operator<<()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xb85f58) [0x5638ddfc5f58] ray::TerminateHandler()
(raylet, ip=xx.xx.xx.xx) /lib64/libstdc++.so.6(+0x5ea06) [0x7f2816d40a06]
(raylet, ip=xx.xx.xx.xx) /lib64/libstdc++.so.6(+0x5ea33) [0x7f2816d40a33]
(raylet, ip=xx.xx.xx.xx) /lib64/libstdc++.so.6(+0x5ec53) [0x7f2816d40c53]
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x1c7896) [0x5638dd607896] boost::throw_exception<>()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc6852b) [0x5638de0a852b] boost::asio::detail::do_throw_error()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x30fe72) [0x5638dd74fe72] _ZN5boost4asio21basic_socket_acceptorINS0_7generic15stream_protocolENS0_15any_io_executorEEC1I23instrumented_io_contextEERT_RKNS2_14basic_endpointIS3_EEbNS0_10constraintIXsrSt14is_convertibleIS9_RNS0_17execution_contextEE5valueEiE4typeE
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x310a7e) [0x5638dd750a7e] ray::raylet::Raylet::Raylet()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x253260) [0x5638dd693260] main::{lambda()#1}::operator()()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x2546d3) [0x5638dd6946d3] std::_Function_handler<>::_M_invoke()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x46bb58) [0x5638dd8abb58] std::_Function_handler<>::_M_invoke()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x4b6da4) [0x5638dd8f6da4] ray::rpc::GcsRpcClient::GetInternalConfig()::{lambda()#2}::operator()()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x45cd05) [0x5638dd89cd05] ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x273e15) [0x5638dd6b3e15] std::_Function_handler<>::_M_invoke()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5b40fe) [0x5638dd9f40fe] EventTracker::RecordExecution()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5ad4ee) [0x5638dd9ed4ee] std::_Function_handler<>::_M_invoke()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5ad966) [0x5638dd9ed966] boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc6519b) [0x5638de0a519b] boost::asio::detail::scheduler::do_run_one()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc67729) [0x5638de0a7729] boost::asio::detail::scheduler::run()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc67c42) [0x5638de0a7c42] boost::asio::io_context::run()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x1cfafa) [0x5638dd60fafa] main
(raylet, ip=xx.xx.xx.xx) /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f2816720555] __libc_start_main
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x2249c7) [0x5638dd6649c7]
(raylet, ip=xx.xx.xx.xx)
(raylet, ip=xx.xx.xx.xx) *** SIGABRT received at time=1715942221 on cpu 12 ***
(raylet, ip=xx.xx.xx.xx) PC: @     0x7f2816734387  (unknown)  raise
(raylet, ip=xx.xx.xx.xx)     @     0x7f2817503630  (unknown)  (unknown)
(raylet, ip=xx.xx.xx.xx)     @     0x7f2816d40a06  (unknown)  (unknown)
(raylet, ip=xx.xx.xx.xx)     @     0x5638de261c80  950501776  (unknown)
(raylet, ip=xx.xx.xx.xx)     @     0x7f2816d41fb0  (unknown)  (unknown)
(raylet, ip=xx.xx.xx.xx)     @ 0x3de907894810c083  (unknown)  (unknown)
(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,710 E 64622 64622] (raylet) logging.cc:361: *** SIGABRT received at time=1715942221 on cpu 12 ***
(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,710 E 64622 64622] (raylet) logging.cc:361: PC: @     0x7f2816734387  (unknown)  raise
(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,711 E 64622 64622] (raylet) logging.cc:361:     @     0x7f2817503630  (unknown)  (unknown)
(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,711 E 64622 64622] (raylet) logging.cc:361:     @     0x7f2816d40a06  (unknown)  (unknown)
(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,714 E 64622 64622] (raylet) logging.cc:361:     @     0x5638de261c80  950501776  (unknown)
(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,714 E 64622 64622] (raylet) logging.cc:361:     @     0x7f2816d41fb0  (unknown)  (unknown)
(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,715 E 64622 64622] (raylet) logging.cc:361:     @ 0x3de907894810c083  (unknown)  (unknown)

and these errors too:


  File "/venv/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
  File "/venv/lib/python3.8/site-packages/ray/scripts/scripts.py", line 928, in start
    node = ray._private.node.Node(
  File "/venv/lib/python3.8/site-packages/ray/_private/node.py", line 302, in __init__
    self._ray_params.update_pre_selected_port()
  File "/venv/lib/python3.8/site-packages/ray/_private/parameter.py", line 347, in update_pre_selected_port
    raise ValueError(
ValueError: Ray component worker_ports is trying to use a port number 35260 that is used by other components.
Port information: {'gcs': 'random', 'object_manager': 'random', 'node_manager': 'random', 'gcs_server': 'random', 'client_server': 10001, 'dashboard': 8265, 'dashboard_agent_grpc': 60800, 'dashboard_agent_http': 10577, 'dashboard_grpc': 'random', 'runtime_env_agent': 35260, 'metrics_export': 38742, 'redis_shards': 'random', 'worker_ports': '1000 ports from 35000 to 35999'}
If you allocate ports, please make sure the same port is not used by multiple components.
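
For clarity, the conflict the ValueError reports is that the runtime_env_agent port falls inside the preallocated worker_ports block. A minimal illustration using the values from the log above:

# Illustration only, using the values from the error above: the
# runtime_env_agent port (35260) falls inside the preallocated
# worker_ports range (35000-35999), so the two components collide.
runtime_env_agent_port = 35260
worker_ports = range(35000, 36000)  # '1000 ports from 35000 to 35999'
print(runtime_env_agent_port in worker_ports)  # True -> the ValueError fires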

Versions / Dependencies

Python 3.8.12
Ray 2.9.3
PySpark 3.4.1

on Ubuntu

Reproduction script

Following the docs for Ray on Spark (https://docs.ray.io/en/latest/cluster/vms/user-guides/community/spark.html), I created a context manager that is called with these settings:

with create_ray_on_spark(
    min_worker_nodes=4,
    max_worker_nodes=70,
    num_cpus_worker_node=4,
    spark_conf_dict={"spark.executor.memory": "20g"},
) as (ray_connection_str, sc):
    ...  # run the Ray workload against the cluster here
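
create_ray_on_spark is our own helper; roughly, it wraps ray.util.spark.setup_ray_cluster and shutdown_ray_cluster from the linked docs. A simplified sketch (the SparkSession handling and exact keyword names are illustrative; the real helper differs in details):

# Simplified sketch of our wrapper, not the exact code. It assumes the
# documented ray.util.spark API; keyword names may differ across Ray versions.
from contextlib import contextmanager

import ray
from pyspark.sql import SparkSession
from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster

@contextmanager
def create_ray_on_spark(min_worker_nodes, max_worker_nodes,
                        num_cpus_worker_node, spark_conf_dict=None):
    builder = SparkSession.builder
    for key, value in (spark_conf_dict or {}).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    try:
        # Starts the Ray head node plus autoscaling Ray worker nodes on Spark.
        setup_ray_cluster(
            max_worker_nodes=max_worker_nodes,
            min_worker_nodes=min_worker_nodes,
            num_cpus_worker_node=num_cpus_worker_node,
        )
        ray.init()  # connects to the cluster set up above
        yield ray.get_runtime_context().gcs_address, spark.sparkContext
    finally:
        ray.shutdown()
        shutdown_ray_cluster()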

Issue Severity

High: It blocks me from completing my task.

@fersarr fersarr added the bug and triage labels May 17, 2024
@anyscalesam anyscalesam added the core label May 20, 2024
@rynewang
Contributor

@jjyao can you help triage this? Thanks

@jjyao
Contributor

jjyao commented May 21, 2024

cc @WeichenXu123 can you take a look at this one?

@jjyao jjyao assigned jjyao and unassigned jjyao May 28, 2024
@jjyao
Contributor

jjyao commented May 28, 2024

@WeichenXu123 gentle ping here.

@jjyao jjyao added the P1 label and removed the triage label May 28, 2024
@WeichenXu123
Contributor

checking

@WeichenXu123
Contributor

WeichenXu123 commented May 29, 2024

Ah, you are creating 70 Ray worker nodes, but how many Spark worker nodes are there?

We recommend making one Ray worker node occupy all the CPUs/GPUs of a Spark worker node, i.e. each Spark worker node launches at most one Ray worker node. This reduces the risk of port conflicts.

We have a mechanism to prevent port conflicts:

def _preallocate_ray_worker_port_range():

According to your error message '1000 ports from 35000 to 35999', you must have started at least 16 Ray worker nodes on the same machine.
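
The idea is roughly the following (a hypothetical sketch with made-up names and a made-up base port, not Ray's actual code): each Ray worker node starting on a machine takes an advisory file lock and claims the first free 1000-port block, so concurrent workers get disjoint ranges. With a base port of 20000, the 16th block is exactly 35000-35999, which is why the message above implies at least 16 Ray worker nodes on one machine:

# Hypothetical sketch, not Ray's actual implementation: concurrent Ray
# worker nodes on one machine serialize on a file lock, then claim the
# first unclaimed 1000-port block so worker port ranges never overlap.
import os
from filelock import FileLock  # third-party 'filelock' package

BLOCK_SIZE = 1000
BASE_PORT = 20000   # made-up base; block 15 is then 35000-35999
MAX_BLOCKS = 16

def preallocate_worker_port_range(state_dir="/tmp/ray_port_blocks"):
    os.makedirs(state_dir, exist_ok=True)
    with FileLock(os.path.join(state_dir, "ports.lock")):
        for block in range(MAX_BLOCKS):
            marker = os.path.join(state_dir, "block_%d" % block)
            if not os.path.exists(marker):
                open(marker, "w").close()  # claim this block
                start = BASE_PORT + block * BLOCK_SIZE
                return start, start + BLOCK_SIZE - 1
    raise RuntimeError("No free 1000-port block left on this machine.")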

@WeichenXu123
Contributor

Ports above 30000 might be used by other Ray components. Reducing the number of Ray worker nodes per Spark worker node should address the issue. @jjyao does the Ray system service use the port range 10000 ~ 20000?
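
Concretely, that recommendation amounts to sizing num_cpus_worker_node to a full Spark worker node. For example, with hypothetical 16-CPU Spark workers (the figure is illustrative):

# Hypothetical: if each Spark worker node has 16 CPUs, requesting all 16
# per Ray worker node means at most one Ray worker lands on each machine,
# so their preallocated port blocks cannot collide.
with create_ray_on_spark(
    min_worker_nodes=4,
    max_worker_nodes=70,
    num_cpus_worker_node=16,  # = total CPUs of one Spark worker node
    spark_conf_dict={"spark.executor.memory": "20g"},
) as (ray_connection_str, sc):
    ...  # run the Ray workload as before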

@fersarr
Author

fersarr commented May 29, 2024

Thanks for having a look, @WeichenXu123. I will answer your questions below and add some thoughts:

Ah, you are creating 70 Ray worker nodes, but how many Spark worker nodes are there?

We have 100+ nodes, so it should be fine to have 70 Ray workers. But yes, I do see Ray often putting more than one worker on the same node. How can I tell Ray not to do that without telling it to use all the resources of the machine?

We recommend making one Ray worker node occupy all the CPUs/GPUs of a Spark worker node

I think there might be a few issues with this: the nodes in the cluster might be heterogeneous, so that will be hard. It might also take longer for my Ray task to launch if it has to wait for a full machine to be available instead of just using whatever is free.
Also, if I configure Ray to use all the CPUs/GPUs in a Spark worker node, that will potentially lead to underutilisation of the cluster, right? We have really big nodes, sometimes with more than 150 cores.
