[Core|RayOnSpark] finding ports fails when launching Ray on Spark and verbose logs #45409

Open
fersarr opened this issue May 17, 2024 · 7 comments
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
P1: Issue that should be fixed within a few weeks

Comments


fersarr commented May 17, 2024

What happened + What you expected to happen

When trying to use Ray on Spark (docs here: https://docs.ray.io/en/latest/cluster/vms/user-guides/community/spark.html), I often see very spammy logs about failing to bind to ports. Sometimes the launch works anyway and sometimes it fails. Is there a way to make this more robust and less verbose?

(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,644 E 64622 64622] (raylet) logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): bind: Address already in use [system:98 at external/boost/boost/asio/detail/reactive_socket_service.hpp:161 in function 'bind']
(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,706 E 64622 64622] (raylet) logging.cc:104: Stack trace:
(raylet, ip=xx.xx.xx.xx)  /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xb8391a) [0x5638ddfc391a] ray::operator<<()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xb85f58) [0x5638ddfc5f58] ray::TerminateHandler()
(raylet, ip=xx.xx.xx.xx) /lib64/libstdc++.so.6(+0x5ea06) [0x7f2816d40a06]
(raylet, ip=xx.xx.xx.xx) /lib64/libstdc++.so.6(+0x5ea33) [0x7f2816d40a33]
(raylet, ip=xx.xx.xx.xx) /lib64/libstdc++.so.6(+0x5ec53) [0x7f2816d40c53]
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x1c7896) [0x5638dd607896] boost::throw_exception<>()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc6852b) [0x5638de0a852b] boost::asio::detail::do_throw_error()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x30fe72) [0x5638dd74fe72] _ZN5boost4asio21basic_socket_acceptorINS0_7generic15stream_protocolENS0_15any_io_executorEEC1I23instrumented_io_contextEERT_RKNS2_14basic_endpointIS3_EEbNS0_10constraintIXsrSt14is_convertibleIS9_RNS0_17execution_contextEE5valueEiE4typeE
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x310a7e) [0x5638dd750a7e] ray::raylet::Raylet::Raylet()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x253260) [0x5638dd693260] main::{lambda()#1}::operator()()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x2546d3) [0x5638dd6946d3] std::_Function_handler<>::_M_invoke()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x46bb58) [0x5638dd8abb58] std::_Function_handler<>::_M_invoke()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x4b6da4) [0x5638dd8f6da4] ray::rpc::GcsRpcClient::GetInternalConfig()::{lambda()#2}::operator()()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x45cd05) [0x5638dd89cd05] ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x273e15) [0x5638dd6b3e15] std::_Function_handler<>::_M_invoke()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5b40fe) [0x5638dd9f40fe] EventTracker::RecordExecution()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5ad4ee) [0x5638dd9ed4ee] std::_Function_handler<>::_M_invoke()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x5ad966) [0x5638dd9ed966] boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc6519b) [0x5638de0a519b] boost::asio::detail::scheduler::do_run_one()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc67729) [0x5638de0a7729] boost::asio::detail::scheduler::run()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0xc67c42) [0x5638de0a7c42] boost::asio::io_context::run()
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x1cfafa) [0x5638dd60fafa] main
(raylet, ip=xx.xx.xx.xx) /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f2816720555] __libc_start_main
(raylet, ip=xx.xx.xx.xx) /venv/lib/python3.8/site-packages/ray/core/src/ray/raylet/raylet(+0x2249c7) [0x5638dd6649c7]
(raylet, ip=xx.xx.xx.xx)
(raylet, ip=xx.xx.xx.xx) *** SIGABRT received at time=1715942221 on cpu 12 ***
(raylet, ip=xx.xx.xx.xx) PC: @     0x7f2816734387  (unknown)  raise
(raylet, ip=xx.xx.xx.xx)     @     0x7f2817503630  (unknown)  (unknown)
(raylet, ip=xx.xx.xx.xx)     @     0x7f2816d40a06  (unknown)  (unknown)
(raylet, ip=xx.xx.xx.xx)     @     0x5638de261c80  950501776  (unknown)
(raylet, ip=xx.xx.xx.xx)     @     0x7f2816d41fb0  (unknown)  (unknown)
(raylet, ip=xx.xx.xx.xx)     @ 0x3de907894810c083  (unknown)  (unknown)
(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,710 E 64622 64622] (raylet) logging.cc:361: *** SIGABRT received at time=1715942221 on cpu 12 ***
(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,710 E 64622 64622] (raylet) logging.cc:361: PC: @     0x7f2816734387  (unknown)  raise
(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,711 E 64622 64622] (raylet) logging.cc:361:     @     0x7f2817503630  (unknown)  (unknown)
(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,711 E 64622 64622] (raylet) logging.cc:361:     @     0x7f2816d40a06  (unknown)  (unknown)
(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,714 E 64622 64622] (raylet) logging.cc:361:     @     0x5638de261c80  950501776  (unknown)
(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,714 E 64622 64622] (raylet) logging.cc:361:     @     0x7f2816d41fb0  (unknown)  (unknown)
(raylet, ip=xx.xx.xx.xx) [2024-05-17 11:37:01,715 E 64622 64622] (raylet) logging.cc:361:     @ 0x3de907894810c083  (unknown)  (unknown)

and these errors too:


  File "/venv/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
  File "/venv/lib/python3.8/site-packages/ray/scripts/scripts.py", line 928, in start
    node = ray._private.node.Node(
  File "/venv/lib/python3.8/site-packages/ray/_private/node.py", line 302, in __init__
    self._ray_params.update_pre_selected_port()
  File "/venv/lib/python3.8/site-packages/ray/_private/parameter.py", line 347, in update_pre_selected_port
    raise ValueError(
ValueError: Ray component worker_ports is trying to use a port number 35260 that is used by other components.
Port information: {'gcs': 'random', 'object_manager': 'random', 'node_manager': 'random', 'gcs_server': 'random', 'client_server': 10001, 'dashboard': 8265, 'dashboard_agent_grpc': 60800, 'dashboard_agent_http': 10577, 'dashboard_grpc': 'random', 'runtime_env_agent': 35260, 'metrics_export': 38742, 'redis_shards': 'random', 'worker_ports': '1000 ports from 35000 to 35999'}
If you allocate ports, please make sure the same port is not used by multiple components.
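
For clarity, the conflict the ValueError reports is that the runtime_env_agent port falls inside the preallocated worker_ports block. A minimal illustration using the values from the log above:

# Illustration only, using the values from the error above: the
# runtime_env_agent port (35260) falls inside the preallocated
# worker_ports range (35000-35999), so the two components collide.
runtime_env_agent_port = 35260
worker_ports = range(35000, 36000)  # '1000 ports from 35000 to 35999'
print(runtime_env_agent_port in worker_ports)  # True -> the ValueError fires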

Versions / Dependencies

Python 3.8.12
Ray 2.9.3
PySpark 3.4.1

on Ubuntu

Reproduction script

Following the docs for Ray on Spark (https://docs.ray.io/en/latest/cluster/vms/user-guides/community/spark.html), I created a context manager that is called with these settings:

with create_ray_on_spark(
    min_worker_nodes=4,
    max_worker_nodes=70,
    num_cpus_worker_node=4,
    spark_conf_dict={"spark.executor.memory": "20g"},
) as (ray_connection_str, sc):
    ...  # run the Ray workload against the cluster here
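
create_ray_on_spark is our own helper; roughly, it wraps ray.util.spark.setup_ray_cluster and shutdown_ray_cluster from the linked docs. A simplified sketch (the SparkSession handling and exact keyword names are illustrative; the real helper differs in details):

# Simplified sketch of our wrapper, not the exact code. It assumes the
# documented ray.util.spark API; keyword names may differ across Ray versions.
from contextlib import contextmanager

import ray
from pyspark.sql import SparkSession
from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster

@contextmanager
def create_ray_on_spark(min_worker_nodes, max_worker_nodes,
                        num_cpus_worker_node, spark_conf_dict=None):
    builder = SparkSession.builder
    for key, value in (spark_conf_dict or {}).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    try:
        # Starts the Ray head node plus autoscaling Ray worker nodes on Spark.
        setup_ray_cluster(
            max_worker_nodes=max_worker_nodes,
            min_worker_nodes=min_worker_nodes,
            num_cpus_worker_node=num_cpus_worker_node,
        )
        ray.init()  # connects to the cluster set up above
        yield ray.get_runtime_context().gcs_address, spark.sparkContext
    finally:
        ray.shutdown()
        shutdown_ray_cluster()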

Issue Severity

High: It blocks me from completing my task.

@fersarr fersarr added the bug and triage labels May 17, 2024
@anyscalesam anyscalesam added the core label May 20, 2024
@rynewang
Contributor

@jjyao can you help triage this? Thanks

@jjyao
Contributor

jjyao commented May 21, 2024

cc @WeichenXu123 can you take a look at this one?

@jjyao jjyao assigned jjyao and unassigned jjyao May 28, 2024
@jjyao
Contributor

jjyao commented May 28, 2024

@WeichenXu123 gentle ping here.

@jjyao jjyao added the P1 label and removed the triage label May 28, 2024
@WeichenXu123
Contributor

checking

@WeichenXu123
Contributor

WeichenXu123 commented May 29, 2024

Ah, you are creating 70 Ray worker nodes, but how many Spark worker nodes are there?

We recommend making one Ray worker node occupy all the CPUs/GPUs of a Spark worker node, i.e. each Spark worker node launches at most one Ray worker node. This reduces the risk of port conflicts.

We have a mechanism to prevent port conflicts:

def _preallocate_ray_worker_port_range():

According to your error message '1000 ports from 35000 to 35999', you must have started at least 16 Ray worker nodes on the same machine.
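
The idea is roughly the following (a hypothetical sketch with made-up names and a made-up base port, not Ray's actual code): each Ray worker node starting on a machine takes an advisory file lock and claims the first free 1000-port block, so concurrent workers get disjoint ranges. With a base port of 20000, the 16th block is exactly 35000-35999, which is why the message above implies at least 16 Ray worker nodes on one machine:

# Hypothetical sketch, not Ray's actual implementation: concurrent Ray
# worker nodes on one machine serialize on a file lock, then claim the
# first unclaimed 1000-port block so worker port ranges never overlap.
import os
from filelock import FileLock  # third-party 'filelock' package

BLOCK_SIZE = 1000
BASE_PORT = 20000   # made-up base; block 15 is then 35000-35999
MAX_BLOCKS = 16

def preallocate_worker_port_range(state_dir="/tmp/ray_port_blocks"):
    os.makedirs(state_dir, exist_ok=True)
    with FileLock(os.path.join(state_dir, "ports.lock")):
        for block in range(MAX_BLOCKS):
            marker = os.path.join(state_dir, "block_%d" % block)
            if not os.path.exists(marker):
                open(marker, "w").close()  # claim this block
                start = BASE_PORT + block * BLOCK_SIZE
                return start, start + BLOCK_SIZE - 1
    raise RuntimeError("No free 1000-port block left on this machine.")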

@WeichenXu123
Contributor

Ports above 30000 might be used by other Ray components. Reducing the number of Ray worker nodes per Spark worker node should address the issue. @jjyao does the Ray system service use the port range 10000 ~ 20000?
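
Concretely, that recommendation amounts to sizing num_cpus_worker_node to a full Spark worker node. For example, with hypothetical 16-CPU Spark workers (the figure is illustrative):

# Hypothetical: if each Spark worker node has 16 CPUs, requesting all 16
# per Ray worker node means at most one Ray worker lands on each machine,
# so their preallocated port blocks cannot collide.
with create_ray_on_spark(
    min_worker_nodes=4,
    max_worker_nodes=70,
    num_cpus_worker_node=16,  # = total CPUs of one Spark worker node
    spark_conf_dict={"spark.executor.memory": "20g"},
) as (ray_connection_str, sc):
    ...  # run the Ray workload as before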

@fersarr
Author

fersarr commented May 29, 2024

Thanks for having a look, @WeichenXu123. I will answer your questions below and add some thoughts:

Ah, you are creating 70 Ray worker nodes, but how many Spark worker nodes are there?

We have 100+ nodes, so it should be fine to have 70 Ray workers. But yes, I do see Ray often putting more than one worker on the same node. How can I tell Ray not to do that without telling it to use all the resources of the machine?

We recommend making one Ray worker node occupy all the CPUs/GPUs of a Spark worker node

I think there might be a few issues with this: the nodes in the cluster might be heterogeneous, so that will be hard. It might also take longer for my Ray task to launch if it has to wait for a full machine to be available instead of just using whatever is free.
Also, if I configure Ray to use all the CPUs/GPUs in a Spark worker node, that will potentially lead to underutilisation of the cluster, right? We have really big nodes, sometimes with more than 150 cores.
