I work on making datacenter systems fast, efficient, and resilient — from optical datacenter networks to distributed ML training. I’m a PhD student at the Max Planck Institute for Informatics (MPI-INF), advised by Yiting Xia. I work across the stack, including training frameworks, collectives and parallelism, networking stacks, precise time synchronization, and FPGA acceleration.

Highlights

  • Publications: two first-author NSDI papers (OpenOptics: democratizing optical DCNs; SyncWise: sub-10 ns clock accuracy).
  • Open-source systems: OpenOptics — primary contributor, ~30k LoC, modular architecture, comprehensive docs, tutorial at SIGCOMM’25.
  • ML systems: Phoenix (checkpoint-less JAX recovery, AWS AI internship) and a feature upstreamed in JAX PR #36613.

News

Selected Publications

  • [Under Submission] Phoenix: Checkpoint-less Failure Recovery for Auto-parallelism.

  • [NSDI’26] OpenOptics: Enabling Open Research and Implementation of Optical Data Center Networks. (paper, website)
    Yiming Lei, Federico De Marchi, Raj Joshi, Jialong Li, Balakrishnan Chandrasekaran, Yiting Xia.

  • [NSDI’26] SyncWise: Error-Aware Time Synchronization for Reconfigurable Data Center Networks. (paper)
    Yiming Lei, Jialong Li, Zhengqing Liu, Raj Joshi, Yiting Xia.

  • [HotNets’22] Efficient Flow Scheduling in Distributed Deep Learning Training with Echelon Formation. (paper)
    Rui Pan*, Yiming Lei*, Jialong Li, Zhiqiang Xie, Binhang Yuan, Yiting Xia. (*Equal Contributions).

Engineering


OpenOptics (GitHub, paper, NSDI'26) — realize customized optical data center networks with ~10 lines of Python.

  • Primary contributor — shipped ~30k LoC.
  • Modular architecture spanning topology, routing, and monitoring, with multi-backend support: Mininet, ns-3, and Tofino.
  • Comprehensive documentation; tutorial given at SIGCOMM'25.

Phoenix (Under Submission) — Checkpoint-less failure recovery for JAX auto-parallelism. Built during my AWS AI internship; recovers GSPMD/pjit training without periodic checkpointing.

JAX upstream contribution (jax-ml/jax#36613): added a ProcessFailureError to JAX’s live_devices context manager so callers can identify which devices died on a failure. Resilient training systems build on this primitive.

Other Projects

Digital Molecular Computer — A specialized processor for the Boolean satisfiability problem (SAT), inspired by molecular computing. Prototyped in Verilog on an FPGA.

Experience

Misc.

Outside the office, you’ll often find me playing tennis, bouldering, hiking, experimenting in the kitchen, or hanging out with my cat.

Mengmeng