I work on making datacenter systems fast, efficient, and resilient — from optical datacenter networks to distributed ML training. I’m a PhD student at the Max Planck Institute for Informatics (MPI-INF), advised by Yiting Xia. I work across the stack, including training frameworks, collectives and parallelism, networking stacks, precise time synchronization, and FPGA acceleration.
Highlights
- Publications: two first-author NSDI papers (OpenOptics — democratizing optical DCNs; SyncWise — sub-10 ns clock accuracy).
- Open-source systems: OpenOptics — primary contributor, ~30k LoC, modular architecture, comprehensive docs, tutorial at SIGCOMM’25.
- ML systems: Phoenix (checkpoint-less JAX recovery, AWS AI internship) and a feature upstreamed in JAX PR #36613.
News
- OpenOptics (website) has been accepted at NSDI’26!
- SyncWise has been accepted at NSDI’26!
- We hosted a tutorial on OpenOptics at SIGCOMM’25.
Selected Publications
- [Under Submission] Phoenix: Checkpoint-less Failure Recovery for Auto-parallelism.
- [NSDI’26] OpenOptics: Enabling Open Research and Implementation of Optical Data Center Networks. (paper, website)
  Yiming Lei, Federico De Marchi, Raj Joshi, Jialong Li, Balakrishnan Chandrasekaran, Yiting Xia.
- [NSDI’26] SyncWise: Error-Aware Time Synchronization for Reconfigurable Data Center Networks. (paper)
  Yiming Lei, Jialong Li, Zhengqing Liu, Raj Joshi, Yiting Xia.
- [HotNets’22] Efficient Flow Scheduling in Distributed Deep Learning Training with Echelon Formation. (paper)
  Rui Pan*, Yiming Lei*, Jialong Li, Zhiqiang Xie, Binhang Yuan, Yiting Xia. (*Equal contributions)
Engineering
OpenOptics (GitHub, paper, NSDI'26) — build a customized optical data center network in ~10 lines of Python.
- Primary contributor — shipped ~30k LoC.
- Modular architecture spanning topology, routing, and monitoring, with multi-backend support: Mininet, ns-3, and Tofino.
- Comprehensive documentation; tutorial given at SIGCOMM'25.
Phoenix (Under Submission) — Checkpoint-less failure recovery for JAX auto-parallelism. Built during my AWS AI internship; recovers GSPMD/pjit training without periodic checkpointing.
JAX upstream contribution — jax-ml/jax#36613: added a ProcessFailureError to JAX’s live_devices context manager so callers can identify which devices died on a failure — a primitive that resilient training systems build on top of.
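To illustrate the kind of pattern this primitive enables, here is a minimal, self-contained sketch. The `live_devices` context manager and `ProcessFailureError` below are stand-ins modeled on the PR description, not imports from JAX itself; the probe and recovery logic are illustrative assumptions.

```python
from contextlib import contextmanager

class ProcessFailureError(RuntimeError):
    """Raised when some devices drop out; carries the failed device ids."""
    def __init__(self, failed_devices):
        super().__init__(f"devices failed: {sorted(failed_devices)}")
        self.failed_devices = frozenset(failed_devices)

@contextmanager
def live_devices(devices, probe):
    """Yield the devices believed alive; raise if `probe` reports failures."""
    failed = {d for d in devices if not probe(d)}
    if failed:
        raise ProcessFailureError(failed)
    yield list(devices)

def step_or_shrink(devices, probe):
    """Run a step on all devices; on failure, shrink the mesh to survivors."""
    try:
        with live_devices(devices, probe) as alive:
            return alive  # normal path: every device is healthy
    except ProcessFailureError as e:
        # The error identifies exactly which devices died, so a resilient
        # trainer can rebuild its mesh on the survivors instead of aborting.
        return [d for d in devices if d not in e.failed_devices]

# Example: device 2 is unreachable, so the mesh shrinks to the survivors.
survivors = step_or_shrink([0, 1, 2, 3], probe=lambda d: d != 2)
```

The key design point is that the failure signal carries structured information (the set of dead devices) rather than a bare exception, which is what lets checkpoint-less recovery schemes like Phoenix decide how to repartition work.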
Other Projects
Digital Molecular Computer — A specialized processor for the Boolean satisfiability (SAT) problem, inspired by molecular computing. Prototyped in Verilog on an FPGA.
Experience
- Oct 2021 – Present: PhD Student, Max Planck Institute for Informatics
- Sep 2024 – Mar 2025: Applied Scientist Intern, AWS AI — built checkpoint-less failure recovery for JAX-based LLM training (Phoenix).
- Jul 2020 – Mar 2021: Research Assistant, University of Illinois Urbana-Champaign
- Sep 2019 – Feb 2020: Exchange Student, Institut supérieur d’électronique de Paris (ISEP)
- Sep 2017 – Jun 2021: B.Sc. in Computer Science, Beijing University of Posts and Telecommunications
Misc.
Outside the office, you’ll often find me playing tennis, bouldering, hiking, experimenting in the kitchen, or hanging out with my cat.
