Tech, 3h ago, 78% confidence

High-Performance Expert Parallelism Kernels for Large Language Model Inference

1 source

A technical analysis explains how modern GPU clusters handle expert parallelism (EP) in mixture-of-experts (MoE) language models, focusing on the communication kernels that route tokens to distributed experts. Expert parallelism differs from other GPU parallelization methods because routing decisions are made dynamically at runtime, based on the data, rather than following fixed communication patterns. This matters because efficient EP kernels are critical for serving large MoE models, such as DeepSeek's, at production scale across multiple GPUs.
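To make the data-dependent routing concrete, here is a minimal Python sketch of top-k expert selection for a single token. It is an illustration under stated assumptions, not the article's or any library's implementation: the function name `top_k_route` is hypothetical, and normalizing the softmax over only the selected experts is one common convention that may differ from what a given model does.

```python
import math

def top_k_route(logits, k=2):
    """Pick k experts for one token from its router logits (hypothetical helper).

    Returns (expert_ids, weights). Assumption: weights are a softmax taken
    over the selected logits only, so they sum to 1.
    """
    # Sort expert indices by descending logit; ties go to the lower index.
    chosen = sorted(range(len(logits)), key=lambda e: (-logits[e], e))[:k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(logits[e] for e in chosen)
    exps = [math.exp(logits[e] - peak) for e in chosen]
    total = sum(exps)
    return chosen, [x / total for x in exps]

# Two tokens with different logits are sent to different experts,
# so the communication pattern is only known at runtime.
experts_a, weights_a = top_k_route([0.1, 2.0, -1.0, 2.0])
experts_b, weights_b = top_k_route([3.0, 0.0, 1.0, -2.0])
```

Because the chosen experts depend on each token's logits, the set of destination GPUs cannot be fixed ahead of time the way it can for tensor or pipeline parallelism.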

The article provides a detailed technical breakdown of how expert parallelism kernels work in distributed LLM inference systems. Unlike tensor parallelism or pipeline parallelism, where communication patterns are predetermined, expert parallelism requires dynamic routing: a router network decides which experts process which tokens based on learned logits, and these routing decisions vary with each forward pass. The article uses a concrete example of 8 GPUs across 2 nodes with 16 experts (2 per GPU) and 2 routed experts per token to illustrate the dispatch and combine operations. It references DeepSeek's production deployment, which achieves 2.2k tokens/second per GPU on H200 clusters using wide expert parallelism combined with data-parallel attention, and notes that DeepSeek's DeepEP library established the modern architecture for these kernels. The post promises to build up both high-throughput and low-latency kernel designs.
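The dispatch and combine steps for that 8-GPU, 16-expert, top-2 layout can be sketched in plain Python as a single-process simulation. This is only a toy model under assumptions: the contiguous expert placement (experts 0-1 on GPU 0, 2-3 on GPU 1, and so on), the 4-GPUs-per-node split, the scalar token values, and all function names are hypothetical and are not taken from the article or from DeepEP.

```python
from collections import defaultdict

NUM_GPUS = 8
GPUS_PER_NODE = 4      # assumed: 8 GPUs split evenly across 2 nodes
EXPERTS_PER_GPU = 2    # 16 experts in total

def expert_to_gpu(expert_id):
    # Assumed contiguous placement of experts on GPUs.
    return expert_id // EXPERTS_PER_GPU

def crosses_node(src_gpu, dst_gpu):
    # True when a token must travel over the inter-node network.
    return src_gpu // GPUS_PER_NODE != dst_gpu // GPUS_PER_NODE

def dispatch(routes):
    """Group token copies by the GPU that hosts each selected expert.

    routes: list of (src_gpu, token_id, [(expert_id, weight), ...]).
    Returns {dst_gpu: [(src_gpu, token_id, expert_id, weight), ...]}.
    """
    inbox = defaultdict(list)
    for src_gpu, token_id, picks in routes:
        for expert_id, weight in picks:
            dst = expert_to_gpu(expert_id)
            inbox[dst].append((src_gpu, token_id, expert_id, weight))
    return dict(inbox)

def combine(inbox, expert_fn, token_values):
    """Run experts on received tokens and sum weighted outputs per source token."""
    out = defaultdict(float)
    for items in inbox.values():
        for src_gpu, token_id, expert_id, weight in items:
            x = token_values[(src_gpu, token_id)]
            out[(src_gpu, token_id)] += weight * expert_fn(expert_id, x)
    return dict(out)

# Toy inputs: one token on GPU 0 and one on GPU 5, each with 2 routed experts.
routes = [
    (0, 0, [(3, 0.75), (12, 0.25)]),   # experts on GPU 1 and GPU 6
    (5, 0, [(10, 0.5), (11, 0.5)]),    # both experts live on GPU 5
]
token_values = {(0, 0): 2.0, (5, 0): 4.0}
toy_expert = lambda expert_id, x: x * (expert_id + 1)  # stand-in for an MLP

inbox = dispatch(routes)
outputs = combine(inbox, toy_expert, token_values)
```

In this toy run the token on GPU 0 sends one copy within its node (to GPU 1) and one copy across nodes (to GPU 6), while the token on GPU 5 needs no network traffic at all. A real kernel would perform these transfers with GPU-to-GPU communication rather than Python dictionaries.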

What different sources said

Hacker News (Center)
Anatomy of a high-performance EP kernel
