TRELLIS.2 on Mac: Nvidia's CUDA Lock Broken by 9 Files

Shivam Kumar spent a weekend swapping CUDA kernels for pure PyTorch and made Microsoft's flagship 3D model run on an M3 MacBook Air. The original TRELLIS.2 required flash_attn, nvdiffrast, and custom sparse convolutions, all built around CUDA. The port is a credible demonstration that Apple Silicon can run state-of-the-art 3D generative AI without emulation or cloud offloading.

Microsoft's TRELLIS.2, a 4B-parameter image-to-3D model, now runs on Apple Silicon via a pure-PyTorch port by Shivam Kumar.
The original required CUDA-specific ops (flash_attn, nvdiffrast, custom sparse conv) that block Mac, Linux CPU, and AMD GPU users.
Kumar replaced them with gather-scatter sparse convolutions, SDPA attention, and Python mesh extraction—just 9 files changed.
This shows that the proprietary CUDA kernels were a choice, not a necessity, for this model, and it opens 3D AI to a much wider hardware base.
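To make "gather-scatter sparse convolution" concrete, here is a minimal sketch of the idea in plain PyTorch. It is not taken from Kumar's port: the function name, tensor layouts, and the Python-dict neighbor lookup are all assumptions chosen for readability. For each kernel offset it gathers the features of neighboring active voxels, applies one dense matrix multiply, and scatter-adds the result back to the output rows.

```python
import torch


def sparse_conv3d_gather_scatter(coords, feats, weight, offsets):
    """Hypothetical helper, for illustration only.

    coords:  (N, 3) integer tensor of active voxel coordinates
    feats:   (N, C_in) features, one row per active voxel
    weight:  (K, C_in, C_out), one matrix per kernel offset
    offsets: list of K (dx, dy, dz) tuples
    Returns (N, C_out): outputs at the same active voxels.
    """
    # A Python dict plays the role of the coordinate -> row-index hash map.
    index_of = {tuple(c): i for i, c in enumerate(coords.tolist())}
    out = torch.zeros(feats.shape[0], weight.shape[2],
                      dtype=feats.dtype, device=feats.device)

    for k, (dx, dy, dz) in enumerate(offsets):
        src, dst = [], []
        for (x, y, z), i in index_of.items():
            j = index_of.get((x + dx, y + dy, z + dz))
            if j is not None:      # neighbor exists, so it contributes
                src.append(j)
                dst.append(i)
        if not src:
            continue
        src_t = torch.tensor(src, dtype=torch.long, device=feats.device)
        dst_t = torch.tensor(dst, dtype=torch.long, device=feats.device)
        gathered = feats[src_t]            # gather neighbor features
        contrib = gathered @ weight[k]     # dense matmul for this offset
        out.index_add_(0, dst_t, contrib)  # scatter-add into output rows
    return out


coords = torch.tensor([[0, 0, 0], [1, 0, 0], [5, 5, 5]])
feats = torch.ones(3, 2)
offsets = [(-1, 0, 0), (0, 0, 0), (1, 0, 0)]
weight = torch.eye(2).repeat(3, 1, 1)   # identity matrix for each offset
out = sparse_conv3d_gather_scatter(coords, feats, weight, offsets)
```

The Python loop over voxels is slow and only there to keep the sketch short; a practical version would vectorize the neighbor lookup. The point is that every step uses stock tensor operations, so nothing in it requires a CUDA kernel.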

Why Did Microsoft Build TRELLIS.2 for Nvidia Only?

The original TRELLIS.2, released in March 2026, is a showcase of CUDA-specific engineering. It relies on flash_attn (fused attention kernels from the FlashAttention project, written for CUDA), nvdiffrast (Nvidia's differentiable rasterizer), and custom sparse 3D convolution kernels written in CUDA. As shipped, this stack does not run on Macs, and it is not set up for AMD or Intel GPUs. Microsoft's decision to tie the model to Nvidia hardware is a bet on the dominant AI chipmaker, but it excludes the ~15% of developers using Macs (per Stack Overflow 2025 survey). Kumar's port shows that the lock was artificial: the model's architecture is not fundamentally tied to Nvidia.
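The attention swap is the easiest of the three to picture. The sketch below shows one way a codebase could prefer flash_attn when it is installed and fall back to PyTorch's built-in scaled_dot_product_attention (SDPA) everywhere else. The wrapper name `attention` and the dense (batch, seq_len, heads, head_dim) layout are assumptions for illustration; the actual TRELLIS.2 code and the port may structure this differently.

```python
import torch
import torch.nn.functional as F

try:
    # Optional dependency; absent on Macs and other non-CUDA setups.
    from flash_attn import flash_attn_func
    HAS_FLASH = True
except ImportError:
    HAS_FLASH = False


def attention(q, k, v):
    """Hypothetical wrapper: q, k, v are (batch, seq_len, heads, head_dim)."""
    if HAS_FLASH and q.is_cuda and q.dtype in (torch.float16, torch.bfloat16):
        return flash_attn_func(q, k, v)
    # SDPA expects (batch, heads, seq_len, head_dim), so swap axes 1 and 2.
    out = F.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)
    )
    return out.transpose(1, 2)


torch.manual_seed(0)
q = torch.randn(2, 5, 4, 8)
k = torch.randn(2, 5, 4, 8)
v = torch.randn(2, 5, 4, 8)
out = attention(q, k, v)   # runs on CPU; the same call works on an MPS tensor
```

Both paths compute the same softmax(QK^T / sqrt(d))V result, so the fallback changes speed and memory use, not the model's output beyond floating-point noise.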

What Does This Mean for the Image-to-3D Market?

The image-to-3D space is crowded: OpenAI's Point-E, Nvidia's GET3D, and Stability AI's Stable Zero123 all target CUDA first. Kumar's port creates a new competitive axis: hardware independence. Developers on Mac can now generate 3D assets locally, without cloud costs or Nvidia GPUs. This lowers the barrier for indie game developers, VR creators, and 3D printing enthusiasts who use Apple hardware. The potential impact is a democratization of 3D asset generation, moving it from server farms to laptops.

Who Wins and Who Loses in This Port?

Winners: Apple Silicon users (M1/M2/M3/M4), especially those in 3D content creation, game dev, and education. Microsoft's research division gains visibility among non-Nvidia users. PyTorch itself benefits as a platform-agnostic framework. Losers: Nvidia, which loses its CUDA lock-in advantage in 3D AI. Any startup building image-to-3D services that relies on CUDA-only models (e.g., Luma AI, Meshy) now faces competition from a free, local alternative. The port also puts pressure on AMD and Intel to improve their GPU software stacks.

Feature	Original TRELLIS.2	Mac Port (trellis-mac)
Hardware required	Nvidia GPU (CUDA 12+)	Apple Silicon (MPS)
Attention mechanism	flash_attn (CUDA)	SDPA (PyTorch native)
3D convolution	Custom CUDA sparse conv	Gather-scatter PyTorch
Mesh extraction	CUDA hashmap	Python-based
Code changes	N/A (baseline)	~300 lines across 9 files
Performance	Fast (native CUDA)	Slower but functional
Verdict	Nvidia-only, fast	Cross-platform, accessible
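The "CUDA hashmap" row deserves a quick illustration. On the GPU, a hash map answers questions like "is there a voxel at this coordinate?" and "have I already emitted this vertex?". In Python, a set and a dict answer the same questions. The sketch below is not the port's mesh extractor; it is a deliberately simple stand-in (the function name and the blocky quad output are assumptions) that emits one quad for every voxel face with no occupied neighbor and welds shared corners through a dict.

```python
def voxel_surface_mesh(voxels):
    """Hypothetical helper, for illustration only.

    voxels: iterable of (x, y, z) integer coordinates of occupied cells.
    Returns (vertices, quads): corner points and 4-tuples of vertex indices,
    one quad per exposed voxel face.
    """
    occupied = set(voxels)   # hash lookup: is this cell filled?
    vertex_index = {}        # hash lookup: corner point -> vertex id
    vertices, quads = [], []

    # Neighbor offset for each face, plus that face's four corners,
    # listed counter-clockwise when seen from outside the voxel.
    faces = [
        ((1, 0, 0), [(1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 0, 1)]),
        ((-1, 0, 0), [(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0)]),
        ((0, 1, 0), [(0, 1, 0), (0, 1, 1), (1, 1, 1), (1, 1, 0)]),
        ((0, -1, 0), [(0, 0, 0), (1, 0, 0), (1, 0, 1), (0, 0, 1)]),
        ((0, 0, 1), [(0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)]),
        ((0, 0, -1), [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 0)]),
    ]

    for (x, y, z) in sorted(occupied):
        for (dx, dy, dz), corners in faces:
            if (x + dx, y + dy, z + dz) in occupied:
                continue  # face is shared with a neighbor, so it is interior
            quad = []
            for (cx, cy, cz) in corners:
                p = (x + cx, y + cy, z + cz)
                if p not in vertex_index:   # first time this corner appears
                    vertex_index[p] = len(vertices)
                    vertices.append(p)
                quad.append(vertex_index[p])
            quads.append(tuple(quad))
    return vertices, quads


verts_one, quads_one = voxel_surface_mesh([(0, 0, 0)])
verts_two, quads_two = voxel_surface_mesh([(0, 0, 0), (1, 0, 0)])
```

A single voxel yields a cube with 8 vertices and 6 quads; two adjacent voxels share a face, so the pair yields 12 vertices and 10 quads. The trade-off matches the table: dictionary lookups in Python are far slower than a GPU hash map, but they run anywhere.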

My thesis is that this port exposes the fragility of CUDA-only AI models and validates Apple's MPS strategy. Short-term, this is a niche win for Mac developers who want to experiment with 3D generation without buying a $3,000 Nvidia card. Long-term, it pressures Microsoft to release platform-agnostic versions of its research models—or risk losing relevance among non-Nvidia developers. The winner here is PyTorch, which proves it can bridge the gap between CUDA and MPS. The loser is Nvidia's ecosystem, which relies on developers not asking the question: 'Can I run this without CUDA?' Kumar answered that question with a resounding yes. I expect Apple to accelerate MPS optimizations for 3D workloads by WWDC 2026 in June, potentially integrating similar sparse convolution support directly into Metal Performance Shaders.

By WWDC 2026 (June), Apple will announce native Metal Performance Shaders support for sparse 3D convolutions, directly inspired by Kumar's port.
Microsoft will release an official non-CUDA version of TRELLIS.2 within 6 months, or it will risk losing developer mindshare to community ports.
At least two image-to-3D startups (likely including Meshy or Luma AI) will announce Mac-native clients by Q4 2026, citing this port as proof of feasibility.

March 2026: Microsoft releases TRELLIS.2. 4B-parameter image-to-3D model, CUDA-only with flash_attn, nvdiffrast, custom sparse conv.
April 20, 2026: trellis-mac port published. Shivam Kumar ports TRELLIS.2 to Apple Silicon via pure PyTorch, replacing all CUDA-specific ops.
Expected June 2026: Apple WWDC 2026. Potential announcement of MPS optimizations for 3D sparse convolutions.

Microsoft's CUDA lock-in is a strategic weakness, not a technical necessity.
A single developer's weekend project can outpace corporate deployment strategy.
Apple Silicon is now a viable platform for 4B-parameter 3D models—this changes the laptop-as-workstation equation.
The real bottleneck for AI accessibility is not model size, but software stack dependencies.
Expect a wave of 'Mac-first' 3D AI tools within 12 months.