Reasoning

Chain of Thought

Test-time compute (inference-time compute): prompting models to generate intermediate reasoning steps dramatically improved performance on hard problems:

Long CoT and inference-time scaling: 推理模型不是直接生成最终答案, 而是生成一个详细描述其推理过程的长 CoT. 通过控制长 CoT 的长度, 可以控制计算成本, 动态控制推理能力.
Reasoning model can self-evolution with RL and need less supervision.

tip

Thinking tokens are model's only persistent memory during reasoning.

Inference Acceleration

Quantization: 改变模型权重和激活值的精度.
Distillation: data, knowledge, on policy.
Flash attention: minimize data move between slow HBM to faster memory tier (SRAM/VMEM).
Prefix caching: avoid recalculating attention scores for input on each auto-regressive decode step.
Speculative decoding: generate multiple candidates with drafter model (much smaller).
Batching and parallelization: sequence, pipeline, tensor.

Chain of Thought​

Inference Acceleration​

Chain of Thought

Inference Acceleration