A.X K1
Model Summary
A.X K1 is a large-scale Mixture-of-Experts (MoE) language model designed for efficient high-capacity reasoning and instruction following. The model contains 519 billion total parameters, with 33 billion active parameters, enabling strong performance while maintaining practical inference efficiency.
In addition, a hybrid reasoning design lets the user choose between in-depth reasoning (Think mode) and concise low-latency responses (Non-Think mode) depending on task requirements.
Key Features
- **Large-Scale Sparse MoE (519B / 33B Active)**: Employs a high-capacity Mixture-of-Experts architecture that activates only a small subset of experts per token, enabling strong reasoning performance with practical inference efficiency.
- **Hybrid Reasoning Control (Think / Non-Think)**: Supports user-controllable reasoning depth, allowing explicit multi-step reasoning or concise low-latency responses within a single unified model.
- **Tokenizer Optimized for Multilingual and Code Data**: Uses a large-vocabulary BBPE-based tokenizer optimized for token efficiency across five languages (English, Korean, Chinese, Japanese, and Spanish), with a strong emphasis on source code, structured text, and programming-related patterns; a short tokenizer usage sketch follows this list.
- **Stability-Oriented Architecture for Large-Scale MoE**: Incorporates RMSNorm both before and after MLP (MoE) blocks, improving training stability and robustness in sparse, long-context settings.
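A minimal sketch of inspecting the tokenizer's token efficiency across languages and code, assuming the model is published on the Hugging Face Hub; the repository ID `skt/A.X-K1` is a placeholder, not a confirmed identifier.

```python
# Sketch: compare token counts across languages and code.
# The repository ID "skt/A.X-K1" is an assumption; substitute the actual Hub ID.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("skt/A.X-K1", trust_remote_code=True)

samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Korean": "빠른 갈색 여우가 게으른 개를 뛰어넘는다.",
    "Code": "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)",
}

for name, text in samples.items():
    ids = tokenizer.encode(text, add_special_tokens=False)
    print(f"{name}: {len(ids)} tokens for {len(text)} characters")
```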
Model Details
- Architecture: Decoder-only Transformer with Mixture-of-Experts (MoE)
- Total parameters: 519B (192 routed experts + 1 shared expert per MoE layer)
- Active parameters: 33B per token (8 routed experts + 1 shared expert)
- Number of layers: 61 (1 dense + 60 MoE)
- Number of attention heads: 64
- Intermediate size: 7168
- Expert intermediate size: 2048
- Normalization: RMSNorm applied both before and after the MLP block
- Attention: Multi-head Latent Attention (MLA)
- Vocab size: 163,840
- Context length: 131,072 tokens
Architecture Highlights
Mixture-of-Experts Design
A.X K1 follows a sparse Mixture-of-Experts architecture in which only a subset of experts is activated per token. This design substantially increases model capacity while keeping the computational cost comparable to dense models with much smaller parameter counts.
From a scalability and efficiency perspective, MoE architectures enable model capacity to grow primarily by adding experts, with substantially slower growth in compute compared to dense models. Expert parallelism allows experts to be distributed across devices, supporting large-scale training and serving without activating all parameters on every forward pass. Recent MoE scaling-law studies provide guidance for selecting the number of experts and activation ratios under fixed compute and memory budgets.
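To make the activation pattern concrete, below is a minimal PyTorch sketch of top-k expert routing with a shared expert. The toy dimensions and the softmax-then-top-k router are illustrative assumptions, not A.X K1's exact implementation; the model routes each token to 8 of 192 experts plus one always-active shared expert, as noted in the comments.

```python
# Sketch of sparse top-k MoE routing with a shared expert. Toy sizes are used
# here; A.X K1 uses 192 routed experts with top-8 routing plus 1 shared expert.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=256, d_expert=512, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model)
            )
            for _ in range(n_experts)
        )
        # The shared expert processes every token regardless of routing.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model)
        )

    def forward(self, x):                                   # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)          # (tokens, n_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)  # (tokens, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = self.shared_expert(x)
        # Only the selected experts run on each token; everything else stays idle.
        for slot in range(self.top_k):
            for e in indices[:, slot].unique().tolist():
                mask = indices[:, slot] == e
                out[mask] = out[mask] + weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

# Usage: y = MoELayer()(torch.randn(32, 256))
```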
Hybrid Reasoning Fusion (Think / Non-Think)
A.X K1 serves both modes from a single model: explicit reasoning before the answer can be enabled or disabled depending on usage requirements, supporting controlled trade-offs between reasoning depth and response latency (see the usage sketch after the list below).
- Think mode: Generates reasoning steps before producing the answer for complex problem solving and multi-step inferences.
- Non-Think mode: Generates concise, direct responses optimized for low-latency usage.
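A hypothetical sketch of toggling the two modes through the chat template is shown below. The `enable_thinking` keyword and the repository ID are assumptions borrowed from similar hybrid-reasoning models; the actual control mechanism is defined by the model's chat template.

```python
# Hypothetical sketch: toggling reasoning via the chat template.
# "enable_thinking" is an assumed flag; check the model's chat template for
# the actual mechanism. The repository ID is likewise a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("skt/A.X-K1", trust_remote_code=True)
messages = [{"role": "user", "content": "What is 17 * 24?"}]

# Think mode: the template elicits step-by-step reasoning before the answer.
prompt_think = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-Think mode: concise, low-latency responses.
prompt_direct = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```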
Post-MLP RMSNorm
A.X K1 incorporates an additional RMSNorm applied after the MLP (MoE) block in each Transformer layer. This design choice improves training stability in large-scale sparse MoE settings and enhances robustness for both reasoning-intensive and long-context generations.
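A minimal sketch of this normalization placement follows, assuming the post-MLP norm is applied to the MoE output before it re-enters the residual stream; the exact position relative to the residual connection is an assumption here.

```python
# Sketch of pre- and post-MLP RMSNorm placement in a Transformer block.
# The attention and MoE modules are placeholders passed in by the caller.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class TransformerBlock(nn.Module):
    def __init__(self, d_model, attention, moe_mlp):
        super().__init__()
        self.attn_norm = RMSNorm(d_model)
        self.attn = attention
        self.pre_mlp_norm = RMSNorm(d_model)   # standard pre-norm
        self.mlp = moe_mlp
        self.post_mlp_norm = RMSNorm(d_model)  # the additional post-MLP norm

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))
        # Normalizing the MoE output before the residual add bounds the
        # magnitude of expert outputs entering the residual stream.
        x = x + self.post_mlp_norm(self.mlp(self.pre_mlp_norm(x)))
        return x
```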
Multi-Token Prediction (MTP)
During training, A.X K1 employs a multi-token prediction objective: from a single forward pass, the model predicts one future token beyond the standard next-token target, serving as an auxiliary signal that helps stabilize training at scale. At inference time, MTP leaves standard autoregressive decoding unchanged, but the extra prediction head can drive speculative decoding, enabling higher throughput with compatible serving frameworks.
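An illustrative sketch of such an objective: an auxiliary head predicts the token two positions ahead from the same hidden states, and its loss is added with a small weight. The head design and loss weight here are assumptions; the technical report will describe the actual MTP module.

```python
# Illustrative multi-token prediction loss. "mtp_head" and "mtp_weight" are
# assumed names/values, not A.X K1's documented implementation.
import torch
import torch.nn.functional as F

def mtp_loss(hidden, lm_head, mtp_head, input_ids, mtp_weight=0.3):
    # hidden: (batch, seq, d_model) final hidden states from one forward pass
    logits_next = lm_head(hidden)   # standard head: predicts token t+1
    logits_skip = mtp_head(hidden)  # auxiliary head: predicts token t+2

    loss_next = F.cross_entropy(
        logits_next[:, :-1].flatten(0, 1), input_ids[:, 1:].flatten()
    )
    loss_skip = F.cross_entropy(
        logits_skip[:, :-2].flatten(0, 1), input_ids[:, 2:].flatten()
    )
    # The auxiliary loss is down-weighted so the next-token objective dominates.
    return loss_next + mtp_weight * loss_skip
```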
Evaluation
Evaluation results will be released publicly alongside the technical report.
Running Locally
A.X K1 can be served with SGLang and vLLM. The optimal configuration depends on the runtime and version, GPU type and memory, and system-level factors such as networking and infrastructure. Validated configurations will be shared as upstream support and benchmarks mature.
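As a starting point, the sketch below uses vLLM's offline Python API. The repository ID and parallelism setting are assumptions to adapt to your environment, and architecture support in a given vLLM or SGLang release should be verified before deploying.

```python
# Hedged sketch of offline inference with vLLM. The repository ID and
# tensor-parallel degree are placeholders; adjust to your hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="skt/A.X-K1",       # assumed Hub ID
    tensor_parallel_size=8,   # example setting for a multi-GPU node
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(
    ["Explain mixture-of-experts routing in two sentences."], params
)
print(outputs[0].outputs[0].text)
```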
Limitations
- A.X K1 may generate incorrect or misleading information, as is inherent to probabilistic language-model generation.
- Reasoning outputs in Think mode should not be interpreted as faithful representations of the model’s internal decision process.
- Performance may vary across domains and languages depending on data coverage.
Citation
If you use A.X K1 in your research, please cite the technical report:
@techreport{axk1,
  title       = {A.X K1 Technical Report},
  author      = {{SK Telecom}},
  institution = {SK Telecom},
  year        = {2025},
  month       = {January},
  note        = {Technical report, to appear}
}