toread
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding (arXiv:2405.08748) • 23
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection (arXiv:2405.10300) • 30
Chameleon: Mixed-Modal Early-Fusion Foundation Models (arXiv:2405.09818) • 132
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework (arXiv:2405.11143) • 41
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning (arXiv:2405.12130) • 50
FIFO-Diffusion: Generating Infinite Videos from Text without Training (arXiv:2405.11473) • 56
Your Transformer is Secretly Linear (arXiv:2405.12250) • 157
Matryoshka Multimodal Models (arXiv:2405.17430) • 34
An Introduction to Vision-Language Modeling (arXiv:2405.17247) • 90
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models (arXiv:2405.15738) • 46
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (arXiv:2403.03206) • 71
BitsFusion: 1.99 bits Weight Quantization of Diffusion Model (arXiv:2406.04333) • 38
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions (arXiv:2406.04325) • 74
Block Transformer: Global-to-Local Language Modeling for Fast Inference (arXiv:2406.02657) • 41
Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution (arXiv:2307.06304) • 35
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework (arXiv:2404.14619) • 126
Multi-Head Mixture-of-Experts (arXiv:2404.15045) • 60
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models (arXiv:2405.15574) • 55
arXiv:2405.18407 • 48
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality (arXiv:2405.21060) • 68
CRAG -- Comprehensive RAG Benchmark (arXiv:2406.04744) • 46
DiTFastAttn: Attention Compression for Diffusion Transformer Models (arXiv:2406.08552) • 25
arXiv:2406.09414 • 103
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels (arXiv:2406.09415) • 51
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing (arXiv:2406.10601) • 70
Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective Distillation and Unlabeled Data Augmentation (arXiv:2406.12849) • 50
Adam-mini: Use Fewer Learning Rates To Gain More (arXiv:2406.16793) • 69
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation (arXiv:2406.16855) • 57
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion (arXiv:2407.01392) • 44
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output (arXiv:2407.03320) • 94
Video Diffusion Alignment via Reward Gradients (arXiv:2407.08737) • 49
arXiv:2407.10671 • 168
Theia: Distilling Diverse Vision Foundation Models for Robot Learning (arXiv:2407.20179) • 47
Gemma 2: Improving Open Language Models at a Practical Size (arXiv:2408.00118) • 78
The Llama 3 Herd of Models (arXiv:2407.21783) • 117
SAM 2: Segment Anything in Images and Videos (arXiv:2408.00714) • 120
MiniCPM-V: A GPT-4V Level MLLM on Your Phone (arXiv:2408.01800) • 92
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining (arXiv:2408.02657) • 35
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models (arXiv:2408.02718) • 62
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI (arXiv:2408.03361) • 85
An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion (arXiv:2408.03178) • 40
LLaVA-OneVision: Easy Visual Task Transfer (arXiv:2408.03326) • 61
Transformer Explainer: Interactive Learning of Text-Generative Models (arXiv:2408.04619) • 175
ControlNeXt: Powerful and Efficient Control for Image and Video Generation (arXiv:2408.06070) • 55
Qwen2-Audio Technical Report (arXiv:2407.10759) • 64
GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression (arXiv:2407.12077) • 57
Compact Language Models via Pruning and Knowledge Distillation (arXiv:2407.14679) • 39
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models (arXiv:2407.15841) • 40
KAN or MLP: A Fairer Comparison (arXiv:2407.16674) • 43
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence (arXiv:2407.16655) • 30
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person (arXiv:2407.16224) • 29
MeshAnything V2: Artist-Created Mesh Generation With Adjacent Mesh Tokenization (arXiv:2408.02555) • 31
Mixture of Nested Experts: Adaptive Processing of Visual Tokens (arXiv:2407.19985) • 37
Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model (arXiv:2407.16982) • 42
VILA^2: VILA Augmented VILA (arXiv:2407.17453) • 41
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer (arXiv:2408.06072) • 38
arXiv:2408.07009 • 62
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models (arXiv:2408.08872) • 101
MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model (arXiv:2408.10198) • 35
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model (arXiv:2408.11039) • 63
Sapiens: Foundation for Human Vision Models (arXiv:2408.12569) • 94
DreamCinema: Cinematic Transfer with Free Camera and 3D Character (arXiv:2408.12601) • 32
Building and better understanding vision-language models: insights and future directions (arXiv:2408.12637) • 133
LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation (arXiv:2408.13252) • 26
SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher (arXiv:2408.14176) • 62
Foundation Models for Music: A Survey (arXiv:2408.14340) • 44
Diffusion Models Are Real-Time Game Engines (arXiv:2408.14837) • 126
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders (arXiv:2408.15998) • 86
CogVLM2: Visual Language Models for Image and Video Understanding (arXiv:2408.16500) • 57
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling (arXiv:2408.16532) • 50
LinFusion: 1 GPU, 1 Minute, 16K Image (arXiv:2409.02097) • 34
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency (arXiv:2409.02634) • 97
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing (arXiv:2409.01322) • 96
Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation (arXiv:2409.03718) • 27
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models (arXiv:2404.12387) • 39
Dynamic Typography: Bringing Words to Life (arXiv:2404.11614) • 46
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone (arXiv:2404.14219) • 259
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding (arXiv:2404.16710) • 80
Iterative Reasoning Preference Optimization (arXiv:2404.19733) • 49
KAN: Kolmogorov-Arnold Networks (arXiv:2404.19756) • 116
OmniGen: Unified Image Generation (arXiv:2409.11340) • 115
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models (arXiv:2409.07452) • 21
Towards a Unified View of Preference Learning for Large Language Models: A Survey (arXiv:2409.02795) • 72
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think (arXiv:2409.11355) • 30
Qwen2.5-Coder Technical Report (arXiv:2409.12186) • 153
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191) • 78
VideoPoet: A Large Language Model for Zero-Shot Video Generation (arXiv:2312.14125) • 47
Training Language Models to Self-Correct via Reinforcement Learning (arXiv:2409.12917) • 140
Imagine yourself: Tuning-Free Personalized Image Generation (arXiv:2409.13346) • 69
Colorful Diffuse Intrinsic Image Decomposition in the Wild (arXiv:2409.13690) • 13
Emu3: Next-Token Prediction is All You Need (arXiv:2409.18869) • 97
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning (arXiv:2409.20566) • 55
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models (arXiv:2410.02740) • 54
Loong: Generating Minute-level Long Videos with Autoregressive Language Models (arXiv:2410.02757) • 36
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second (arXiv:2410.02073) • 43
Baichuan-Omni Technical Report (arXiv:2410.08565) • 87
Animate-X: Universal Character Image Animation with Enhanced Motion Representation (arXiv:2410.10306) • 56
Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices (arXiv:2410.11795) • 18
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree (arXiv:2410.16268) • 69
SpectroMotion: Dynamic 3D Reconstruction of Specular Scenes (arXiv:2410.17249) • 44
Movie Gen: A Cast of Media Foundation Models (arXiv:2410.13720) • 100
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens (arXiv:2410.13863) • 37
FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors (arXiv:2410.16271) • 84
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation (arXiv:2410.13861) • 56
Unbounded: A Generative Infinite Game of Character Life Simulation (arXiv:2410.18975) • 37
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss (arXiv:2410.17243) • 92
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think (arXiv:2410.06940) • 12
Addition is All You Need for Energy-efficient Language Models (arXiv:2410.00907) • 151
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (arXiv:2410.13848) • 35
Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations (arXiv:2410.10792) • 31
CLEAR: Character Unlearning in Textual and Visual Modalities (arXiv:2410.18057) • 209
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation (arXiv:2411.04997) • 39
Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models (arXiv:2411.07232) • 68
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision (arXiv:2411.07199) • 50
Large Language Models Can Self-Improve in Long-context Reasoning (arXiv:2411.08147) • 65
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation (arXiv:2411.08380) • 25
LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models (arXiv:2411.09595) • 77
MagicQuill: An Intelligent Interactive Image Editing System (arXiv:2411.09703) • 80
LLaVA-o1: Let Vision Language Models Reason Step-by-Step (arXiv:2411.10440) • 129
Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement (arXiv:2411.06558) • 36
AnimateAnything: Consistent and Controllable Animation for Video Generation (arXiv:2411.10836) • 24
RedPajama: an Open Dataset for Training Large Language Models (arXiv:2411.12372) • 57
SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration (arXiv:2411.10958) • 57
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory (arXiv:2411.11922) • 19
Stable Flow: Vital Layers for Training-Free Image Editing (arXiv:2411.14430) • 22
Style-Friendly SNR Sampler for Style-Driven Generation (arXiv:2411.14793) • 39
Star Attention: Efficient LLM Inference over Long Sequences (arXiv:2411.17116) • 53
Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator (arXiv:2411.15466) • 39
Material Anything: Generating Materials for Any 3D Object via Diffusion (arXiv:2411.15138) • 50
OminiControl: Minimal and Universal Control for Diffusion Transformer (arXiv:2411.15098) • 61
WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model (arXiv:2411.17459) • 12
VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation (arXiv:2412.02259) • 60
Critical Tokens Matter: Token-Level Contrastive Estimation Enhence LLM's Reasoning Capability (arXiv:2411.19943) • 62
PaliGemma 2: A Family of Versatile VLMs for Transfer (arXiv:2412.03555) • 133
SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance (arXiv:2412.02687) • 113
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation (arXiv:2412.03069) • 34
Imagine360: Immersive 360 Video Generation from Perspective Anchor (arXiv:2412.03552) • 29
Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion (arXiv:2412.03515) • 27
FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait (arXiv:2412.01064) • 47
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation (arXiv:2412.00927) • 29
Open-Sora Plan: Open-Source Large Video Generation Model (arXiv:2412.00131) • 33
SpotLight: Shadow-Guided Object Relighting via Diffusion (arXiv:2411.18665) • 3
Video Depth without Video Models (arXiv:2411.19189) • 39
TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models (arXiv:2411.18350) • 28
CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models (arXiv:2411.18613) • 59
Pathways on the Image Manifold: Image Editing via Video Generation (arXiv:2411.16819) • 37
Identity-Preserving Text-to-Video Generation by Frequency Decomposition (arXiv:2411.17440) • 37
ROICtrl: Boosting Instance Control for Visual Generation (arXiv:2411.17949) • 87
LumiNet: Latent Intrinsics Meets Diffusion Models for Indoor Scene Relighting (arXiv:2412.00177) • 8
VisionZip: Longer is Better but Not Necessary in Vision Language Models (arXiv:2412.04467) • 117
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion (arXiv:2412.04424) • 62
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction (arXiv:2412.04454) • 71
Structured 3D Latents for Scalable and Versatile 3D Generation (arXiv:2412.01506) • 86
A Noise is Worth Diffusion Guidance (arXiv:2412.03895) • 29
AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models (arXiv:2412.04146) • 23
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (arXiv:2412.05271) • 160
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion (arXiv:2412.04301) • 40
APOLLO: SGD-like Memory, AdamW-level Performance (arXiv:2412.05270) • 37
STIV: Scalable Text and Image Conditioned Video Generation (arXiv:2412.07730) • 74
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics (arXiv:2412.07774) • 30
Video Motion Transfer with Diffusion Transformers (arXiv:2412.07776) • 17
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints (arXiv:2412.07760) • 55
StyleMaster: Stylize Your Video with Artistic Generation and Translation (arXiv:2412.07744) • 20
Track4Gen: Teaching Video Diffusion Models to Track Points Improves Video Generation (arXiv:2412.06016) • 20
Learning Flow Fields in Attention for Controllable Person Image Generation (arXiv:2412.08486) • 36
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions (arXiv:2412.09596) • 97
arXiv:2412.08905 • 122
Neural LightRig: Unlocking Accurate Object Normal and Material Estimation with Multi-Light Diffusion (arXiv:2412.09593) • 18
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution (arXiv:2412.15213) • 28
Parallelized Autoregressive Visual Generation (arXiv:2412.15119) • 53
SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation (arXiv:2412.13649) • 21
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners (arXiv:2412.17256) • 47
arXiv:2412.15115 • 377
Apollo: An Exploration of Video Understanding in Large Multimodal Models (arXiv:2412.10360) • 147
GenEx: Generating an Explorable World (arXiv:2412.09624) • 98
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding (arXiv:2412.09604) • 38
FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion (arXiv:2412.09626) • 21
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption (arXiv:2412.09283) • 19
Byte Latent Transformer: Patches Scale Better Than Tokens (arXiv:2412.09871) • 108
BrushEdit: All-In-One Image Inpainting and Editing (arXiv:2412.10316) • 36
ColorFlow: Retrieval-Augmented Image Sequence Colorization (arXiv:2412.11815) • 26
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces (arXiv:2412.14171) • 24
Diffusion360: Seamless 360 Degree Panoramic Image Generation based on Diffusion Models (arXiv:2311.13141) • 16
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining (arXiv:2501.00958) • 109
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control (arXiv:2501.01427) • 53
LTX-Video: Realtime Video Latent Diffusion (arXiv:2501.00103) • 50
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models (arXiv:2501.01423) • 44
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis (arXiv:2412.19723) • 87
arXiv:2412.18653 • 86
Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models (arXiv:2412.18605) • 21
DepthLab: From Partial to Complete (arXiv:2412.18153) • 36
Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization (arXiv:2412.17739) • 41
DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes (arXiv:2412.11100) • 7
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training (arXiv:2412.09619) • 30
PIG: Physics-Informed Gaussians as Adaptive Parametric Mesh Representations (arXiv:2412.05994) • 19
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations (arXiv:2412.08580) • 45
StreamChat: Chatting with Streaming Video (arXiv:2412.08646) • 18
Generative Densification: Learning to Densify Gaussians for High-Fidelity Generalizable 3D Reconstruction (arXiv:2412.06234) • 19
ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer (arXiv:2412.07720) • 31
Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation (arXiv:2412.06781) • 23
3D Convex Splatting: Radiance Field Rendering with 3D Smooth Convexes (arXiv:2411.14974) • 15
TEXGen: a Generative Diffusion Model for Mesh Textures (arXiv:2411.14740) • 17
SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis (arXiv:2411.16443) • 11
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models (arXiv:2411.04996) • 50
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion (arXiv:2411.04928) • 56
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning (arXiv:2411.05003) • 71
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization (arXiv:2411.02355) • 51
How Far is Video Generation from World Model: A Physical Law Perspective (arXiv:2411.02385) • 34
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent (arXiv:2411.02265) • 25
Adaptive Caching for Faster Video Generation with Diffusion Transformers (arXiv:2411.02397) • 23
MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D (arXiv:2411.02336) • 24
AutoVFX: Physically Realistic Video Editing from Natural Language Instructions (arXiv:2411.02394) • 16
GenXD: Generating Any 3D and 4D Scenes (arXiv:2411.02319) • 20
Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders (arXiv:2410.22366) • 84
One Shot, One Talk: Whole-body Talking Avatar from a Single Image (arXiv:2412.01106) • 24
Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling (arXiv:2411.18664) • 24
FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion (arXiv:2411.18552) • 18
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing (arXiv:2412.04280) • 13
MV-Adapter: Multi-view Consistent Image Generation Made Easy (arXiv:2412.03632) • 24
PanoDreamer: 3D Panorama Synthesis from a Single Image (arXiv:2412.04827) • 10
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration (arXiv:2412.04440) • 22
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos (arXiv:2501.04001) • 47
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models (arXiv:2501.02955) • 44
Cosmos World Foundation Model Platform for Physical AI (arXiv:2501.03575) • 82
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction (arXiv:2501.03218) • 35
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution (arXiv:2501.02976) • 56
An Empirical Study of Autoregressive Pre-training from Videos (arXiv:2501.05453) • 41
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints (arXiv:2501.03841) • 56
VideoRAG: Retrieval-Augmented Generation over Video Corpus (arXiv:2501.05874) • 75
GameFactory: Creating New Games with Generative Interactive Videos (arXiv:2501.08325) • 67
CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation (arXiv:2501.09433) • 18
Do generative video models learn physical principles from watching videos? (arXiv:2501.09038) • 34
OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking (arXiv:2501.09751) • 46
Diffusion Adversarial Post-Training for One-Step Video Generation (arXiv:2501.08316) • 36
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks (arXiv:2501.08326) • 34
MangaNinja: Line Art Colorization with Precise Reference Following (arXiv:2501.08332) • 62
VideoAuteur: Towards Long Narrative Video Generation (arXiv:2501.06173) • 31
Tensor Product Attention Is All You Need (arXiv:2501.06425) • 90
Evolving Deeper LLM Thinking (arXiv:2501.09891) • 115
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces (arXiv:2501.12909) • 74
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding (arXiv:2501.13106) • 90
The Lessons of Developing Process Reward Models in Mathematical Reasoning (arXiv:2501.07301) • 100
FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models (arXiv:2412.08629) • 13
SRMT: Shared Memory for Multi-agent Lifelong Pathfinding (arXiv:2501.13200) • 69
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (arXiv:2311.06242) • 95
Elucidating the Design Space of Diffusion-Based Generative Models (arXiv:2206.00364) • 18
Improving Video Generation with Human Feedback (arXiv:2501.13918) • 52
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step (arXiv:2501.13926) • 43
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training (arXiv:2501.17161) • 124
DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation (arXiv:2501.16764) • 22
MatAnyone: Stable Video Matting with Consistent Memory Propagation (arXiv:2501.14677) • 34
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning (arXiv:2411.04983) • 13
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (arXiv:2401.10774) • 59
SAMPart3D: Segment Any Part in 3D Objects (arXiv:2411.07184) • 28
SliderSpace: Decomposing the Visual Capabilities of Diffusion Models (arXiv:2502.01639) • 26