Diffusers documentation
Cosmos
Cosmos World Foundation Model Platform for Physical AI by NVIDIA.
Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via https://github.com/NVIDIA/Cosmos.
Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
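As a quick illustration of component reuse, the sketch below loads the Predict1 text-to-world pipeline once and then passes its T5 text encoder and tokenizer into the video-to-world pipeline, so the large encoder is only loaded once. This is a minimal sketch, assuming both checkpoints share the same text encoder (both pipelines below document the t5-11b variant):

import torch
from diffusers import CosmosTextToWorldPipeline, CosmosVideoToWorldPipeline

# Load the text-to-world pipeline normally.
pipe_t2w = CosmosTextToWorldPipeline.from_pretrained(
    "nvidia/Cosmos-1.0-Diffusion-7B-Text2World", torch_dtype=torch.bfloat16
)

# Reuse the already-loaded text encoder and tokenizer instead of loading a second copy.
pipe_v2w = CosmosVideoToWorldPipeline.from_pretrained(
    "nvidia/Cosmos-1.0-Diffusion-7B-Video2World",
    text_encoder=pipe_t2w.text_encoder,
    tokenizer=pipe_t2w.tokenizer,
    torch_dtype=torch.bfloat16,
)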
Loading original format checkpoints
Checkpoints in their original format, which have not yet been converted to the layout Diffusers expects, can be loaded with the from_single_file method.
import torch
from diffusers import Cosmos2TextToImagePipeline, CosmosTransformer3DModel

model_id = "nvidia/Cosmos-Predict2-2B-Text2Image"

# Load the transformer from the original-format checkpoint file.
transformer = CosmosTransformer3DModel.from_single_file(
    "https://huggingface.co/nvidia/Cosmos-Predict2-2B-Text2Image/blob/main/model.pt",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Build the rest of the pipeline from the Hub repository, swapping in the single-file transformer.
pipe = Cosmos2TextToImagePipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.to("cuda")
prompt = "A close-up shot captures a vibrant yellow scrubber vigorously working on a grimy plate, its bristles moving in circular motions to lift stubborn grease and food residue. The dish, once covered in remnants of a hearty meal, gradually reveals its original glossy surface. Suds form and bubble around the scrubber, creating a satisfying visual of cleanliness in progress. The sound of scrubbing fills the air, accompanied by the gentle clinking of the dish against the sink. As the scrubber continues its task, the dish transforms, gleaming under the bright kitchen lights, symbolizing the triumph of cleanliness over mess."
negative_prompt = "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality."
output = pipe(
    prompt=prompt, negative_prompt=negative_prompt, generator=torch.Generator().manual_seed(1)
).images[0]
output.save("output.png")

CosmosTextToWorldPipeline
class diffusers.CosmosTextToWorldPipeline
< source >( text_encoder: T5EncoderModel tokenizer: T5TokenizerFast transformer: CosmosTransformer3DModel vae: AutoencoderKLCosmos scheduler: EDMEulerScheduler safety_checker: CosmosSafetyChecker = None )
Parameters
- text_encoder (T5EncoderModel) — Frozen text-encoder. Cosmos uses T5; specifically the t5-11b variant.
- tokenizer (T5TokenizerFast) — Tokenizer of class T5TokenizerFast.
- transformer (CosmosTransformer3DModel) — Conditional Transformer to denoise the encoded video latents.
- vae (AutoencoderKLCosmos) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
- scheduler (EDMEulerScheduler) — A scheduler to be used in combination with transformer to denoise the encoded video latents.
Pipeline for text-to-world generation using Cosmos Predict1.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 704 width: int = 1280 num_frames: int = 121 num_inference_steps: int = 36 guidance_scale: float = 7.0 fps: int = 30 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) → ~CosmosPipelineOutput or tuple
Parameters
- prompt (str or List[str], optional) — The prompt or prompts to guide the generation. If not defined, one has to pass prompt_embeds instead.
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., when guidance_scale is less than 1).
- height (int, defaults to 704) — The height in pixels of the generated video.
- width (int, defaults to 1280) — The width in pixels of the generated video.
- num_frames (int, defaults to 121) — The number of frames in the generated video.
- num_inference_steps (int, defaults to 36) — The number of denoising steps. More denoising steps usually lead to a higher quality result at the expense of slower inference.
- guidance_scale (float, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1.
- fps (int, defaults to 30) — The frames per second of the generated video.
- num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
- generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
- latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
- prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between PIL.Image or np.array.
- return_dict (bool, optional, defaults to True) — Whether or not to return a CosmosPipelineOutput instead of a plain tuple.
- callback_on_step_end (Callable, PipelineCallback, or MultiPipelineCallbacks, optional) — A function, or a subclass of PipelineCallback or MultiPipelineCallbacks, that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
- callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
- max_sequence_length (int, defaults to 512) — The maximum number of tokens in the prompt. If the prompt exceeds this length, it will be truncated; if it is shorter, it will be padded.
Returns
~CosmosPipelineOutput or tuple
If return_dict is True, CosmosPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated videos.
The call function to the pipeline for generation.
Examples:
>>> import torch
>>> from diffusers import CosmosTextToWorldPipeline
>>> from diffusers.utils import export_to_video
>>> model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Text2World"
>>> pipe = CosmosTextToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> prompt = "A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."
>>> output = pipe(prompt=prompt).frames[0]
>>> export_to_video(output, "output.mp4", fps=30)

encode_prompt
< source >( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )
Parameters
- prompt (str or List[str], optional) — prompt to be encoded
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
- num_videos_per_prompt (int, optional, defaults to 1) — Number of videos that should be generated per prompt.
- prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- max_sequence_length (int, defaults to 512) — Maximum sequence length to use for the prompt.
- device (torch.device, optional) — torch device to place the resulting embeddings on
- dtype (torch.dtype, optional) — torch dtype
Encodes the prompt into text encoder hidden states.
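Since encoding runs the full T5 encoder, embeddings can be precomputed once and reused across calls. The following is a minimal sketch, assuming pipe is a CosmosTextToWorldPipeline loaded as in the example above and that encode_prompt returns the positive and negative embeddings:

import torch

# Encode once on the GPU (the prompt strings here are hypothetical placeholders).
prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(
    prompt="A robot inspects crates in a warehouse",
    negative_prompt="low quality, blurry",
    device=torch.device("cuda"),
)

# Reuse the cached embeddings; the text encoder is skipped on this call.
video = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
).frames[0]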
CosmosVideoToWorldPipeline
class diffusers.CosmosVideoToWorldPipeline
< source >( text_encoder: T5EncoderModel tokenizer: T5TokenizerFast transformer: CosmosTransformer3DModel vae: AutoencoderKLCosmos scheduler: EDMEulerScheduler safety_checker: CosmosSafetyChecker = None )
Parameters
- text_encoder (T5EncoderModel) — Frozen text-encoder. Cosmos uses T5; specifically the t5-11b variant.
- tokenizer (T5TokenizerFast) — Tokenizer of class T5TokenizerFast.
- transformer (CosmosTransformer3DModel) — Conditional Transformer to denoise the encoded video latents.
- vae (AutoencoderKLCosmos) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
- scheduler (EDMEulerScheduler) — A scheduler to be used in combination with transformer to denoise the encoded video latents.
Pipeline for image-to-world and video-to-world generation using Cosmos Predict1.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
__call__
< source >( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None video: typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]] = None prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 704 width: int = 1280 num_frames: int = 121 num_inference_steps: int = 36 guidance_scale: float = 7.0 input_frames_guidance: bool = False augment_sigma: float = 0.001 fps: int = 30 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) → ~CosmosPipelineOutput or tuple
Parameters
- image (PIL.Image.Image, np.ndarray, torch.Tensor, optional) — The image to be used as a conditioning input for the video generation.
- video (List[PIL.Image.Image], np.ndarray, torch.Tensor, optional) — The video to be used as a conditioning input for the video generation.
- prompt (str or List[str], optional) — The prompt or prompts to guide the generation. If not defined, one has to pass prompt_embeds instead.
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., when guidance_scale is less than 1).
- height (int, defaults to 704) — The height in pixels of the generated video.
- width (int, defaults to 1280) — The width in pixels of the generated video.
- num_frames (int, defaults to 121) — The number of frames in the generated video.
- num_inference_steps (int, defaults to 36) — The number of denoising steps. More denoising steps usually lead to a higher quality result at the expense of slower inference.
- guidance_scale (float, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1.
- fps (int, defaults to 30) — The frames per second of the generated video.
- num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
- generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
- latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
- prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between PIL.Image or np.array.
- return_dict (bool, optional, defaults to True) — Whether or not to return a CosmosPipelineOutput instead of a plain tuple.
- callback_on_step_end (Callable, PipelineCallback, or MultiPipelineCallbacks, optional) — A function, or a subclass of PipelineCallback or MultiPipelineCallbacks, that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs. A usage sketch follows at the end of this pipeline's section.
- callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
- max_sequence_length (int, defaults to 512) — The maximum number of tokens in the prompt. If the prompt exceeds this length, it will be truncated; if it is shorter, it will be padded.
Returns
~CosmosPipelineOutput or tuple
If return_dict is True, CosmosPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated videos.
The call function to the pipeline for generation.
Examples:
Image conditioning:
>>> import torch
>>> from diffusers import CosmosVideoToWorldPipeline
>>> from diffusers.utils import export_to_video, load_image
>>> model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Video2World"
>>> pipe = CosmosVideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> prompt = "The video depicts a long, straight highway stretching into the distance, flanked by metal guardrails. The road is divided into multiple lanes, with a few vehicles visible in the far distance. The surrounding landscape features dry, grassy fields on one side and rolling hills on the other. The sky is mostly clear with a few scattered clouds, suggesting a bright, sunny day."
>>> image = load_image(
... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input.jpg"
... )
>>> video = pipe(image=image, prompt=prompt).frames[0]
>>> export_to_video(video, "output.mp4", fps=30)

Video conditioning:
>>> import torch
>>> from diffusers import CosmosVideoToWorldPipeline
>>> from diffusers.utils import export_to_video, load_video
>>> model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Video2World"
>>> pipe = CosmosVideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.transformer = torch.compile(pipe.transformer)
>>> pipe.to("cuda")
>>> prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
>>> video = load_video(
... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
... )[
... :21
... ] # This example uses only the first 21 frames
>>> video = pipe(video=video, prompt=prompt).frames[0]
>>> export_to_video(video, "output.mp4", fps=30)

encode_prompt
< source >( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )
Parameters
- prompt (str or List[str], optional) — prompt to be encoded
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
- num_videos_per_prompt (int, optional, defaults to 1) — Number of videos that should be generated per prompt.
- prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- max_sequence_length (int, defaults to 512) — Maximum sequence length to use for the prompt.
- device (torch.device, optional) — torch device to place the resulting embeddings on
- dtype (torch.dtype, optional) — torch dtype
Encodes the prompt into text encoder hidden states.
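As referenced in the __call__ parameters above, callback_on_step_end receives the pipeline, the step index, the timestep, and a dict of the tensors named in callback_on_step_end_tensor_inputs, and must return that dict. A minimal logging sketch, assuming pipe and prompt are defined as in the examples above:

def log_latents(pipeline, step, timestep, callback_kwargs):
    # Tensors listed in callback_on_step_end_tensor_inputs are available here.
    latents = callback_kwargs["latents"]
    print(f"step {step}: latent norm = {latents.norm().item():.2f}")
    return callback_kwargs

video = pipe(
    prompt=prompt,
    callback_on_step_end=log_latents,
    callback_on_step_end_tensor_inputs=["latents"],
).frames[0]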
Cosmos2TextToImagePipeline
class diffusers.Cosmos2TextToImagePipeline
< source >( text_encoder: T5EncoderModel tokenizer: T5TokenizerFast transformer: CosmosTransformer3DModel vae: AutoencoderKLWan scheduler: FlowMatchEulerDiscreteScheduler safety_checker: CosmosSafetyChecker = None )
Parameters
- text_encoder (T5EncoderModel) — Frozen text-encoder. Cosmos uses T5; specifically the t5-11b variant.
- tokenizer (T5TokenizerFast) — Tokenizer of class T5TokenizerFast.
- transformer (CosmosTransformer3DModel) — Conditional Transformer to denoise the encoded image latents.
- vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
- scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
Pipeline for text-to-image generation using Cosmos Predict2.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
__call__
< source >( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 768 width: int = 1360 num_inference_steps: int = 35 guidance_scale: float = 7.0 num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) → ~CosmosImagePipelineOutput or tuple
Parameters
- prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., when guidance_scale is less than 1).
- height (int, defaults to 768) — The height in pixels of the generated image.
- width (int, defaults to 1360) — The width in pixels of the generated image.
- num_inference_steps (int, defaults to 35) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
- guidance_scale (float, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1.
- num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
- generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
- latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
- prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
- return_dict (bool, optional, defaults to True) — Whether or not to return a CosmosImagePipelineOutput instead of a plain tuple.
- callback_on_step_end (Callable, PipelineCallback, or MultiPipelineCallbacks, optional) — A function, or a subclass of PipelineCallback or MultiPipelineCallbacks, that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
- callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
- max_sequence_length (int, defaults to 512) — The maximum number of tokens in the prompt. If the prompt exceeds this length, it will be truncated; if it is shorter, it will be padded.
Returns
~CosmosImagePipelineOutput or tuple
If return_dict is True, CosmosImagePipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images.
The call function to the pipeline for generation.
Examples:
>>> import torch
>>> from diffusers import Cosmos2TextToImagePipeline
>>> # Available checkpoints: nvidia/Cosmos-Predict2-2B-Text2Image, nvidia/Cosmos-Predict2-14B-Text2Image
>>> model_id = "nvidia/Cosmos-Predict2-2B-Text2Image"
>>> pipe = Cosmos2TextToImagePipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> prompt = "A close-up shot captures a vibrant yellow scrubber vigorously working on a grimy plate, its bristles moving in circular motions to lift stubborn grease and food residue. The dish, once covered in remnants of a hearty meal, gradually reveals its original glossy surface. Suds form and bubble around the scrubber, creating a satisfying visual of cleanliness in progress. The sound of scrubbing fills the air, accompanied by the gentle clinking of the dish against the sink. As the scrubber continues its task, the dish transforms, gleaming under the bright kitchen lights, symbolizing the triumph of cleanliness over mess."
>>> negative_prompt = "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality."
>>> output = pipe(
... prompt=prompt, negative_prompt=negative_prompt, generator=torch.Generator().manual_seed(1)
... ).images[0]
>>> output.save("output.png")

encode_prompt
< source >( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_images_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )
Parameters
- prompt (str or List[str], optional) — prompt to be encoded
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
- num_images_per_prompt (int, optional, defaults to 1) — Number of images that should be generated per prompt.
- prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- max_sequence_length (int, defaults to 512) — Maximum sequence length to use for the prompt.
- device (torch.device, optional) — torch device to place the resulting embeddings on
- dtype (torch.dtype, optional) — torch dtype
Encodes the prompt into text encoder hidden states.
Cosmos2VideoToWorldPipeline
class diffusers.Cosmos2VideoToWorldPipeline
< source >( text_encoder: T5EncoderModel tokenizer: T5TokenizerFast transformer: CosmosTransformer3DModel vae: AutoencoderKLWan scheduler: FlowMatchEulerDiscreteScheduler safety_checker: CosmosSafetyChecker = None )
Parameters
- text_encoder (T5EncoderModel) — Frozen text-encoder. Cosmos uses T5; specifically the t5-11b variant.
- tokenizer (T5TokenizerFast) — Tokenizer of class T5TokenizerFast.
- transformer (CosmosTransformer3DModel) — Conditional Transformer to denoise the encoded video latents.
- vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
- scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded video latents.
Pipeline for video-to-world generation using Cosmos Predict2.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
__call__
< source >( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None video: typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]] = None prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 704 width: int = 1280 num_frames: int = 93 num_inference_steps: int = 35 guidance_scale: float = 7.0 fps: int = 16 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 sigma_conditioning: float = 0.0001 ) → ~CosmosPipelineOutput or tuple
Parameters
- image (PIL.Image.Image, np.ndarray, torch.Tensor, optional) — The image to be used as a conditioning input for the video generation.
- video (List[PIL.Image.Image], np.ndarray, torch.Tensor, optional) — The video to be used as a conditioning input for the video generation.
- prompt (str or List[str], optional) — The prompt or prompts to guide the generation. If not defined, one has to pass prompt_embeds instead.
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., when guidance_scale is less than 1).
- height (int, defaults to 704) — The height in pixels of the generated video.
- width (int, defaults to 1280) — The width in pixels of the generated video.
- num_frames (int, defaults to 93) — The number of frames in the generated video.
- num_inference_steps (int, defaults to 35) — The number of denoising steps. More denoising steps usually lead to a higher quality result at the expense of slower inference.
- guidance_scale (float, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1.
- fps (int, defaults to 16) — The frames per second of the generated video.
- num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
- generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
- latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
- prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between PIL.Image or np.array.
- return_dict (bool, optional, defaults to True) — Whether or not to return a CosmosPipelineOutput instead of a plain tuple.
- callback_on_step_end (Callable, PipelineCallback, or MultiPipelineCallbacks, optional) — A function, or a subclass of PipelineCallback or MultiPipelineCallbacks, that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
- callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
- max_sequence_length (int, defaults to 512) — The maximum number of tokens in the prompt. If the prompt exceeds this length, it will be truncated; if it is shorter, it will be padded.
- sigma_conditioning (float, defaults to 0.0001) — The sigma value used for scaling conditioning latents. Ideally, it should not be changed, or should be set to a small value close to zero.
Returns
~CosmosPipelineOutput or tuple
If return_dict is True, CosmosPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated videos.
The call function to the pipeline for generation.
Examples:
>>> import torch
>>> from diffusers import Cosmos2VideoToWorldPipeline
>>> from diffusers.utils import export_to_video, load_image
>>> # Available checkpoints: nvidia/Cosmos-Predict2-2B-Video2World, nvidia/Cosmos-Predict2-14B-Video2World
>>> model_id = "nvidia/Cosmos-Predict2-2B-Video2World"
>>> pipe = Cosmos2VideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> prompt = "A close-up shot captures a vibrant yellow scrubber vigorously working on a grimy plate, its bristles moving in circular motions to lift stubborn grease and food residue. The dish, once covered in remnants of a hearty meal, gradually reveals its original glossy surface. Suds form and bubble around the scrubber, creating a satisfying visual of cleanliness in progress. The sound of scrubbing fills the air, accompanied by the gentle clinking of the dish against the sink. As the scrubber continues its task, the dish transforms, gleaming under the bright kitchen lights, symbolizing the triumph of cleanliness over mess."
>>> negative_prompt = "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality."
>>> image = load_image(
... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/yellow-scrubber.png"
... )
>>> video = pipe(
... image=image, prompt=prompt, negative_prompt=negative_prompt, generator=torch.Generator().manual_seed(1)
... ).frames[0]
>>> export_to_video(video, "output.mp4", fps=16)

encode_prompt
< source >( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )
Parameters
- prompt (str or List[str], optional) — prompt to be encoded
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
- num_videos_per_prompt (int, optional, defaults to 1) — Number of videos that should be generated per prompt.
- prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- max_sequence_length (int, defaults to 512) — Maximum sequence length to use for the prompt.
- device (torch.device, optional) — torch device to place the resulting embeddings on
- dtype (torch.dtype, optional) — torch dtype
Encodes the prompt into text encoder hidden states.
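For the larger checkpoints mentioned in the examples (e.g. the 14B Video2World variant), the standard DiffusionPipeline memory helpers apply. Below is a minimal sketch using model CPU offloading in place of pipe.to("cuda"):

import torch
from diffusers import Cosmos2VideoToWorldPipeline

pipe = Cosmos2VideoToWorldPipeline.from_pretrained(
    "nvidia/Cosmos-Predict2-14B-Video2World", torch_dtype=torch.bfloat16
)

# Keeps only the active sub-model on the GPU and leaves the rest in CPU RAM.
# Skip pipe.to("cuda") when offloading is enabled.
pipe.enable_model_cpu_offload()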
Cosmos2_5_PredictBasePipeline
class diffusers.Cosmos2_5_PredictBasePipeline
< source >( text_encoder: Qwen2_5_VLForConditionalGeneration tokenizer: AutoTokenizer transformer: CosmosTransformer3DModel vae: AutoencoderKLWan scheduler: UniPCMultistepScheduler safety_checker: CosmosSafetyChecker = None )
Parameters
- text_encoder (Qwen2_5_VLForConditionalGeneration) — Frozen text-encoder. Cosmos Predict2.5 uses the Qwen2.5 VL encoder.
- tokenizer (AutoTokenizer) — Tokenizer associated with the Qwen2.5 VL encoder.
- transformer (CosmosTransformer3DModel) — Conditional Transformer to denoise the encoded video latents.
- vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
- scheduler (UniPCMultistepScheduler) — A scheduler to be used in combination with transformer to denoise the encoded video latents.
Pipeline for the Cosmos Predict2.5 base model.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
__call__
< source >( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None video: typing.Optional[typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]]] = None prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 704 width: int = 1280 num_frames: int = 93 num_inference_steps: int = 36 guidance_scale: float = 7.0 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 conditional_frame_timestep: float = 0.1 ) → ~CosmosPipelineOutput or tuple
Parameters
- image (PIL.Image.Image, np.ndarray, torch.Tensor, optional) — Optional single image for Image2World conditioning. Must be None when video is provided.
- video (List[PIL.Image.Image], np.ndarray, torch.Tensor, optional) — Optional input video for Video2World conditioning. Must be None when image is provided.
- prompt (str or List[str], optional) — The prompt or prompts to guide generation. Required unless prompt_embeds is supplied.
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., when guidance_scale is less than 1).
- height (int, defaults to 704) — The height in pixels of the generated video.
- width (int, defaults to 1280) — The width in pixels of the generated video.
- num_frames (int, defaults to 93) — Number of output frames. Use 93 for world (video) generation; set to 1 to return a single frame.
- num_inference_steps (int, defaults to 36) — The number of denoising steps. More denoising steps usually lead to a higher quality result at the expense of slower inference.
- guidance_scale (float, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1.
- num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
- generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
- latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
- prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between PIL.Image or np.array.
- return_dict (bool, optional, defaults to True) — Whether or not to return a CosmosPipelineOutput instead of a plain tuple.
- callback_on_step_end (Callable, PipelineCallback, or MultiPipelineCallbacks, optional) — A function, or a subclass of PipelineCallback or MultiPipelineCallbacks, that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
- callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
- max_sequence_length (int, defaults to 512) — The maximum number of tokens in the prompt. If the prompt exceeds this length, it will be truncated; if it is shorter, it will be padded.
Returns
~CosmosPipelineOutput or tuple
If return_dict is True, CosmosPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated videos.
The call function to the pipeline for generation. Supports three modes:
- Text2World: image=None, video=None, prompt provided. Generates a world clip.
- Image2World: image provided, video=None, prompt provided. Conditions on a single frame.
- Video2World: video provided, image=None, prompt provided. Conditions on an input clip.
Set num_frames=93 (default) to produce a world video, or num_frames=1 to produce a single image frame, which turns each mode above into its *2Image counterpart (a runnable single-frame sketch appears at the end of this section).
Outputs follow output_type (e.g., "pil" returns a list of num_frames PIL images per prompt).
Examples:
>>> import torch
>>> from diffusers import Cosmos2_5_PredictBasePipeline
>>> from diffusers.utils import export_to_video, load_image, load_video
>>> model_id = "nvidia/Cosmos-Predict2.5-2B"
>>> pipe = Cosmos2_5_PredictBasePipeline.from_pretrained(
...     model_id, revision="diffusers/base/pre-trained", torch_dtype=torch.bfloat16
... )
>>> pipe = pipe.to("cuda")
>>> # Common negative prompt reused across modes.
>>> negative_prompt = (
... "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, "
... "over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, "
... "underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky "
... "movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, "
... "fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. "
... "Overall, the video is of poor quality."
... )
>>> # Text2World: generate a 93-frame world video from text only.
>>> prompt = (
... "As the red light shifts to green, the red bus at the intersection begins to move forward, its headlights "
... "cutting through the falling snow. The snowy tire tracks deepen as the vehicle inches ahead, casting fresh "
... "lines onto the slushy road. Around it, streetlights glow warmer, illuminating the drifting flakes and wet "
... "reflections on the asphalt. Other cars behind start to edge forward, their beams joining the scene. "
... "The stillness of the urban street transitions into motion as the quiet snowfall is punctuated by the slow "
... "advance of traffic through the frosty city corridor."
... )
>>> video = pipe(
... image=None,
... video=None,
... prompt=prompt,
... negative_prompt=negative_prompt,
... num_frames=93,
... generator=torch.Generator().manual_seed(1),
... ).frames[0]
>>> export_to_video(video, "text2world.mp4", fps=16)
>>> # Image2World: condition on a single image and generate a 93-frame world video.
>>> prompt = (
... "A high-definition video captures the precision of robotic welding in an industrial setting. "
... "The first frame showcases a robotic arm, equipped with a welding torch, positioned over a large metal structure. "
... "The welding process is in full swing, with bright sparks and intense light illuminating the scene, creating a vivid "
... "display of blue and white hues. A significant amount of smoke billows around the welding area, partially obscuring "
... "the view but emphasizing the heat and activity. The background reveals parts of the workshop environment, including a "
... "ventilation system and various pieces of machinery, indicating a busy and functional industrial workspace. As the video "
... "progresses, the robotic arm maintains its steady position, continuing the welding process and moving to its left. "
... "The welding torch consistently emits sparks and light, and the smoke continues to rise, diffusing slightly as it moves upward. "
... "The metal surface beneath the torch shows ongoing signs of heating and melting. The scene retains its industrial ambiance, with "
... "the welding sparks and smoke dominating the visual field, underscoring the ongoing nature of the welding operation."
... )
>>> image = load_image(
... "https://media.githubusercontent.com/media/nvidia-cosmos/cosmos-predict2.5/refs/heads/main/assets/base/robot_welding.jpg"
... )
>>> video = pipe(
... image=image,
... video=None,
... prompt=prompt,
... negative_prompt=negative_prompt,
... num_frames=93,
... generator=torch.Generator().manual_seed(1),
... ).frames[0]
>>> export_to_video(video, "image2world.mp4", fps=16)
>>> # Video2World: condition on an input clip and predict a 93-frame world video.
>>> prompt = (
... "The video opens with an aerial view of a large-scale sand mining construction operation, showcasing extensive piles "
... "of brown sand meticulously arranged in parallel rows. A central water channel, fed by a water pipe, flows through the "
... "middle of these sand heaps, creating ripples and movement as it cascades down. The surrounding area features dense green "
... "vegetation on the left, contrasting with the sandy terrain, while a body of water is visible in the background on the right. "
... "As the video progresses, a piece of heavy machinery, likely a bulldozer, enters the frame from the right, moving slowly along "
... "the edge of the sand piles. This machinery's presence indicates ongoing construction work in the operation. The final frame "
... "captures the same scene, with the water continuing its flow and the bulldozer still in motion, maintaining the dynamic yet "
... "steady pace of the construction activity."
... )
>>> input_video = load_video(
... "https://github.com/nvidia-cosmos/cosmos-predict2.5/raw/refs/heads/main/assets/base/sand_mining.mp4"
... )
>>> video = pipe(
... image=None,
... video=input_video,
... prompt=prompt,
... negative_prompt=negative_prompt,
... num_frames=93,
... generator=torch.Generator().manual_seed(1),
... ).frames[0]
>>> export_to_video(video, "video2world.mp4", fps=16)
>>> # To produce an image instead of a world (video) clip, set num_frames=1 and
>>> # save the first frame: pipe(..., num_frames=1).frames[0][0]

encode_prompt
< source >( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )
Parameters
- prompt (str or List[str], optional) — prompt to be encoded
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
- num_videos_per_prompt (int, optional, defaults to 1) — Number of videos that should be generated per prompt.
- prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- max_sequence_length (int, defaults to 512) — Maximum sequence length to use for the prompt.
- device (torch.device, optional) — torch device to place the resulting embeddings on
- dtype (torch.dtype, optional) — torch dtype
Encodes the prompt into text encoder hidden states.
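Expanding the single-frame hint from the example above, the sketch below produces one image instead of a world clip. It is a minimal sketch, reusing pipe, prompt, negative_prompt, and the torch import from the examples above:

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_frames=1,  # single-frame mode: Text2Image instead of Text2World
    generator=torch.Generator().manual_seed(1),
).frames[0][0]  # frames[0] is the frame list for the first prompt; [0] is the single frame
image.save("text2image.png")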
CosmosPipelineOutput
class diffusers.pipelines.cosmos.pipeline_output.CosmosPipelineOutput
< source >( frames: Tensor )
Parameters
- frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs. It can be a nested list of length batch_size, with each sub-list containing denoised PIL image sequences of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).
Output class for Cosmos any-to-world/video pipelines.
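A short sketch of consuming this output, assuming pipe and prompt are defined as in the world-pipeline examples above: with the default output_type="pil", frames is indexed first by batch entry, and each entry can be passed straight to export_to_video.

from diffusers.utils import export_to_video

out = pipe(prompt=prompt)  # CosmosPipelineOutput
frames = out.frames[0]     # list of PIL images for the first prompt
export_to_video(frames, "output.mp4", fps=30)  # use the fps appropriate for your pipeline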
CosmosImagePipelineOutput
class diffusers.pipelines.cosmos.pipeline_output.CosmosImagePipelineOutput
< source >( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )
Parameters
- images (List[PIL.Image.Image] or np.ndarray) — List of denoised PIL images of length batch_size, or a NumPy array of the corresponding image batch.
Output class for Cosmos any-to-image pipelines.