NucleusMoE-Image

NucleusMoE-Image is a text-to-image model that pairs a single-stream DiT with Mixture-of-Experts feed-forward layers, cross-attention to a Qwen3-VL text encoder, and a flow-matching Euler discrete scheduler.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

NucleusMoEImagePipeline

class diffusers.NucleusMoEImagePipeline

( transformer: NucleusMoEImageTransformer2DModel, scheduler: FlowMatchEulerDiscreteScheduler, vae: AutoencoderKLQwenImage, text_encoder: Qwen3VLForConditionalGeneration, processor: Qwen3VLProcessor )

Parameters

transformer (NucleusMoEImageTransformer2DModel) — Conditional transformer (single-stream DiT with Mixture-of-Experts feed-forward layers) that denoises the encoded image latents.
scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
vae (AutoencoderKLQwenImage) — Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
text_encoder (Qwen3VLForConditionalGeneration) — Text encoder for computing prompt embeddings.
processor (Qwen3VLProcessor) — Processor for tokenizing text inputs.

Pipeline for text-to-image generation using NucleusMoE.

This pipeline uses a single-stream DiT with Mixture-of-Experts feed-forward layers, cross-attention to a Qwen3-VL text encoder, and a flow-matching Euler discrete scheduler.
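As a rough illustration of the flow-matching Euler sampling mentioned above, the following sketch integrates a velocity field from pure noise (sigma = 1) down to sigma = 0 with plain Euler steps. It is a toy in pure Python, not the pipeline's code: `linear_sigmas`, `euler_sample`, and `velocity_fn` are hypothetical names, the linear spacing from 1.0 to 1/N is an assumption borrowed from other flow-matching pipelines, and the real scheduler may additionally shift the sigmas.

```python
def linear_sigmas(num_inference_steps):
    """Linearly spaced noise levels from 1.0 down to 1/num_inference_steps (assumed spacing)."""
    n = num_inference_steps
    if n == 1:
        return [1.0]
    step = (1.0 - 1.0 / n) / (n - 1)
    return [1.0 - i * step for i in range(n)]


def euler_sample(noise, velocity_fn, num_inference_steps=50):
    """Integrate dx/dsigma = v(x, sigma) from sigma=1 (noise) to sigma=0 (sample)."""
    sigmas = linear_sigmas(num_inference_steps) + [0.0]  # append the terminal sigma
    x = list(noise)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = velocity_fn(x, sigma)  # stands in for the transformer's prediction
        # Euler update: move along the predicted velocity by the (negative) sigma step.
        x = [xi + (sigma_next - sigma) * vi for xi, vi in zip(x, v)]
    return x


# Toy check: on the straight path x_sigma = (1 - sigma) * data + sigma * noise,
# the exact velocity is noise - data, so Euler integration should land on the data.
data = [0.5, -1.0, 2.0]
noise = [1.5, 0.3, -0.7]
result = euler_sample(noise, lambda x, sigma: [n - d for n, d in zip(noise, data)])
```

In the actual pipeline the velocity comes from the transformer conditioned on the prompt embeddings, and the update is performed by the scheduler rather than by hand.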

__call__

( prompt: str | list[str] = None, negative_prompt: str | list[str] = None, guidance_scale: float = 4.0, height: int | None = None, width: int | None = None, num_inference_steps: int = 50, sigmas: list[float] | None = None, num_images_per_prompt: int = 1, max_sequence_length: int | None = None, return_index: int | None = None, generator: torch._C.Generator | list[torch._C.Generator] | None = None, latents: torch.Tensor | None = None, prompt_embeds: torch.Tensor | None = None, prompt_embeds_mask: torch.Tensor | None = None, negative_prompt_embeds: torch.Tensor | None = None, negative_prompt_embeds_mask: torch.Tensor | None = None, output_type: str | None = 'pil', return_dict: bool = True, attention_kwargs: dict[str, typing.Any] | None = None, callback_on_step_end: typing.Optional[typing.Callable[[int, int, dict], NoneType]] = None, callback_on_step_end_tensor_inputs: list = ['latents'] ) → NucleusMoEImagePipelineOutput or tuple

Parameters

prompt (str or list[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds.
negative_prompt (str or list[str], optional) — The prompt or prompts describing what the image generation should steer away from. If not defined, an empty string is used when guidance_scale > 1.
guidance_scale (float, optional, defaults to 4.0) — Classifier-free guidance scale. Values greater than 1 enable CFG.
return_index (int, optional) — Layer index of the text encoder output to use for the prompt embeddings.
height (int, optional, defaults to self.default_sample_size * self.vae_scale_factor) — The height in pixels of the generated image.
width (int, optional, defaults to self.default_sample_size * self.vae_scale_factor) — The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps.
sigmas (list[float], optional) — Custom sigmas for the denoising schedule. If not defined, a linear schedule is used.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
generator (torch.Generator or list[torch.Generator], optional) — One or a list of torch generators to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents to be used as inputs for image generation.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings.
prompt_embeds_mask (torch.Tensor, optional) — Attention mask for pre-generated text embeddings.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings.
negative_prompt_embeds_mask (torch.Tensor, optional) — Attention mask for pre-generated negative text embeddings.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between "pil", "np", or "latent".
return_dict (bool, optional, defaults to True) — Whether or not to return a NucleusMoEImagePipelineOutput instead of a plain tuple.
attention_kwargs (dict, optional) — Kwargs passed to the attention processor.
callback_on_step_end (Callable, optional) — A function called at the end of each denoising step.
callback_on_step_end_tensor_inputs (list, optional) — Tensor inputs for the callback_on_step_end function.
max_sequence_length (int, defaults to 512) — Maximum sequence length for the text prompt.
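To make the effect of guidance_scale concrete, here is a minimal sketch of the standard classifier-free guidance combination, written in pure Python on small lists. `apply_cfg` is a hypothetical helper, and the pipeline's internal implementation may differ, for example by rescaling the guided prediction.

```python
def apply_cfg(cond, uncond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]    # prediction with the prompt
uncond = [0.0, 1.0]  # prediction with the negative (or empty) prompt
guided = apply_cfg(cond, uncond, guidance_scale=4.0)
```

With a scale of 1 the formula returns the conditional prediction unchanged, which is why only values greater than 1 add any guidance.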

Returns

NucleusMoEImagePipelineOutput or tuple

NucleusMoEImagePipelineOutput if return_dict is True, otherwise a tuple where the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import NucleusMoEImagePipeline

>>> pipe = NucleusMoEImagePipeline.from_pretrained("NucleusAI/NucleusMoE-Image", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> prompt = "A cat holding a sign that says hello world"
>>> image = pipe(prompt, num_inference_steps=50).images[0]
>>> image.save("nucleus_moe.png")
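A slightly fuller call might combine several of the parameters documented above: a seeded generator for reproducibility, a negative prompt, an explicit resolution, and a step-end callback. The sketch below is hedged. The callback signature `(pipe, step, timestep, callback_kwargs)` with a returned dict is assumed to follow the convention of other diffusers pipelines, the 1024x1024 resolution and the negative prompt text are illustrative values, and `log_step` and `generate` are hypothetical helper names.

```python
def log_step(pipe, step, timestep, callback_kwargs):
    """Print progress after each denoising step and return the tensors unchanged."""
    latents = callback_kwargs["latents"]
    print(f"step {step}: latents shape {tuple(latents.shape)}")
    return callback_kwargs


def generate(prompt, seed=0):
    # Imports are local so that defining this function does not load the libraries.
    import torch
    from diffusers import NucleusMoEImagePipeline

    pipe = NucleusMoEImagePipeline.from_pretrained(
        "NucleusAI/NucleusMoE-Image", torch_dtype=torch.bfloat16
    )
    pipe.to("cuda")

    # A seeded generator makes the initial noise, and hence the image, repeatable.
    generator = torch.Generator(device="cuda").manual_seed(seed)
    return pipe(
        prompt,
        negative_prompt="blurry, low quality",  # illustrative value
        guidance_scale=4.0,
        height=1024,  # assumed to be a supported resolution
        width=1024,
        num_inference_steps=50,
        generator=generator,
        callback_on_step_end=log_step,
        callback_on_step_end_tensor_inputs=["latents"],
    ).images[0]
```

Calling `generate("A cat holding a sign that says hello world").save("nucleus_moe_seeded.png")` would then write the image to disk, provided a CUDA device and the model weights are available.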

encode_prompt

Parameters

prompt (str or list[str], optional) — The prompt or prompts to encode.
device (torch.device, optional) — Torch device for the resulting tensors.
num_images_per_prompt (int, defaults to 1) — Number of images to generate per prompt.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Skips encoding when provided.
prompt_embeds_mask (torch.Tensor, optional) — Attention mask for pre-generated embeddings.
max_sequence_length (int, defaults to 1024) — Maximum token length for the encoded prompt.

Encode text prompt(s) into embeddings using the Qwen3-VL text encoder.