Implementing Multimodal AIGC

Multimodal AIGC (AI-Generated Content) refers to the capability of leveraging artificial intelligence to automatically generate various forms of content—such as images and videos—based on text prompts. The JitAi platform deeply integrates multimodal generation capabilities, enabling developers to effortlessly implement innovative features like text-to-image and text-to-video in their applications.

tip

The current version supports only the Alibaba Cloud Bailian platform's Wanx series models. Please ensure that Bailian API keys are configured in the system before use. Support for additional LLM vendors is under development.

Bailian

Text-to-image

Example

Imagine entering a story text and having the system automatically generate matching illustrations for a picture book—this is a real-world application of the JitAi platform's text-to-image capabilities.

Taking children's picture book creation as an example, we've implemented an intelligent text-to-image workflow: through 5 AI processing stages (story segmentation → character design → scene design → illustration style → prompt optimization), the system progressively transforms ordinary story text into professional text-to-image prompts, ultimately generating batches of illustrations with consistent style and coherent narrative. The entire process requires no manual intervention and can complete in minutes what would otherwise take designers hours or even days.

This example demonstrates the core advantage of the JitAi platform: building complex AI workflows by chaining multiple LLM calls. You can combine capabilities like text generation, logical reasoning, and image generation like building blocks to create your own AIGC applications.
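The chaining pattern described above can be sketched as ordinary function composition: each stage consumes the previous stage's output. The sketch below is purely illustrative and uses hypothetical function names with stubbed logic standing in for the LLM calls; it is not JitAi platform API.

```python
# A minimal sketch of chaining AI workflow stages. All names are hypothetical
# and the stage bodies are stubs standing in for real LLM calls.

def segment_story(text):
    # Stage 1: split the story into scenes (stubbed as sentence splitting;
    # the real stage would use an LLM).
    return [s.strip() for s in text.split(".") if s.strip()]

def design_prompt(scene, style):
    # Stages 2-5 condensed: turn one scene into a text-to-image prompt
    # (character design, scene design, style, and prompt optimization
    # would each be their own LLM call in the real workflow).
    return f"{style}, {scene}, children's picture book illustration"

def build_illustration_prompts(story, style="soft watercolor"):
    # Chain the stages: each stage consumes the previous stage's output.
    return [design_prompt(scene, style) for scene in segment_story(story)]

prompts = build_illustration_prompts("A fox finds a lantern. The forest glows.")
```

Each resulting prompt would then be passed to a text-to-image call, producing one illustration per scene with a shared style.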


Calling text-to-image in pages

You can quickly integrate text-to-image functionality in pages.


Open the event panel in the page, create a new basic statement, select AI Large Language Model - Bailian LLM Instance - Text-to-Image, and the corresponding function call code will be generated. Click the Text-to-Image button in the code to open the text-to-image configuration panel.


In the configuration panel, select an appropriate model and configure the prompt and model parameters, then click OK to complete the configuration. For detailed parameter descriptions, refer to Text-to-image parameters.

Calling text-to-image in backend functions

Text-to-image functionality can be called in backend functions (services, model functions, tasks, and events).


In the backend function panel, create a new basic statement, select AI Large Language Model - Bailian LLM Instance - Text-to-Image, and the corresponding function call code will be generated. Click the Text-to-Image button in the code to open the text-to-image configuration panel.


In the configuration panel, select an appropriate model and configure the prompt and model parameters, then click OK to complete the configuration. For detailed parameter descriptions, refer to Text-to-image parameters.

Text-to-video

Example

Text-to-video goes beyond simply converting text to video—it's a creative process that transforms static text into dynamic visual narratives.

Taking classical poetry visualization as an example, we've built a professional text-to-video workflow: through 5 specialized AI processing stages (paragraph segmentation → script writing → cinematography design → visual effects post-production → prompt optimization), the system progressively transforms classical poetry or classical Chinese text into professional, cinema-grade video prompts. Each stage simulates real film production workflows—from storyboard design and camera language planning to lighting and color grading—ultimately generating batches of short video clips with consistent style and smooth cinematography.

Compared to traditional video production, which requires collaboration across multiple stages like scriptwriting, directing, cinematography, and post-production, this workflow encodes professional knowledge into AI prompts, enabling ordinary users to create professional-quality video content. This is the unique value of the JitAi platform: condensing complex professional knowledge into reusable AI workflows, democratizing content creation.


Calling text-to-video in pages

Usage is consistent with text-to-image. Refer to Calling text-to-image in pages. For configuration parameters, refer to Text-to-video parameters.

Calling text-to-video in backend functions

Usage is consistent with text-to-image. Refer to Calling text-to-image in backend functions. For configuration parameters, refer to Text-to-video parameters.

Image-to-image

Image-to-image supports image editing and creation based on reference images, including single-image editing and multi-reference image generation modes. By inputting images and text prompts, you can achieve creative functions like image style transfer, element replacement, and multi-image fusion.

Calling image-to-image in pages

Usage is consistent with text-to-image. Refer to Calling text-to-image in pages. For configuration parameters, refer to Image-to-image parameters.

Calling image-to-image in backend functions

Usage is consistent with text-to-image. Refer to Calling text-to-image in backend functions. For configuration parameters, refer to Image-to-image parameters.

Image-to-video

Image-to-video supports generating dynamic video content based on a first-frame image. By uploading an image as the starting frame and combining it with text prompts or effect templates, the system automatically generates smooth video animations. This feature is particularly suitable for adding dynamic effects to static images, creating product showcases, character animations, and more.

Calling image-to-video in pages

Usage is consistent with text-to-image. Refer to Calling text-to-image in pages. For configuration parameters, refer to Image-to-video parameters.

Calling image-to-video in backend functions

Usage is consistent with text-to-image. Refer to Calling text-to-image in backend functions. For configuration parameters, refer to Image-to-video parameters.

Keyframe-to-video

Keyframe-to-video supports automatically generating intermediate transition animations by specifying first and last frame keyframe images. This feature enables precise control over the video's starting and ending frames, making it suitable for creating transition animations, state change demonstrations, and other video content that requires clear start and end states.

Calling keyframe-to-video in pages

Usage is consistent with text-to-image. Refer to Calling text-to-image in pages. For configuration parameters, refer to Keyframe-to-video parameters.

Calling keyframe-to-video in backend functions

Usage is consistent with text-to-image. Refer to Calling text-to-image in backend functions. For configuration parameters, refer to Keyframe-to-video parameters.

Parameter reference

Text-to-image parameters

Model selection

The system supports 6 text-to-image models from the Wanx series. Choose an appropriate model based on different scenarios:

| Model Name | Features | Use Cases |
| --- | --- | --- |
| wan2.5-t2i-preview | Supports free-form dimensions (768×768 to 1440×1440); aspect ratio 1:4 to 4:1 | Creative designs requiring special dimensions (e.g., tall or wide images) |
| wan2.2-t2i-flash | 50% speed improvement, ultra-fast generation | Rapid iteration and batch generation scenarios |
| wan2.2-t2i-plus | Comprehensive stability and success rate improvements | Commercial projects and scenarios with high quality requirements |
| wanx2.1-t2i-turbo | Fast generation speed | Rapid prototyping and preview effects |
| wanx2.1-t2i-plus | High-quality image generation | Professional design and refined creation |
| wanx2.0-t2i-turbo | Classic version, stable and reliable | General image generation needs |
Configuration parameters

The following parameters can be configured in the configuration panel:

  • Positive prompt (prompt) *: Describes the desired image content, supports variable insertion
  • Negative prompt (negative_prompt): Specifies elements to avoid
  • Image size (size) *: Size of the generated image, e.g., 1024×1024, 1280×720
  • Generation count (n): Number of images to generate in one batch, range 1-4
  • Random seed (seed): Fixed seed for reproducible results
  • Watermark setting (watermark): Whether to add a watermark to generated images
  • Intelligent prompt rewriting (prompt_extend): Uses LLM to optimize prompts, significantly improves results for short prompts
tip

Different models support slightly different parameters. The configuration panel automatically displays available parameters based on the selected model. For detailed parameter descriptions, refer to the Tongyi Wanxiang Text-to-Image Official Documentation.
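The parameter constraints above can be captured in a small request-building helper. This is an illustrative sketch, not a JitAi or Bailian API: the function name and payload shape are hypothetical, and only the parameter names and ranges mirror the configuration panel.

```python
# Illustrative sketch: assemble and validate a text-to-image request payload.
# The helper name and dict layout are hypothetical; the parameter names
# (prompt, size, n, seed, ...) and the n range 1-4 mirror the panel above.

def build_t2i_payload(prompt, size="1024*1024", n=1, negative_prompt=None,
                      seed=None, watermark=False, prompt_extend=True):
    if not prompt:
        raise ValueError("prompt is required")
    if not 1 <= n <= 4:
        raise ValueError("n must be between 1 and 4")
    payload = {
        "prompt": prompt,
        "size": size,
        "n": n,
        "watermark": watermark,
        "prompt_extend": prompt_extend,
    }
    # Optional parameters are omitted when unset, so the model's defaults apply.
    if negative_prompt:
        payload["negative_prompt"] = negative_prompt
    if seed is not None:
        payload["seed"] = seed
    return payload

payload = build_t2i_payload("a watercolor fox in a glowing forest", n=2)
```

Validating ranges before the call fails fast locally instead of waiting for a rejected generation request.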

Text-to-video parameters

Model selection

The system supports 4 text-to-video models from the Wanx series. Choose an appropriate model based on different needs:

| Model Name | Audio Support | Resolution | Duration | Features | Use Cases |
| --- | --- | --- | --- | --- | --- |
| wan2.5-t2v-preview | ✅ Audio video | 480P/720P/1080P | 5s/10s | Supports auto audio and custom audio | Promotional videos and educational content requiring audio-visual synchronization |
| wan2.2-t2v-plus | ❌ Silent video | 480P/1080P | 5s | 50% stability improvement, faster speed | Commercial projects and batch generation |
| wanx2.1-t2v-turbo | ❌ Silent video | 480P/720P | 5s | Fast generation speed | Rapid iteration and prototyping |
| wanx2.1-t2v-plus | ❌ Silent video | 720P | 5s | High-quality video generation | Refined creation and professional design |
Audio video

Only the wan2.5-t2v-preview model supports audio functionality (auto audio or custom audio). Other models generate silent videos.

Configuration parameters

The following parameters can be configured in the configuration panel:

  • Positive prompt (prompt) *: Describes the desired video content, should include scene, action, camera angles, and other elements
  • Negative prompt (negative_prompt): Specifies elements or effects to avoid
  • Video resolution (size): e.g., 1280×720 (720P), 1920×1080 (1080P)
  • Video duration (duration): 5 seconds or 10 seconds (wan2.5 supports 10 seconds, other models only 5 seconds)
  • Custom audio (audio_url): Upload custom audio file (5-12 seconds), only wan2.5 supported
  • Auto audio (audio): Model automatically generates matching background audio, only wan2.5 supported
  • Watermark setting (watermark): Whether to add a watermark to generated videos
  • Intelligent prompt rewriting (prompt_optimizer): Uses LLM to optimize prompts, improving generation results
Important notes
  • Text-to-video uses asynchronous calls and typically takes 1-5 minutes
  • The returned video_url is retained for only 24 hours. Please download and store it in permanent storage promptly
  • Billing is based on video duration in seconds, and charges apply only when the task succeeds
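Because the call is asynchronous, backend code that consumes the result typically polls the task until it succeeds, then downloads the video before the 24-hour URL expiry. Below is a minimal polling sketch with a stubbed status check standing in for the real task query; all function names here are hypothetical, not platform API.

```python
import time

# Illustrative sketch of the poll-then-download pattern for asynchronous
# video generation. fetch_status is a caller-supplied stub; in real code
# it would query the generation task's status.

def wait_for_video(fetch_status, poll_interval=0.0, max_polls=100):
    for _ in range(max_polls):
        status = fetch_status()
        if status["task_status"] == "SUCCEEDED":
            # The returned video_url is retained for only 24 hours:
            # download it immediately and move it to permanent storage.
            return status["video_url"]
        if status["task_status"] == "FAILED":
            raise RuntimeError("video generation failed")
        time.sleep(poll_interval)
    raise TimeoutError("gave up waiting for the video task")

# Stub that succeeds on the third poll, simulating a 1-5 minute task.
_states = iter(["PENDING", "RUNNING", "SUCCEEDED"])
url = wait_for_video(lambda: {"task_status": next(_states),
                              "video_url": "https://example.com/v.mp4"})
```

In production code the poll interval would be seconds, not zero, and the downloaded file would be written to durable storage rather than kept as a URL.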
tip

Different models support slightly different parameters. The configuration panel automatically displays available parameters based on the selected model. For detailed parameter descriptions, refer to the Tongyi Wanxiang Text-to-Video Official Documentation.

Image-to-image parameters

Model selection

The system supports image-to-image models from the Wanx series:

| Model Name | Features | Use Cases |
| --- | --- | --- |
| wan2.5-i2i-preview | Supports single-image editing and multi-reference image generation; maintains the input image's aspect ratio | Image style transfer, element replacement, multi-image fusion creation |
Configuration parameters

The following parameters can be configured in the configuration panel:

  • Input images (images) *:
    • Single-image editing: Input a single image URL or select a variable
    • Multi-reference image generation: Input a JitList variable or multiple image URLs
  • Positive prompt (prompt) *: Describes the desired image editing effect or generated content, supports variable insertion
  • Negative prompt (negative_prompt): Specifies elements to avoid
  • Generation count (n): Number of images to generate in one batch, range 1-4
  • Random seed (seed): Fixed seed for reproducible results, range 0-4294967290
  • Watermark setting (watermark): Whether to add a watermark to generated images
Important notes
  • Image-to-image uses asynchronous calls and typically takes 1-2 minutes
  • The returned image URLs are retained for only 24 hours. Please download and store them in permanent storage promptly
  • Billing is based on the number of successfully generated images
tip

For detailed parameter descriptions, refer to the Tongyi Wanxiang Image-to-Image Official Documentation.
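Since the images parameter accepts either a single image URL (single-image editing) or a list of reference images (multi-reference generation), a small normalization helper keeps both call styles on one code path. This is an illustrative sketch with a hypothetical helper name, not a platform API.

```python
# Illustrative sketch: normalize the `images` input so single-image editing
# and multi-reference generation share one code path. Hypothetical helper.

def normalize_images(images):
    if isinstance(images, str):
        images = [images]      # single-image editing: one URL
    images = list(images)      # multi-reference generation: list of URLs
    if not images:
        raise ValueError("at least one input image is required")
    return images

single = normalize_images("https://example.com/cat.png")
multi = normalize_images(["https://example.com/a.png",
                          "https://example.com/b.png"])
```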

Image-to-video parameters

Model selection

The system supports multiple image-to-video models from the Wanx series. Choose an appropriate model based on different needs:

| Model Name | Audio Support | Resolution | Duration | Features | Use Cases |
| --- | --- | --- | --- | --- | --- |
| wan2.5-i2v-preview | ✅ Audio video | 480P/720P/1080P | 5s/10s | Supports audio and effect templates | Animating static images, product showcases |
| wan2.2-i2v-plus | ❌ Silent video | 480P/1080P | 5s | Improved stability and faster speed | Commercial projects and batch generation |
| wan2.2-i2v-flash | ❌ Silent video | 480P/720P/1080P | 5s | Ultra-fast generation | Quick preview and prototyping |
| wanx2.1-i2v-turbo | ❌ Silent video | 480P/720P | 5s | Fast generation speed | Rapid iteration |
| wanx2.1-i2v-plus | ❌ Silent video | 720P | 5s | High-quality video generation | Refined creation |
Audio video

Only the wan2.5-i2v-preview model supports audio functionality (auto audio or custom audio). Other models generate silent videos.

Configuration parameters

The following parameters can be configured in the configuration panel:

  • First frame image URL (img_url) *: Starting frame image for the video, supports URL or variable
  • Effect template (template): Select preset animation effects, options include:
    • General effects: Magic floating (flying), stress relief squeeze (squish), spinning circle (rotation), poke fun (poke), balloon inflation (balloon), giving roses (rose), crystal rose (crystalrose)
    • Single-person effects: Funky dance (dance1), midnight disco (dance2), star shake moment (dance3), finger rhythm (dance4), dance switch (dance5), mermaid awakening (mermaid), academic coronation (graduation), giant beast pursuit (dragon), money falling from sky (money), jellyfish encounter (jellyfish), pupil travel (pupil)
    • Two-person effects: Loving hug (hug), French kiss (frenchkiss), double heartbeat (coupleheart)
  • Positive prompt (prompt): Describes the desired video animation effect (optional when using effect templates)
  • Negative prompt (negative_prompt): Specifies elements or effects to avoid
  • Video resolution (resolution): e.g., 480P, 720P, 1080P
  • Video duration (duration): 5 seconds or 10 seconds (wan2.5 supports 10 seconds, other models only 5 seconds)
  • Custom audio (audio_url): Upload custom audio file (5-12 seconds), only wan2.5 supported
  • Auto audio (audio): Model automatically generates matching background audio, only wan2.5 supported
  • Watermark setting (watermark): Whether to add a watermark to generated videos
  • Intelligent prompt rewriting (prompt_optimizer): Uses LLM to optimize prompts, improving generation results
Important notes
  • Image-to-video uses asynchronous calls and typically takes 1-5 minutes
  • The returned video_url is retained for only 24 hours. Please download and store it in permanent storage promptly
  • Billing is based on video duration in seconds, and charges apply only when the task succeeds
tip

Different models support slightly different parameters. The configuration panel automatically displays available parameters based on the selected model. For detailed parameter descriptions, refer to the Tongyi Wanxiang Image-to-Video Official Documentation.

Keyframe-to-video parameters

Model selection

The system supports keyframe-to-video models from the Wanx series:

| Model Name | Resolution | Duration | Features | Use Cases |
| --- | --- | --- | --- | --- |
| wan2.2-kf2v-flash | 480P/720P/1080P | 5s | Ultra-fast generation, supports effect templates | Transition animations, state change demonstrations |
| wanx2.1-kf2v-plus | 720P | 5s | High-quality video generation | Refined creation and professional design |
Configuration parameters

The following parameters can be configured in the configuration panel:

  • First frame image URL (first_frame_image_url) *: Starting frame image for the video, supports URL or variable
  • Last frame image URL (last_frame_image_url): Ending frame image for the video, supports URL or variable (optional)
  • Effect template (template): Select preset animation effects, options include:
    • Tang dynasty elegance (hanfu-1)
    • Mecha transformation (solaron)
    • Shining cover (magazine)
    • Mechanical awakening (mech1)
    • Cyber appearance (mech2)
  • Positive prompt (prompt): Describes the desired video transition effect (optional when using effect templates)
  • Negative prompt (negative_prompt): Specifies elements or effects to avoid
  • Video resolution (resolution): e.g., 480P, 720P, 1080P (wan2.2-kf2v-flash supported) or 720P (wanx2.1-kf2v-plus)
  • Video duration (duration): Fixed at 5 seconds
  • Watermark setting (watermark): Whether to add a watermark to generated videos
  • Intelligent prompt rewriting (prompt_optimizer): Uses LLM to optimize prompts, improving generation results
Important notes
  • Keyframe-to-video uses asynchronous calls and typically takes 1-5 minutes
  • The returned video_url is retained for only 24 hours. Please download and store it in permanent storage promptly
  • Billing is based on video duration in seconds, and charges apply only when the task succeeds
  • If no last frame image is provided, the system will generate the video based only on the first frame
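The first frame is required while the last frame is optional; when the last frame is omitted, the video is generated from the first frame alone. That branching can be sketched as a small payload builder. The helper name and payload shape below are hypothetical; only the parameter keys and the fixed 5-second duration mirror the list above.

```python
# Illustrative sketch: assemble a keyframe-to-video request. The helper name
# and payload shape are hypothetical; the keys mirror the parameter list.

def build_kf2v_payload(first_frame_image_url, last_frame_image_url=None,
                       resolution="720P", prompt=None, watermark=False):
    if not first_frame_image_url:
        raise ValueError("first_frame_image_url is required")
    payload = {
        "first_frame_image_url": first_frame_image_url,
        "resolution": resolution,
        "duration": 5,  # duration is fixed at 5 seconds
        "watermark": watermark,
    }
    # Optional: without a last frame, generation uses only the first frame.
    if last_frame_image_url:
        payload["last_frame_image_url"] = last_frame_image_url
    if prompt:
        payload["prompt"] = prompt
    return payload

p = build_kf2v_payload("https://example.com/start.png")
```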
tip

For detailed parameter descriptions, refer to the Tongyi Wanxiang Keyframe-to-Video Official Documentation.

OpenAI

Coming soon...
