All functions in this document are safe to call from the main thread, and all callbacks run on the main thread, unless explicitly noted otherwise.

ModelRunner

A ModelRunner represents a loaded model instance that creates conversations and drives generation.
interface ModelRunner {
  fun createConversation(systemPrompt: String? = null): Conversation
  fun createConversationFromHistory(history: List<ChatMessage>): Conversation
  suspend fun unload()
  fun generateFromConversation(
    conversation: Conversation,
    callback: GenerationCallback,
    generationOptions: GenerationOptions? = null,
  ): GenerationHandler
}

Lifecycle

  • Create conversations using createConversation(systemPrompt:) or createConversationFromHistory(history:).
  • Hold a strong reference to the ModelRunner for as long as you need to perform generations. If the model runner is destroyed, any conversations it created will fail to generate.
  • Call unload() when you are done to release native resources.
  • On iOS, unload() is async, and cleanup also happens automatically on deinit; access modelId to identify the loaded model.
  • On Android, if you need your model runner to survive after the destruction of activities, you may need to wrap it in an Android Service.
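The lifecycle above can be sketched as follows. This is a hypothetical sketch: it assumes a modelRunner already obtained from the SDK's model-loading API, which is not covered in this section.

```kotlin
// Sketch of the lifecycle described above. `ChatSession` is an illustrative
// wrapper; `modelRunner` comes from the SDK's model-loading API.
class ChatSession(private val modelRunner: ModelRunner) {

    // Hold a strong reference to the ModelRunner for as long as
    // generations may run; conversations fail if it is destroyed.
    val conversation: Conversation =
        modelRunner.createConversation(systemPrompt = "You are a helpful assistant.")

    // Call when the session is finished to release native resources.
    suspend fun close() {
        modelRunner.unload()
    }
}
```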

Low-level generation API

Both platforms expose a lower-level generation method that returns a GenerationHandler for cancellation. Most apps should use the higher-level streaming helpers on Conversation, but you can invoke this method directly when you need fine-grained control.
generateFromConversation(...) is primarily an internal interface for the model runner implementation. Conversation.generateResponse is the recommended wrapper; it is built on Kotlin coroutines and integrates with lifecycle-aware components.
generateFromConversation may block the calling thread. If you must use it, call it off the main thread.

GenerationHandler

The handler returned by the low-level generation API or Conversation.generateResponse lets you cancel generation without tearing down the conversation.
interface GenerationHandler {
  fun stop()
}

Conversation

Conversation tracks chat state and provides streaming helpers built on top of the model runner. Instances should always be created from a ModelRunner, not initialized directly.
interface Conversation {
  val history: List<ChatMessage>
  val isGenerating: Boolean

  fun generateResponse(
    userTextMessage: String,
    generationOptions: GenerationOptions? = null
  ): Flow<MessageResponse>

  fun generateResponse(
    message: ChatMessage,
    generationOptions: GenerationOptions? = null
  ): Flow<MessageResponse>

  fun registerFunction(function: LeapFunction)
  fun exportToJSONArray(): JSONArray
}

Properties

  • history β€” Returns a copy of the accumulated chat messages. The SDK appends the assistant reply when a generation finishes successfully. During an ongoing generation, the partial message may not be present.
  • isGenerating β€” true while a generation is running. In Kotlin, its value is consistent across all threads. In Swift, attempting to start a new generation while this is true immediately finishes with an empty stream (or a nil handler for the callback variant).
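A minimal illustration of using these properties from a ViewModel (a sketch; conversation, viewModelScope, and TAG are assumed to exist in the surrounding class):

```kotlin
fun onSendClicked(userInput: String) {
    // isGenerating guards against starting overlapping requests.
    if (conversation.isGenerating) return

    viewModelScope.launch {
        conversation.generateResponse(userInput).collect { /* handle MessageResponse */ }
        // After a successful generation, history includes the assistant reply.
        Log.d(TAG, "Messages so far: ${conversation.history.size}")
    }
}
```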

Streaming Generation

The primary pattern for generating responses is to collect the stream returned by generateResponse.
The return value is a Kotlin asynchronous Flow. Generation does not start until the flow is collected. Refer to the Android documentation on how to properly handle flows with lifecycle-aware components.
viewModelScope.launch {
  conversation.generateResponse(userInput)
    .onEach { response ->
      when (response) {
        is MessageResponse.Chunk -> {
          generatedText += response.text
        }
        is MessageResponse.ReasoningChunk -> {
          Log.d(TAG, "Reasoning: ${response.reasoning}")
        }
        is MessageResponse.FunctionCalls -> {
          handleFunctionCalls(response.functionCalls)
        }
        is MessageResponse.AudioSample -> {
          audioRenderer.enqueue(response.samples, response.sampleRate)
        }
        is MessageResponse.Complete -> {
          Log.d(TAG, "Generation is done!")
        }
      }
    }
    .catch { e -> Log.e(TAG, "Generation failed", e) }
    .collect()
}
Errors are thrown into the stream as LeapGenerationException; use .catch to handle them.
If a generation is already running, further generation requests wait until the current generation is done. However, the order in which queued requests are processed is not guaranteed to match the order in which they were made.
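Because the processing order of queued requests is not guaranteed, an application that needs strict ordering can serialize requests itself. A sketch using a kotlinx.coroutines Channel (an application-level pattern, not part of the SDK):

```kotlin
// A single consumer coroutine drains the channel, so inputs are processed
// strictly in submission order, one generation at a time.
private val pendingInputs = Channel<String>(Channel.UNLIMITED)

init {
    viewModelScope.launch {
        for (input in pendingInputs) {
            // Each generation runs to completion before the next one starts.
            conversation.generateResponse(input).collect { /* handle MessageResponse */ }
        }
    }
}

fun submit(userInput: String) {
    pendingInputs.trySend(userInput)
}
```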

Callback Convenience (Swift only)

On Swift, use generateResponse(message:onResponse:) when you prefer callbacks or need to integrate with imperative UI components:
let handler = conversation.generateResponse(message: user) { response in
  updateUI(with: response)
}

// Later
handler?.stop()
If a generation is already running, the method returns nil and emits a .complete message with finishReason == .stop via the callback.
The callback overload does not surface generation errors. Use the async-stream helper or call ModelRunner.generateResponse with onErrorCallback when you need error handling.

Function Registration

Register functions for the model to invoke during generation. See the Function Calling guide for detailed usage.

Export Chat History

Export the conversation history into a serialized format that mirrors OpenAI’s chat-completions schema. Useful for persistence, analytics, or debugging.
exportToJSONArray() returns a JSONArray. Each element can be interpreted as a ChatCompletionRequestMessage instance in the OpenAI API schema. See also: Serialization Support.
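For example, persisting the history to app-private storage might look like the following sketch (the file name and context are illustrative; restoring requires parsing the JSON back into ChatMessage objects, see Serialization Support):

```kotlin
// Serialize the conversation to OpenAI-style chat-completions JSON
// and write it to app-private storage.
val json: JSONArray = conversation.exportToJSONArray()
File(context.filesDir, "chat_history.json").writeText(json.toString())
```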

Cancellation

Generation stops when the coroutine Job that collects the flow is cancelled. We highly recommend using a ViewModel with viewModelScope to manage the generation lifecycle. The generation will be automatically cancelled when the ViewModel is cleared.
import android.app.Application
import android.util.Log
import androidx.lifecycle.AndroidViewModel
import androidx.lifecycle.viewModelScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.Job
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.flow.asStateFlow
import kotlinx.coroutines.flow.collect
import kotlinx.coroutines.flow.onEach
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

class ChatViewModel(application: Application) : AndroidViewModel(application) {
    private var conversation: Conversation? = null
    private var modelRunner: ModelRunner? = null
    private var generationJob: Job? = null

    private val _generatedText = MutableStateFlow("")
    val generatedText: StateFlow<String> = _generatedText.asStateFlow()

    fun generateResponse(userInput: String) {
        generationJob = viewModelScope.launch {
            _generatedText.value = ""
            conversation?.generateResponse(userInput)
                ?.onEach { response ->
                    when (response) {
                        is MessageResponse.Chunk -> {
                            _generatedText.value += response.text
                        }
                        is MessageResponse.Complete -> {
                            Log.d(TAG, "Generation is done!")
                        }
                        else -> {}
                    }
                }
                ?.collect()
        }
    }

    fun stopGeneration() {
        generationJob?.cancel()
        generationJob = null
    }

    override fun onCleared() {
        super.onCleared()
        generationJob?.cancel()

        // Use runBlocking to ensure model is unloaded before ViewModel is destroyed
        // viewModelScope is cancelled during clearing, so we need a non-cancelled context
        runBlocking(Dispatchers.IO) {
            modelRunner?.unload()
        }
    }

    companion object {
        private const val TAG = "ChatViewModel"
    }
}

MessageResponse

The response emitted during generation. Text is streamed as chunks, with a final completion signal when the model finishes.
sealed interface MessageResponse {
  class Chunk(val text: String) : MessageResponse
  class ReasoningChunk(val reasoning: String) : MessageResponse
  class FunctionCalls(val functionCalls: List<LeapFunctionCall>) : MessageResponse
  class AudioSample(val samples: FloatArray, val sampleRate: Int) : MessageResponse
  class Complete(
    val fullMessage: ChatMessage,
    val finishReason: GenerationFinishReason,
    val stats: GenerationStats?,
  ) : MessageResponse
}

Response types

  • Chunk β€” Partial assistant text emitted during streaming.
  • ReasoningChunk β€” Model reasoning tokens (only for models that expose reasoning traces, wrapped between <think> / </think>).
  • FunctionCalls β€” One or more function/tool invocations requested by the model. See the Function Calling guide.
  • AudioSample β€” PCM audio frames streamed from audio-capable checkpoints. Feed them into an audio renderer or buffer them for later playback. The sample rate remains constant throughout a generation.
  • Complete β€” Signals the end of generation. Access the assembled assistant reply through the full message. The finishReason indicates why generation stopped (STOP means the model decided to stop; EXCEED_CONTEXT means the maximum context length was reached). The optional stats field contains generation statistics.
Errors during streaming are delivered through the thrown error of AsyncThrowingStream (Swift) or as LeapGenerationException in the Flow (Kotlin).
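For instance, a collector might branch on the finish reason when handling Complete. A sketch, assuming enum constant names matching the STOP / EXCEED_CONTEXT values described above:

```kotlin
when (response) {
    is MessageResponse.Complete -> {
        when (response.finishReason) {
            GenerationFinishReason.STOP ->
                Log.d(TAG, "Model finished: ${response.fullMessage}")
            GenerationFinishReason.EXCEED_CONTEXT ->
                Log.w(TAG, "Stopped: maximum context length reached")
        }
        // stats is optional; it may be null on some checkpoints.
        response.stats?.let { Log.d(TAG, "Speed: ${it.tokenPerSecond} tok/s") }
    }
    else -> { /* handle streaming variants */ }
}
```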

GenerationStats

Statistics about a completed generation.
data class GenerationStats(
  val promptTokens: Long,
  val completionTokens: Long,
  val totalTokens: Long,
  val tokenPerSecond: Float,
)

GenerationOptions

Tune generation behavior per request. Leave a field as nil/null to fall back to the defaults packaged with the model bundle.
data class GenerationOptions(
    var temperature: Float? = null,
    var topP: Float? = null,
    var minP: Float? = null,
    var repetitionPenalty: Float? = null,
    var jsonSchemaConstraint: String? = null,
    var functionCallParser: LeapFunctionCallParser? = LFMFunctionCallParser(),
) {
  fun setResponseFormatType(kClass: KClass<*>)

  companion object {
    fun build(buildAction: GenerationOptions.() -> Unit): GenerationOptions
  }
}
Fields:
  • temperature β€” Sampling temperature. Higher values produce more random output; lower values produce more focused, deterministic output.
  • topP β€” Nucleus sampling parameter. The model only considers tokens with cumulative probability mass up to topP.
  • minP β€” Minimum probability for a token to be considered during generation.
  • repetitionPenalty β€” Penalizes repeated tokens. A positive value decreases the likelihood of repeating the same line verbatim.
  • jsonSchemaConstraint β€” Enable constrained generation with a JSON Schema. See constrained generation for details.
  • functionCallParser β€” Parser for function calling requests. LFMFunctionCallParser (the default) handles Liquid Foundation Model Pythonic function calling. See the Function Calling guide for details.
Builder pattern:
val options = GenerationOptions.build {
  setResponseFormatType(MyDataType::class)
  temperature = 0.5f
}
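Options built this way are passed per request. A sketch of wiring them into a streaming generation (userInput and the surrounding scope are assumed):

```kotlin
val options = GenerationOptions.build {
    temperature = 0.3f
    topP = 0.9f
}

viewModelScope.launch {
    // Fields left null fall back to the defaults packaged with the model bundle.
    conversation.generateResponse(userInput, options).collect { /* handle MessageResponse */ }
}
```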