• Splits a chunk that exceeds a token limit into smaller sub-chunks

    Uses a simple recursive strategy: split the chunk in half until every piece fits under the token limit. This prevents chunks from being rejected by the embedding API due to size constraints.
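
    A minimal sketch of that halving strategy, assuming a simplified FileChunk with only a content field and a token count estimated as content.length / charsPerToken; the hypothetical splitChunkSketch below stands in for split_chunk and is not the actual implementation:

    // Hypothetical, simplified FileChunk; the real interface has more fields.
    interface FileChunk {
      content: string;
    }

    // Recursively halve a chunk until every piece fits the estimated token budget.
    function splitChunkSketch(
      chunk: FileChunk,
      maxTokens: number,
      charsPerToken = 2.5,
    ): FileChunk[] {
      const estimatedTokens = Math.ceil(chunk.content.length / charsPerToken);
      if (estimatedTokens <= maxTokens || chunk.content.length <= 1) {
        return [chunk]; // Already fits (or cannot be split any further)
      }
      // Split the content in half and recurse on each half
      const mid = Math.ceil(chunk.content.length / 2);
      const left: FileChunk = { content: chunk.content.slice(0, mid) };
      const right: FileChunk = { content: chunk.content.slice(mid) };
      return [
        ...splitChunkSketch(left, maxTokens, charsPerToken),
        ...splitChunkSketch(right, maxTokens, charsPerToken),
      ];
    }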

    Parameters

    • chunk: FileChunk

      Chunk to split

    • maxTokens: number

      Maximum tokens allowed per chunk

    • charsPerToken: number = 2.5

      Estimated characters per token (default: 2.5 for code)

    Returns FileChunk[]

    Array of sub-chunks that each fit within the token limit

    // Split a chunk that's too large
    const oversized = { content: "very long content...", /* other FileChunk fields omitted */ };
    const subChunks = split_chunk(oversized, 6000, 2.5);
    // Returns an array of sub-chunks, each estimated to be under the 6000-token limit