Efficient Transformers Library (quic/efficient-transformers)
This library provides reimplemented blocks of LLMs to make models functional and highly performant on Qualcomm Cloud AI 100. Several models can be transformed directly from their pre-trained original form to a deployment-ready optimized form. For other models, comprehensive documentation and How-To guides describe the changes needed. Each Pull Request must include tests covering the model support or utility being changed. 📝 Note: if using a ZSH terminal, the device_group argument should be wrapped in single quotes, e.g. --device_group '[0]'.
The QEfficient library was designed with one goal: to make onboarding of models for inference straightforward for any Transformer architecture, while leveraging the complete power of the Cloud AI platform. To achieve this, we provide two levels of APIs with different levels of abstraction (a minimal usage sketch of the high-level flow follows the news list below).

[04/2025] Support for SpD with multiprojection heads: implemented post-attention hidden-size projections to speculate tokens ahead of the base model.
[04/2025] QNN compilation support for AutoModel classes: QNN compilation capabilities for multi-model, embedding, and causal models.
[04/2025] Added support for separate prefill and decode compilation for encoder (vision) and language models; this feature will be utilized for disaggregated serving.
[04/2025] SwiftKV support for both continuous and non-continuous batching execution.
[04/2025] Support for GGUF model execution (without quantized weights).
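As a rough illustration of the high-level flow, the sketch below loads a HuggingFace checkpoint through QEfficient's auto class, compiles it for Cloud AI 100, and runs generation. This is a minimal sketch based on the library's HuggingFace-style interface; the exact compile and generate parameters may differ between releases, so check the repository's README for the authoritative signatures.

```python
# Minimal sketch of the high-level API; parameter names follow the
# repository's README examples, but verify against the installed version.
from QEfficient import QEFFAutoModelForCausalLM

# Load a pretrained HF checkpoint through the QEfficient auto class;
# model transformations are applied under the hood.
model = QEFFAutoModelForCausalLM.from_pretrained("gpt2")

# Compile for Cloud AI 100 (produces a device binary).
model.compile(num_cores=14)

# Run inference on the compiled model.
model.generate(prompts=["Hello, my name is"])
```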
QEfficient is a Python library designed for optimizing transformer models and deploying them efficiently on Qualcomm Cloud AI 100 hardware. This library provides a seamless path from pre-trained HuggingFace models to production-ready inference on specialized AI accelerators. The library abstracts the complexity of model transformation, ONNX export, hardware compilation, and inference execution while maintaining compatibility with the familiar HuggingFace transformers interface. It supports text generation models, embedding models, vision-language models, and speech-to-text models with advanced features like continuous batching, speculative decoding, and parameter-efficient fine-tuning. For specific model architecture support, see Model Types and Architectures. For installation and environment setup, see Installation and Setup.
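To make the advanced-features claim more concrete, here is a hedged sketch of what continuous batching might look like through this interface. The continuous_batching flag and full_batch_size parameter shown here are assumptions made for illustration, not confirmed API; consult the library documentation for the actual option names.

```python
# Hypothetical sketch of continuous batching; the continuous_batching
# and full_batch_size names are assumptions, not confirmed API.
from QEfficient import QEFFAutoModelForCausalLM

model = QEFFAutoModelForCausalLM.from_pretrained(
    "gpt2",
    continuous_batching=True,  # assumed flag enabling continuous batching
)
model.compile(
    num_cores=14,
    full_batch_size=4,  # assumed: number of sequences batched concurrently
)
model.generate(prompts=["First prompt", "Second prompt", "Third prompt"])
```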
For detailed API documentation, see Core Architecture. QEfficient follows a layered architecture that transforms standard transformer models through multiple stages to achieve optimal performance on Qualcomm AI 100 hardware; a stage-by-stage sketch follows below. Sources: QEfficient/transformers/models/modeling_auto.py (lines 61–102), QEfficient/base/modeling_qeff.py (lines 39–80), README.md (lines 61–83). This library empowers users to seamlessly port pretrained models and checkpoints from the HuggingFace (HF) hub (developed using the HF transformers library) into inference-ready formats that run efficiently on Qualcomm Cloud AI 100 accelerators. The Efficient Transformers Library provides reimplemented blocks of Large Language Models (LLMs) to make models functional and highly performant on Qualcomm Cloud AI 100. It includes graph transformations, handling for under-flows and overflows, patcher modules, an exporter module, sample applications, and unit test templates.
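The staged pipeline can also be driven explicitly rather than through a single compile-and-run call. The sketch below separates the stages; the export method and its return value are assumptions based on the QEFFBaseModel interface in the modeling_qeff.py file referenced above, so treat the details as illustrative.

```python
# Illustrative breakdown of the layered pipeline; method behavior beyond
# from_pretrained/compile/generate is an assumption based on the
# QEFFBaseModel interface referenced above.
from QEfficient import QEFFAutoModelForCausalLM

# Stage 1: load the HF checkpoint; QEfficient applies its model
# transformations (reimplemented blocks, KV-cache handling) on load.
model = QEFFAutoModelForCausalLM.from_pretrained("gpt2")

# Stage 2: export the transformed PyTorch graph to ONNX.
onnx_path = model.export()  # assumed: returns the exported ONNX path

# Stage 3: compile the ONNX graph into a binary for the AI 100 device.
qpc_path = model.compile(num_cores=14)  # assumed: returns the binary path

# Stage 4: execute on hardware.
model.generate(prompts=["Hello"])
```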
The library supports seamless inference on pre-trained LLMs, with documentation covering model optimization and deployment. Contributions and suggestions are welcome, with a focus on testing changes for model support and common utilities.
Inference And Execution

This section covers the inference and execution system in QEfficient, which provides the final runtime stage for executing optimized transformer models on Qualcomm AI 100 hardware. The inference system handles end-to-end workflows from model loading through hardware execution, supporting multiple model types and deployment scenarios. For information about model compilation prior to execution, see Export and Compilation. For details about the CLI tools and cloud interface specifically, see CLI Tools and Cloud Interface. For text generation performance optimization techniques, see Text Generation and Performance Optimization.

The QEfficient inference system orchestrates the complete pipeline from model loading to hardware execution. The workflow automatically handles model detection, optimization, compilation, and execution based on the input parameters (a hedged command-line sketch follows below). Sources: QEfficient/cloud/infer.py (lines 93–242). The inference system consists of several key components that handle different aspects of model execution and hardware interaction.
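As a concrete illustration of the infer entry point, the invocation below sketches how a model might be run end-to-end from the command line. The flag set shown follows the repository's documented examples and the device_group quoting note above, but treat the exact flags and defaults as assumptions and confirm them with the tool's help output.

```sh
# Hedged sketch of the end-to-end CLI workflow via QEfficient.cloud.infer.
# Flags follow the repository's examples; confirm with
# `python -m QEfficient.cloud.infer --help` before relying on them.
# Note the single quotes around [0], required under ZSH.
python -m QEfficient.cloud.infer \
    --model_name gpt2 \
    --batch_size 1 \
    --prompt_len 32 \
    --ctx_len 128 \
    --num_cores 14 \
    --device_group '[0]' \
    --prompt "My name is"
```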
Installation And Setup

This section covers the installation and environment setup process for QEfficient, including system requirements, dependency management, and hardware configuration for Qualcomm AI 100 deployment. It focuses on the initial setup required before using QEfficient's core functionality. For information about the core architecture and model classes after installation, see Core Architecture. For specific CLI usage and cloud deployment workflows, see CLI Tools and Cloud Interface. QEfficient requires specific hardware and software prerequisites to function with Qualcomm AI 100 accelerators, and it has a complex dependency structure optimized for different hardware platforms and Python versions. Sources: docs/source/installation.md (lines 1–44), pyproject.toml (lines 20–47).
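As a hedged sketch of a typical setup, the commands below create a virtual environment and install the library from the public GitHub repository. The required Python version and any platform SDK steps (the Cloud AI 100 Platform/Apps SDK is installed separately) are assumptions to verify against docs/source/installation.md.

```sh
# Hedged installation sketch; verify the Python version and any
# Platform/Apps SDK prerequisites against docs/source/installation.md.
python3 -m venv qeff_env
source qeff_env/bin/activate
pip install -U pip

# Install QEfficient from the public GitHub repository.
pip install git+https://github.com/quic/efficient-transformers
```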