Serving Falcon models with 🤗 Text Generation Inference (TGI)
Run your LLM efficiently with TGI and LangChain integration
In a previous post, we saw how to run your private Falcon-7B-Instruct on a single 6 GB GPU using quantization. Today, I’ll show how to run Falcon models on-premise and in the cloud. A notebook to reproduce the steps is available here.
🤗 Text Generation Inference is a production-ready model-serving solution designed by Hugging Face to power LLM apps easily. It is built with Python, Rust, and gRPC, and is released under the Apache 2.0 license.
TGI Features
Some features listed in TGI repo are:
- Serve the most popular Large Language Models with a simple launcher
- Tensor Parallelism for faster inference on multiple GPUs
- Token streaming using Server-Sent Events (SSE)
- Continuous batching of incoming requests for increased total throughput
- Optimized transformers code for inference using flash-attention and Paged Attention on the most popular architectures
- Quantization with bitsandbytes and GPT-Q
- Safetensors weight loading
- Watermarking with A Watermark for Large Language Models
- Logits warper (temperature scaling, top-p, top-k, repetition penalty, more details see transformers.LogitsProcessor)
- Stop sequences
- Log probabilities
- Production ready (distributed tracing with Open Telemetry, Prometheus metrics)
Optimized architectures
Other architectures are supported on a best effort basis using:
AutoModelForCausalLM.from_pretrained(<model>, device_map="auto")
or
AutoModelForSeq2SeqLM.from_pretrained(<model>, device_map="auto")
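If you want to try one of these fallback architectures outside TGI first, here is a minimal sketch of the same call in plain transformers (the model id is only an illustrative example):
# Minimal sketch of the AutoModel fallback path in plain transformers
# (EleutherAI/gpt-neo-125m is used here only to illustrate the call)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-neo-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("What is Deep Learning?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=17)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))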
TGI Routers
The best way to explore TGI routers is the swagger page:
- /info — [GET] — Text Generation Inference endpoint info
- /metrics — [GET] — Prometheus metrics scrape endpoint
- /generate — [POST] — Generate tokens
- /generate_stream — [POST] — Generate a stream of tokens using Server-Sent Events
- / — [POST] — Generate tokens if stream == false or a stream of tokens if stream == true
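Once the server is running (see the Serving section below), the GET routes can be checked with a simple curl:
# endpoint info
curl 127.0.0.1:8080/info
# Prometheus metrics
curl 127.0.0.1:8080/metrics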
Serving
🤗 provides a Docker image (9.32 GB):
ghcr.io/huggingface/text-generation-inference
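You can pull it ahead of time (the latest tag is used here; pin a specific version if you prefer):
docker pull ghcr.io/huggingface/text-generation-inference:latest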
TGI Parameters
TGI configs
- model-id <MODEL_ID>
- revision <REVISION>
- sharded <SHARDED>
- num-shard <NUM_SHARD>
- quantize <QUANTIZE>
- trust-remote-code
- max-concurrent-requests <MAX_CONCURRENT_REQUESTS>
- max-best-of <MAX_BEST_OF>
- max-stop-sequences <MAX_STOP_SEQUENCES>
- max-input-length <MAX_INPUT_LENGTH>
- max-total-tokens <MAX_TOTAL_TOKENS>
- max-batch-size <MAX_BATCH_SIZE>
- waiting-served-ratio <WAITING_SERVED_RATIO>
- max-batch-total-tokens <MAX_BATCH_TOTAL_TOKENS>
- max-waiting-tokens <MAX_WAITING_TOKENS>
- port <PORT>
- shard-uds-path <SHARD_UDS_PATH>
- master-addr <MASTER_ADDR>
- master-port <MASTER_PORT>
- huggingface-hub-cache <HUGGINGFACE_HUB_CACHE>
- weights-cache-override <WEIGHTS_CACHE_OVERRIDE>
- disable-custom-kernels
- json-output
- otlp-endpoint <OTLP_ENDPOINT>
- cors-allow-origin <CORS_ALLOW_ORIGIN>
- watermark-gamma <WATERMARK_GAMMA>
- watermark-delta <WATERMARK_DELTA>
- env
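The launcher itself prints a description of every option, which is handy as a reference:
docker run ghcr.io/huggingface/text-generation-inference:latest --help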
Wow, that’s a lot of options 😬, but that’s a good thing; excellent work by the 🤗 devs 👨‍💻. Let’s set the parameters we will use as shell variables:
# share a volume with the Docker container to avoid
# downloading weights every run
volume=$PWD/data
# model on 🤗 Hub
model=tiiuae/falcon-7b-instruct
# apply quantization to reduce GPU memory consumption
quantize=bitsandbytes
Falcon-7B models do not support sharding (Tensor Parallelism):
num_shard=1
Observations (updated 06/15/23)
1. It may be necessary to add
--trust-remote-code
to the Docker args, although this was theoretically solved in https://github.com/huggingface/text-generation-inference/pull/396
2. For Falcon-40B models, which do support sharding, you should add the following flags (see the sketch after this list):
--sharded true --num-shard NUM_GPUS
3. The default bitsandbytes quantization was designed for training, not inference, so it is expected to be slower than unquantized inference. GPT-Q and SpQR quantization are already being added and should improve inference speed.
4. Volta GPUs (e.g., V100) do not support Falcon models.
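A sketch of item 2, assuming a machine with 4 GPUs serving Falcon-40B-Instruct (adjust --num-shard to your hardware):
model=tiiuae/falcon-40b-instruct
docker run --gpus all --shm-size 1g -p 8080:80 \
-v $PWD/data:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id $model --sharded true --num-shard 4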
Updates (07/01/23)
TGI v0.9.0 now supports:
- PagedAttention (vLLM)
- GPT-Q (quantization)
Updates (07/13/23)
Falcon models are now officially supported by Hugging Face, so trust-remote-code is no longer necessary.
Updates (07/18/23)
TGI supports LLaMA 2 models and integrates Flash Attention V2.
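A sketch of serving a LLaMA 2 chat model, assuming you have been granted access to the gated meta-llama repo and pass your Hub token to the container:
model=meta-llama/Llama-2-7b-chat-hf
docker run --gpus all --shm-size 1g -p 8080:80 \
-v $PWD/data:/data \
-e HUGGING_FACE_HUB_TOKEN=<your_hub_token> \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id $model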
Run in On-premise environment
You will need to configure NVIDIA Container Toolkit to use GPUs.
docker run --gpus all --shm-size 1g -p 8080:80 \
-v $volume:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id $model --num-shard $num_shard \
--quantize $quantize
Bash tests
curl 127.0.0.1:8080/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
-H 'Content-Type: application/json'
curl 127.0.0.1:8080/generate_stream \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
-H 'Content-Type: application/json'
curl 127.0.0.1:8080/ \
-X POST \
-d '{"inputs":"What is Deep Learning?",
"parameters":{"max_new_tokens":17},
"stream": True}' \
-H 'Content-Type: application/json'
TGI Client
!pip install text-generation
from text_generation import Client
# Generate
client = Client("http://127.0.0.1:8080")
print(client.generate("What is Deep Learning?", max_new_tokens=17).generated_text)
# Generate stream
text = ""
for response in client.generate_stream("What is Deep Learning?", max_new_tokens=17):
    if not response.token.special:
        text += response.token.text
print(text)
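The text-generation package also ships an asynchronous client; a minimal sketch against the same local endpoint:
import asyncio
from text_generation import AsyncClient

async def main():
    client = AsyncClient("http://127.0.0.1:8080")
    # single request
    response = await client.generate("What is Deep Learning?", max_new_tokens=17)
    print(response.generated_text)
    # streaming request
    text = ""
    async for response in client.generate_stream("What is Deep Learning?", max_new_tokens=17):
        if not response.token.special:
            text += response.token.text
    print(text)

asyncio.run(main())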
LangChain
# Workaround for a locale error in Colab
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!pip install langchain transformers
# LangChain wrapper around the TGI server
from langchain.llms import HuggingFaceTextGenInference
inference_server_url_local = "http://127.0.0.1:8080"
llm_local = HuggingFaceTextGenInference(
inference_server_url=inference_server_url_local,
max_new_tokens=400,
top_k=10,
top_p=0.95,
typical_p=0.95,
temperature=0.7,
repetition_penalty=1.03,
)
from langchain import PromptTemplate, LLMChain
template = """Question: {question}
Answer: Let's think step by step."""
prompt = PromptTemplate(
template=template,
input_variables= ["question"]
)
llm_chain_local = LLMChain(prompt=prompt, llm=llm_local)
llm_chain_local("your question")
RunPod
RunPod is a cloud computing platform, primarily designed for AI and machine learning applications. Our key offerings include GPU Instances, Serverless GPUs, and AI Endpoints. — (https://docs.runpod.io/docs/about)
🤑Pricing
Based on previous tests in Colab, a single RTX 4080 is enough to run falcon-7b-instruct quantized.
!pip install runpod
import runpod
# your key
runpod.api_key = '...'
num_shard = 1
model_id = "tiiuae/falcon-7b-instruct"
quantize = "bitsandbytes"
pod = runpod.create_pod(
name="Falcon-7B-Instruct-POD",
image_name="ghcr.io/huggingface/text-generation-inference:latest",
gpu_type_id="NVIDIA GeForce RTX 4080",
cloud_type="COMMUNITY",
docker_args=f"--model-id {model_id} --num-shard {num_shard} --quantize {quantize}",
gpu_count=num_shard,
volume_in_gb=50,
container_disk_in_gb=5,
ports="80/http",
volume_mount_path="/data",
)
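The pod takes a few minutes to pull the image and load the weights. A minimal sketch that polls the /info route of the pod’s proxy URL (the same URL format used below) until TGI answers:
import time
import requests

url = f"https://{pod['id']}-80.proxy.runpod.net"
while True:
    try:
        if requests.get(f"{url}/info", timeout=5).status_code == 200:
            print("TGI is ready")
            break
    except requests.exceptions.RequestException:
        pass
    time.sleep(15)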
from langchain.llms import HuggingFaceTextGenInference
inference_server_url_cloud = f"https://{pod['id']}-80.proxy.runpod.net"
llm_cloud = HuggingFaceTextGenInference(
inference_server_url=inference_server_url_cloud,
max_new_tokens=1000,
top_k=10,
top_p=0.95,
typical_p=0.95,
temperature=0.3,
repetition_penalty=1.03,
)
llm_chain_cloud = LLMChain(prompt=prompt, llm=llm_cloud)
Test
llm_chain_cloud("your new question to falcon")
# stop pod
runpod.stop_pod(pod["id"])
# terminate
runpod.terminate_pod(pod["id"])
AWS Support for TGI
TGI also powers the Hugging Face LLM inference container for Amazon SageMaker, so the same serving stack can be deployed there.
Considerations
TGI is a fantastic tool for running Large Language Models efficiently and in a production-ready way. I hope this post has helped you!
🤠 Thanks for reading. See you in the next posts.
References
Thanks to Pavel for showing how to use TGI on RunPod.