Released capabilities
Hi,
as I understand it, the released model is not capable of OCR, bbox_to_text, or text_to_bbox, correct?
Are there any resources on how to go about fine-tuning the model for this?
Nice work and thank you!
Hi @ludeksvoboda, with the recent transformers release (run pip install --upgrade transformers) the model should be able to! Given bbox coordinates, it will perform OCR within that bbox.
from PIL import Image
import requests
import io
from transformers import FuyuForCausalLM, FuyuProcessor
pretrained_path = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(pretrained_path)
model = FuyuForCausalLM.from_pretrained(pretrained_path, device_map='auto')
bbox_prompt = "When presented with a box, perform OCR to extract text contained within it. If provided with text, generate the corresponding bounding box.\\n<box>388, 428, 404, 488</box>"
bbox_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bbox_sample_image.jpeg"
bbox_image_pil = Image.open(io.BytesIO(requests.get(bbox_image_url).content))
model_inputs = processor(text=bbox_prompt, images=bbox_image_pil).to('cuda')
model_outputs = processor.batch_decode(
    model.generate(**model_inputs, max_new_tokens=10)[:, -10:],
    skip_special_tokens=True,
)[0]
# Take the text after the '\x04 ' marker as the prediction; use '' if the marker is absent.
prediction = model_outputs.split('\x04 ', 1)[1] if '\x04 ' in model_outputs else ''
This should output Williams, the text contained within those coordinates. text_to_bbox should work as well, with processor.post_process_box_coordinates. Have fun!
Hi, nice work!
I wonder how to use text_to_bbox to locate items. I tried:
bbox_prompt = "When presented with a box, perform OCR to extract text contained within it. If provided with text, generate the corresponding bounding box.\\n 561 Dillman"
bbox_image_url = "https://huggingface.co/datasets/hf-internal-testing/fixtures-captioning/resolve/main/bbox_sample_image.jpeg"
bbox_image_pil = Image.open(io.BytesIO(requests.get(bbox_image_url).content))
model_inputs = processor(text=bbox_prompt, images=bbox_image_pil).to('cuda')
model_outputs = model.generate(**model_inputs, max_new_tokens=20)[:, -20:]
model_outputs = processor.post_process_box_coordinates(model_outputs)
model_outputs = processor.batch_decode(model_outputs, skip_special_tokens=True)[0]
print(model_outputs)
And it outputs:
text, generate the corresponding bounding box.\n Williams<box>388, 428, 404, 900</box>
Is this the right way to use it?
@cckevinn, have a look at https://huggingface.co/adept/fuyu-8b/discussions/38. Essentially you have it correct, I think.
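For anyone who wants the coordinates as numbers rather than as part of the decoded string, here is a minimal sketch that pulls every <box>...</box> span out of the text with a regular expression. The helper name extract_boxes is made up for illustration and is not part of transformers; the sample string is the output quoted above. It makes no claim about what order the four values are in.

```python
import re

# Matches "<box>a, b, c, d</box>" with optional whitespace around the numbers.
BOX_PATTERN = re.compile(
    r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>"
)

def extract_boxes(decoded):
    """Return every <box>...</box> span in `decoded` as a tuple of four ints.

    Hypothetical helper for illustration only; the values are returned in the
    order they appear in the string, with no reordering.
    """
    return [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(decoded)]

# Decoded output quoted earlier in this thread (raw string, so "\n" stays literal).
sample = r"text, generate the corresponding bounding box.\n Williams<box>388, 428, 404, 900</box>"
print(extract_boxes(sample))  # [(388, 428, 404, 900)]
```

If the string contains no box at all, the function returns an empty list, which is an easy way to detect the "no bbox generated" case.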
I have tried the linked solution, and it works somewhat on the image resized to 1/2 of its original size; very likely it does even better on the full-sized image. I also tried cropping the test image so that it contains only the part filled with text (removing the white space on both sides), and then it fails to generate any bbox: I get either an empty string or part of some text. I think the model has problems with different image sizes.
The only thing I had to tinker with was permuting the coordinates for plotting.
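import numpy as np
from PIL import Image, ImageDraw

def permute_bbox(bbox):
    # Swap the two values in each coordinate pair; PIL expects (x0, y0, x1, y1).
    return (bbox[1], bbox[0], bbox[3], bbox[2])

def plot_bbox(img, bbox):
    """Simplest way to plot a bounding box on the image (a PIL image is drawn on in place)."""
    if isinstance(img, np.ndarray):
        img = Image.fromarray(img)
    draw = ImageDraw.Draw(img)
    draw.rectangle(bbox, outline='red')
    return img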
The bounding box tasks are very sensitive to input resolution, because the model was trained on screenshots with a height of 1080 and not fine-tuned on other content. The best way to use these features is to scale your input image so its height is close to 1080. If the image is smaller, then padding to 1920x1080 works well.
This is the strategy used in the demo for this task. These are the lines that rescale and pad so that the input to the model is always 1920x1080: https://huggingface.co/spaces/adept/fuyu-8b-demo/blob/main/app.py#L71-L72
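As a rough illustration of the rescale-and-pad idea, here is a minimal Pillow sketch. It is not copied from the demo's app.py. The function names, the white padding colour, the top-left placement, and the choice to only ever shrink (never enlarge) the image are all assumptions made for this example.

```python
from PIL import Image

TARGET_W, TARGET_H = 1920, 1080

def fit_to_canvas(img, fill="white"):
    """Shrink `img` if it exceeds 1920x1080, then pad it onto a 1920x1080 canvas.

    Hypothetical helper: images that already fit are only padded, not enlarged.
    Returns the padded canvas and the scale factor that was applied.
    """
    scale = min(1.0, TARGET_W / img.width, TARGET_H / img.height)
    if scale < 1.0:
        new_size = (max(1, round(img.width * scale)), max(1, round(img.height * scale)))
        img = img.resize(new_size)
    canvas = Image.new("RGB", (TARGET_W, TARGET_H), fill)
    # Pasting at the top-left corner means padding adds no coordinate offset.
    canvas.paste(img.convert("RGB"), (0, 0))
    return canvas, scale

def unscale_box(box, scale):
    """Map box values predicted on the canvas back to original-image pixels."""
    return tuple(round(v / scale) for v in box)

# Usage: a 3840x2160 screenshot is halved; a small image is only padded.
canvas, scale = fit_to_canvas(Image.new("RGB", (3840, 2160)))
print(canvas.size, scale)  # (1920, 1080) 0.5
```

Because the image is pasted at the origin, a box predicted on the padded canvas maps back to the original image by dividing each value by the returned scale, whatever order the values are in.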