
Unlocking AI Potential in .NET with CPU: A Deep Dive into Phi-3 Vision with ONNX Runtime

Dec 2, 2024 · 5 min read


Introduction

Artificial Intelligence continues to shape the future of software development, and integrating state-of-the-art models into modern ecosystems like .NET has become more accessible than ever. In this post, we’ll explore a proof of concept showcasing the power of the microsoft/Phi-3-vision-128k-instruct-onnx-cpu model within the .NET environment, leveraging the capabilities of ONNX Runtime for CPU execution.

What is Phi-3 Vision?

Phi-3 Vision is a cutting-edge model designed by Microsoft for vision-based tasks, such as analyzing images and generating detailed textual descriptions. This model, optimized for CPU environments, is particularly useful for scenarios where GPU resources are unavailable.

In this guide, we’ll demonstrate how to integrate and run Phi-3 Vision in the .NET ecosystem, showcasing its potential through practical examples.

Setting Up the Environment

Before diving into the implementation, let’s prepare the setup. Follow these simple steps (tested on Windows):

1. Create and Activate a Python Virtual Environment

python -m venv venv
.\venv\Scripts\activate
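
The commands above are for Windows; on Linux or macOS the equivalent activation command is:

source venv/bin/activate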

2. Install HuggingFace CLI

pip install huggingface-hub[cli]

3. Download the Phi-3 Vision Model

huggingface-cli download microsoft/Phi-3-vision-128k-instruct-onnx-cpu --include cpu-int4-rtn-block-32-acc-level-4/* --local-dir models/microsoft/Phi-3-vision-128k-instruct-onnx-cpu

4. Build the .NET Solution

dotnet build .\hello-phi-3-vision.sln -c Release
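
If you are recreating the project from scratch rather than cloning the repository, note that the code in the next step assumes a reference to the Microsoft.ML.OnnxRuntimeGenAI NuGet package, which can be added with the .NET CLI:

dotnet add package Microsoft.ML.OnnxRuntimeGenAI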

5. The key source code to run the model:

namespace hello_phi_3_vision
{
    #region using

    using Microsoft.ML.OnnxRuntimeGenAI;

    #endregion

    public interface IHelloPhi3VisionService
    {
        #region Methods

        public void Run(string modelPath, string userPrompt, string imagePath);

        #endregion
    }

    public class HelloPhi3VisionService : IHelloPhi3VisionService
    {
        #region Public Methods

        public void Run(string modelPath, string userPrompt, string imagePath)
        {
            // Load the ONNX model and create the multimodal processor and tokenizer stream.
            using var model = new Model(modelPath);
            using var processor = new MultiModalProcessor(model);
            using var tokenizerStream = processor.CreateStream();

            var hasImage = !string.IsNullOrWhiteSpace(imagePath);

            // Only load the image when a path was provided.
            Images? images = hasImage ? Images.Load(imagePath) : null;

            // Build the Phi-3 Vision prompt template: the user turn, an optional
            // image placeholder, and the assistant turn the model will complete.
            var prompt = "<|user|>\n";
            prompt += hasImage ? "<|image_1|>\n" : "";
            prompt += userPrompt + "<|end|>\n<|assistant|>\n";

            Console.WriteLine("Processing...");
            using var inputs = processor.ProcessImages(prompt, images);

            Console.WriteLine("Generating response...");

            using var generatorParams = new GeneratorParams(model);
            generatorParams.SetInputs(inputs);
            generatorParams.SetSearchOption("max_length", 3072);

            using var generator = new Generator(model, generatorParams);

            Console.WriteLine("================ Output ================");

            // Generate token by token and stream the decoded text to the console.
            while (!generator.IsDone())
            {
                generator.ComputeLogits();
                generator.GenerateNextToken();
                var newTokens = generator.GetSequence(0);
                var output = tokenizerStream.Decode(newTokens[^1]);
                Console.Write(output);
            }

            Console.WriteLine();
            Console.WriteLine("==========================================");
        }

        #endregion
    }
}
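
To see how this service plugs into the console application, here is a minimal, hypothetical entry point; the real project's Program.cs parses command-line arguments (such as --model, used in the run command below), while the prompt and image path here are placeholder values for illustration only:

namespace hello_phi_3_vision.consoleapp
{
    using hello_phi_3_vision;

    public static class Program
    {
        public static void Main(string[] args)
        {
            // Hypothetical hard-coded values for illustration; the actual console app
            // reads them from command-line arguments (e.g. --model).
            var modelPath = @".\models\microsoft\Phi-3-vision-128k-instruct-onnx-cpu\cpu-int4-rtn-block-32-acc-level-4";
            var userPrompt = "Describe the image in detail.";
            var imagePath = @".\images\sample.jpg"; // hypothetical sample image

            IHelloPhi3VisionService service = new HelloPhi3VisionService();
            service.Run(modelPath, userPrompt, imagePath);
        }
    }
}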


Running the Phi-3 Vision Proof of Concept

With the environment ready, let’s execute the model and test its capabilities.

Command to Run the Example:

dotnet run --project .\hello-phi-3-vision.consoleapp\hello-phi-3-vision.consoleapp.csproj -c Release --model .\models\microsoft\Phi-3-vision-128k-instruct-onnx-cpu\cpu-int4-rtn-block-32-acc-level-4

Examples in Action

Example 1: Analyzing a MotoGP Poster

Input Image: A promotional poster for the 2024 Gran Premio GoPro de Aragón.

Model Output:

“The image is a promotional poster for the 2024 Gran Premio GoPro de Aragón motorcycle racing event. It features a group of motorcyclists in racing suits with sponsor logos, standing in front of a dark background that resembles a racetrack. The text ‘MOTORLAND ARAGON’ is prominently displayed in large white letters across the center of the poster. The event dates ’30–31 Ago 11 Sept’ are mentioned at the top, and the hashtag ‘#AragonGP’ is visible. The bottom of the poster includes logos of GoPro, FIM, and MOTORLAND, along with the text ‘GRAN PREMIO DE ARAGON 2024’ and ‘MOTORLAND Aragón’. The overall design suggests an official MotoGP poster.”

This result highlights the model’s ability to accurately interpret visual content and generate detailed descriptions.

Example 2: Reading a Floor Plan

Input Image: A floor plan of a residential building.

Model Output:

“The image presents a detailed floor plan of a residential building. The plan is divided into three distinct sections, each representing a different floor of the building.

On the left, we have the ground floor, labeled as “Planta Baja”. This floor is characterized by a spacious living area, a dining area, and a kitchen. The living area is centrally located, with the dining area positioned adjacent to it. The kitchen is situated towards the back of the living area.

The middle section of the plan represents the first floor, labeled as “Planta Alta”. This floor features a bedroom, a bathroom, and a balcony. The bedroom is located in the center of the floor, with the bathroom situated to its left. The balcony is situated on the right side of the bedroom.

The rightmost section of the plan represents the second floor, labeled as “Elevación Principal”. This floor is characterized by a living room, a dining room, and a kitchen. The living room is centrally located, with the dining room situated to its left and the kitchen to its right.

Each floor is interconnected, allowing for easy movement between the different areas of the building. The plan is meticulously detailed, providing a comprehensive overview of the building’s layout.”

Phi-3 Vision excels in architectural contexts, breaking down complex visual data into structured, understandable information.

Why Choose Phi-3 Vision in .NET?

  • CPU Optimization: Ideal for environments where GPU resources are limited.
  • Seamless .NET Integration: Powered by ONNX Runtime, it fits naturally into .NET solutions.
  • Versatile Applications: From marketing material analysis to architectural design, the possibilities are vast.


Conclusion

The hello-phi-3-vision proof of concept demonstrates how AI models like Phi-3 Vision can be seamlessly integrated into the .NET ecosystem on CPU, paving the way for smarter and more efficient applications. Whether you're exploring visual AI or enhancing your .NET toolkit, this project is a perfect starting point. 🌟

GitHub repository with source code 🚀


Written by Javier Carracedo

Hi! SW Engineer from León (Spain). I ❤ my work, and I like to improve and grow my knowledge of different technologies. SW Engineer at HP SCDS, León (Spain).
