

Phi Hosting Service: Self-Host Phi3, Phi4, Phi3.5, and Phi4-mini Efficiently

Phi Hosting provides optimized infrastructure for deploying Microsoft's lightweight yet high-performance Phi family of language models, including Phi-3, Phi-3.5, Phi-4, Phi-4-Mini, and Phi-4-Reasoning. These models are designed for efficiency and reasoning tasks, with smaller parameter sizes (ranging from ~1.3B to ~14B) but surprisingly strong capabilities in commonsense, coding, and instruction following. Phi models can be hosted using vLLM, Transformers + TGI, or Ollama for quantized formats (GGUF/INT4).

Phi Hosting with Ollama — GPU Recommendation

Ollama abstracts away the complexity of local LLM hosting with an OpenAI-compatible API, making it easy to run Phi models on laptops, desktops, or lightweight servers. This setup is perfect for developers building intelligent assistants, reasoning agents, or on-device chatbots.

Model Name	Size (4-bit Quantization)	Recommended GPUs	Tokens/s
phi:2.7b	1.6GB	P1000 < GTX1650 < GTX1660 < RTX2060 < RTX5060	19.46~132.97
phi3:3.8b phi4-mini:3.8b	2.2GB	P1000 < GTX1650 < GTX1660 < RTX2060 < RTX5060	18.87-75.94
phi3:14b	7.9GB	A4000 < V100	38.46-67.51
phi4:14b	9.1GB	A4000 < V100	30.20-48.63

Phi Hosting with vLLM + Hugging Face — GPU Recommendation

Using vLLM ensures optimal GPU memory utilization and fast token generation, while Hugging Face Transformers provides access to the latest model variants and formats. This hosting stack is ideal for building reasoning engines, chatbots, and AI agents powered by the efficient Phi family.

Model Name	Size (16-bit Quantization)	Recommended GPU(s)	Concurrent Requests	Tokens/s
microsoft/Phi-3.5-vision-instruct	~8.8GB	V100 < A5000 < RTX4090	50	~2000-6000

✅ Explanation:

Recommended GPUs: From left to right, performance from low to high
Tokens/s: from benchmark data.

Choose The Best GPU Plans for Phi Service Hosting

All Plans
New Arrivals
Promotions

product line:
GPU VPS
GPU Dedicated Server

GPU Use Scenario:
Live Streaming
HD Gaming
3D Rendering
Video Editing
AI&Deep Learning
CAD/CGI/DCC

GPU Memory:
2 GB
4 GB
6 GB
8 GB
16 GB
24 GB
32 GB
40 GB
48 GB
72 GB
80 GB
96 GB
144 GB
160 GB
192 GB

GPU Card Model:
GT 730
P600
P1000
T1000
GTX 1650
GTX 1660
RTX 2060
RTX 3060 Ti
RTX 4060
RTX 5060
RTX A4000
RTX Pro 2000
RTX A5000
RTX Pro 4000
RTX A6000
RTX Pro 5000
RTX Pro 6000
RTX 4090
RTX 5090
A100
H100
K80
V100
P100
A40

Express GPU VPS - 2GB

$ 17.98/mo

38% OFF (Was $29.00)

1mo3mo12mo24mo

Order Now

GPU Model: GT730|P600|K620
CPU: 8 CPU Cores
Memory: 16GB RAM
Disk: 120GB SSD
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA
Backup: Once per 4 Weeks

Lite Dedicated GPU Server - P600

$ 49.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: P600
CPU: 4-Core Xeon E3-1230
Memory: 16GB RAM
Disk: 120GB SSD+960GB SSD
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Express Dedicated GPU Server - P1000

$ 40.70/mo

45% OFF (Was $74.00)

1mo3mo12mo24mo

Order Now

GPU Model: P1000
CPU: 8-Core Xeon E5-2690
Memory: 32GB RAM
Disk: 120GB SSD + 960GB SSD
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Basic Dedicated GPU Server - K80

$ 109.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: K80
CPU: 8-Core Xeon E5-2690
Memory: 64GB RAM
Disk: 120GB SSD + 960GB SSD
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Basic GPU VPS - RTX 5060

$ 85.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: RTX 5060
CPU: 16 CPU Cores
Memory: 28GB RAM
Disk: 240GB SSD
Bandwidth: 200Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA
Backup: Once per 4 Weeks

Basic Dedicated GPU Server - T1000

$ 99.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: T1000
CPU: 8-Core Xeon E5-2690
Memory: 64GB RAM
Disk: 120GB SSD + 960GB SSD
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Basic Dedicated GPU Server - GTX 1650

$ 59.50/mo

50% OFF (Was $119.00)

1mo3mo12mo24mo

Order Now

GPU Model: GTX 1650
CPU: 8-Core Xeon E5-2667v3
Memory: 64GB RAM
Disk: 120GB SSD + 960GB SSD
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Professional GPU VPS - RTX Pro 2000

$ 95.20/mo

20% OFF (Was $119.00)

1mo3mo12mo24mo

Order Now

GPU Model: RTX Pro 2000
CPU: 16 CPU Cores
Memory: 28GB RAM
Disk: 240GB SSD
Bandwidth: 300Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA
Backup: Once per 2 Weeks

Basic Dedicated GPU Server - GTX 1660

$ 71.55/mo

55% OFF (Was $159.00)

1mo3mo12mo24mo

Order Now

GPU Model: GTX 1660
CPU: 16-Core Dual E5-2660
Memory: 64GB RAM
Disk: 120GB SSD + 960GB SSD
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Professional GPU VPS - RTX A4000

$ 119.00/mo

20% OFF (Was $149.00)

1mo3mo12mo24mo

Order Now

GPU Model: RTX A4000
CPU: 24 CPU Cores
Memory: 28GB RAM
Disk: 320GB SSD
Bandwidth: 300Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA
Backup: Once per 2 Weeks

Basic Dedicated GPU Server - RTX 4060

$ 89.50/mo

50% OFF (Was $179.00)

1mo3mo12mo24mo

Order Now

GPU Model: RTX 4060
CPU: 8-Core Xeon E5-2690
Memory: 64GB RAM
Disk: 120GB SSD + 960GB SSD
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Basic Dedicated GPU Server - RTX 5060

$ 159.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: RTX 5060
CPU: 24-Core Platinum 8160
Memory: 64GB RAM
Disk: 120GB SSD+960GB SSD
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Professional Dedicated GPU Server - P100

$ 89.50/mo

55% OFF (Was $199.00)

1mo3mo12mo24mo

Order Now

GPU Model: P100
CPU: 16-Core Dual E5-2660
Memory: 128GB RAM
Disk: 120GB SSD + 960GB SSD
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Professional Dedicated GPU Server - RTX 2060

$ 159.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: RTX 2060
CPU: 16-Core Dual E5-2660
Memory: 128GB RAM
Disk: 120GB SSD + 960GB SSD
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Advanced GPU VPS - RTX Pro 4000

$ 159.00/mo

20% OFF (Was $199.00)

1mo3mo12mo24mo

Order Now

GPU Model: RTX Pro 4000
CPU: 24 CPU Cores
Memory: 56GB RAM
Disk: 320GB SSD
Bandwidth: 500Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA
Backup: Once per 2 Weeks

Advanced Dedicated GPU Server - RTX 2060

$ 179.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: RTX 2060
CPU: 40-Core Dual Gold 6148
Memory: 128GB RAM
Disk: 120GB SSD + 960GB SSD
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Advanced Dedicated GPU Server - RTX 3060 Ti

$ 107.55/mo

55% OFF (Was $239.00)

1mo3mo12mo24mo

Order Now

GPU Model: RTX 3060 Ti
CPU: 24-Core Dual E5-2697v2
Memory: 128GB RAM
Disk: 240GB SSD+2TB SSD
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Advanced Dedicated GPU Server - RTX A4000

$ 209.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: RTX A4000
CPU: 24-Core Dual E5-2697v2
Memory: 128GB RAM
Disk: 240GB SSD+2TB SSD
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Advanced Dedicated GPU Server - V100

$ 131.56/mo

56% OFF (Was $299.00)

1mo3mo12mo24mo

Order Now

GPU Model: V100
CPU: 24-Core Dual E5-2690v3
Memory: 128GB RAM
Disk: 240GB SSD+2TB SSD
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Advanced Dedicated GPU Server - RTX A5000

$ 269.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: RTX A5000
CPU: 24-Core Dual E5-2697v2
Memory: 128GB RAM
Disk: 240GB SSD+2TB SSD
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Advanced GPU VPS - RTX Pro 5000

$ 269.00/mo

23% OFF (Was $349.00)

1mo3mo12mo24mo

Order Now

GPU Model: RTX Pro 5000
CPU: 24 CPU Cores
Memory: 56GB RAM
Disk: 320GB SSD
Bandwidth: 500Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA
Backup: Once per 2 Weeks

Advanced GPU VPS - RTX 5090

$ 399.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: RTX 5090
CPU: 32 CPU Cores
Memory: 84GB RAM
Disk: 400GB SSD
Bandwidth: 500Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA
Backup: Once per 2 Weeks

Enterprise Dedicated GPU Server - RTX 4090

$ 307.44/mo

44% OFF (Was $549.00)

1mo3mo12mo24mo

Order Now

GPU Model: RTX 4090
CPU: 36-Core Dual E5-2697v4
Memory: 256GB RAM
Disk: 240GB SSD+2TB NVMe+8TB SATA
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Enterprise Dedicated GPU Server - RTX A6000

$ 329.40/mo

40% OFF (Was $549.00)

1mo3mo12mo24mo

Order Now

GPU Model: RTX A6000
CPU: 36-Core Dual E5-2697v4
Memory: 256GB RAM
Disk: 240GB SSD+2TB NVMe+8TB SATA
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Enterprise Dedicated GPU Server - A40

$ 439.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: A40
CPU: 36-Core Dual E5-2697v4
Memory: 256GB RAM
Disk: 240GB SSD+2TB NVMe+8TB SATA
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Enterprise Dedicated GPU Server - RTX 5090

$ 479.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: RTX 5090
CPU: 36-Core Dual E5-2697v4
Memory: 256GB RAM
Disk: 240GB SSD+2TB NVMe+8TB SATA
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Enterprise GPU VPS - RTX Pro 6000

$ 479.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: RTX Pro 6000
CPU: 32 CPU Cores
Memory: 84GB RAM
Disk: 400GB SSD
Bandwidth: 1000Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA
Backup: Once per 2 Weeks

Enterprise Multi-GPU Dedicated Server - 3xV100

$ 469.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: 3 x V100
CPU: 36-Core Dual E5-2697v4
Memory: 256GB RAM
Disk: 240GB SSD+2TB NVMe+8TB SATA
Bandwidth: 1000Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Enterprise Multi-GPU Dedicated Server - 3xRTX A5000

$ 539.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: 3 x RTX A5000
CPU: 36-Core Dual E5-2697v4
Memory: 256GB RAM
Disk: 240GB SSD+2TB NVMe+8TB SATA
Bandwidth: 1000Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Enterprise Dedicated GPU Server - A100

$ 359.55/mo

55% OFF (Was $799.00)

1mo3mo12mo24mo

Order Now

GPU Model: A100
CPU: 36-Core Dual E5-2697v4
Memory: 256GB RAM
Disk: 240GB SSD+2TB NVMe+8TB SATA
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Enterprise Multi-GPU Dedicated Server - 2xRTX 4090

$ 729.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: 2 x RTX 4090
CPU: 36-Core Dual E5-2697v4
Memory: 256GB RAM
Disk: 240GB SSD+2TB NVMe+8TB SATA
Bandwidth: 1000Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Enterprise Multi-GPU Dedicated Server - 2xRTX 5090

$ 859.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: 2 x RTX 5090
CPU: 44-core Dual E5-2699v4
Memory: 256GB RAM
Disk: 240GB SSD+2TB NVMe+8TB SATA
Bandwidth: 1000Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Enterprise Multi-GPU Dedicated Server - 3xRTX A6000

$ 899.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: 3 x RTX A6000
CPU: 36-Core Dual E5-2697v4
Memory: 256GB RAM
Disk: 240GB SSD+2TB NVMe+8TB SATA
Bandwidth: 1000Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Enterprise Multi-GPU Dedicated Server - 4xRTX A6000

$ 1199.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: 4 x RTX A6000
CPU: 44-core Dual E5-2699v4
Memory: 512GB RAM
Disk: 240GB SSD+4TB NVMe+16TB SATA
Bandwidth: 1000Mbps Unmetered
NVLink: 2xNVLink

IP: 1 Dedicated IPv4
Location: USA

Enterprise Dedicated GPU Server - A100(80GB)

$ 1559.00/mo

8% OFF (Was $1699.00)

1mo3mo12mo24mo

Order Now

GPU Model: A100(80GB)
CPU: 36-Core Dual E5-2697v4
Memory: 256GB RAM
Disk: 240GB SSD+2TB NVMe+8TB SATA
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

Enterprise Multi-GPU Dedicated Server - 4xA100

$ 1899.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: 4 x A100
CPU: 44-core Dual E5-2699v4
Memory: 512GB RAM
Disk: 240GB SSD+4TB NVMe+16TB SATA
Bandwidth: 1000Mbps Unmetered
NVLink: 6xNVLink

IP: 1 Dedicated IPv4
Location: USA

Enterprise Dedicated GPU Server - H100

$ 2099.00/mo

1mo3mo12mo24mo

Order Now

GPU Model: H100
CPU: 36-Core Dual E5-2697v4
Memory: 256GB RAM
Disk: 240GB SSD+2TB NVMe+8TB SATA
Bandwidth: 100Mbps Unmetered

IP: 1 Dedicated IPv4
Location: USA

What is Microsoft Phi Hosting?

Microsoft Phi Hosting is to the deploy and serve of Microsoft’s lightweight language models—such as Phi-3, Phi-3.5, Phi-4, Phi-4-Mini, and Phi-4-Reasoning—on dedicated infrastructure or cloud environments. These models are optimized for reasoning, efficiency, and fast inference, making them ideal for lightweight AI applications.

Self-hosted Phi Hosting means running the models on your own servers or edge devices. You can use tools like Ollama, vLLM, or Transformers to serve Phi models with full control over hardware, latency, data privacy, and model behavior.

In contrast, Phi as a Service lets you access the Phi models via public cloud APIs—typically through providers like Azure, Hugging Face Inference Endpoints, or hosted APIs by third parties.

LLM Benchmark Test Results for Microsoft Phi Service

Tests were conducted across multiple serving backends (e.g., vLLM, Ollama, Hugging Face Transformers) and GPU configurations to evaluate real-world performance under different quantization levels (FP16, INT8, AWQ, GGUF).

Ollama Benchmark for Microsoft Phi

This benchmark evaluates the performance of Microsoft’s Phi language models—including Phi-3, Phi-3.5, Phi-4, and Phi-4-Mini—when hosted using the Ollama inference engine. Ollama supports GGUF quantized formats, enabling efficient local deployment with minimal hardware requirements. The benchmark includes startup time, token generation speed (tokens per second), VRAM usage, and responsiveness across different GPU classes (RTX 3060, 3090, 4090, etc.).

vLLM Benchmark for Microsoft Phi

This benchmark measures the inference performance of Microsoft's Phi language models—including Phi-3, Phi-3.5, Phi-4, Phi-4-Mini, and Phi-4-Reasoning—using the vLLM inference engine with models served from Hugging Face in full-precision or AWQ quantized formats. The test evaluates key metrics such as token throughput, latency, GPU memory usage, and scalability under concurrent requests.

How to Self-host Microsoft Phi4 with Ollama/vLLM

Install and Run Microsoft Phi Locally with Ollama >

Ollama is a self-hosted AI solution to run open-source large language models, such as DeepSeek, Gemma, Llama, Mistral, and other LLMs locally or on your own infrastructure.

Install and Run Microsoft Phi Locally with vLLM >

vLLM is an optimized framework designed for high-performance inference of Large Language Models (LLMs). It focuses on fast, cost-efficient, and scalable serving of LLMs.

What Does Microsoft Phi4 Hosting Stack Include?

Hosting Phi4 models efficiently requires a robust software and hardware stack. A typical Phi LLM hosting stack includes the following components:

Hardware Stack

✅ High-memory GPUs: RTX 4090, A5000, or A100 40GB for full-precision or concurrent workloads

✅ CPU: Multi-core (8+ cores) for fast data loading and support processes

✅ RAM: 32GB+ system memory recommended to support model loading and runtime stability

✅ Storage: NVMe SSD for fast model loading (at least 50–100GB free space for multiple variants)

Software Stack

✅ Model Format: Hugging Face Transformers, GGUF (for llama.cpp/Ollama), or AWQ/GPTQ quantized weights

✅ Inference Engine: vLLM, Ollama, llama.cpp

✅ Serving Tools: FastAPI, OpenAI-compatible APIs, TGI (Text Generation Inference), Docker

✅ Optional Add-ons: LoRA fine-tuning loaders, quantization tools (AutoAWQ, GPTQ), monitoring stack (Prometheus, Grafana)

Why Phi Hosting Needs a Specialized Hardware + Software Stack

Optimized for Lightweight Yet Demanding Models

Despite being smaller than many LLMs, Phi models like Phi-4 and Phi-4-Reasoning are optimized for complex reasoning and instruction following, which demands efficient memory management and fast token generation—necessitating well-configured GPUs and inference engines.

Support for Quantized and Full-Precision Variants

Phi models are available in formats like FP16, AWQ, and GGUF (INT4/INT8). Hosting them efficiently requires software that supports format-specific optimizations—such as vLLM for AWQ and Ollama for GGUF—to balance performance and hardware resource usage.

Low Latency, High Throughput Needs

Whether self-hosted or serving users via API, Phi hosting requires real-time responsiveness. Engines like vLLM or TGI are designed for dynamic batching and asynchronous execution, which standard model runtimes can’t handle well under load.

Hardware Constraints and Deployment Flexibility

Phi models are often used in low-cost or edge scenarios, so selecting the right GPU memory size and architecture is critical. The hosting stack must be optimized for deployment on everything from consumer GPUs (like RTX 3060/3090) to enterprise-grade cards (A100/4090) to ensure cost-effective scalability.

Self-hosted Phi Hosting vs. Phi as a Service

Feature	Self-hosted Phi Hosting	Phi as a Service
Infrastructure Ownership	You own and manage the server and GPU resources	Fully managed by third-party providers
Model Control & Customization	Full control over model version, quantization, and config	Limited or no control over model internals
Latency & Performance	Optimized for local or on-prem use, low latency possible	May experience higher latency due to remote hosting
Privacy & Data Security	High — data stays on your hardware	Depends on provider policies and cloud environment
Scalability	Manual — add more hardware or scale vertically	Easy to scale — provider handles infrastructure
Initial Setup Complexity	Requires setup: GPU drivers, inference engines, etc.	No setup needed — ready-to-use APIs
Operating Cost	Higher upfront cost, lower long-term cost	Pay-as-you-go; higher cost over time
Ideal For	Developers, startups, enterprises with infra expertise	Prototyping, low-traffic apps, quick deployments
Example Tools	vLLM, Ollama, Hugging Face Transformers, llama.cpp	Azure AI Studio, Hugging Face Inference Endpoints

FAQs: Microsoft Phi 2.7B/3.8B/14B Models Hosting Service

What are the system requirements for hosting Phi Service?



Phi-2.7B / 3.8B can run efficiently on GPUs with 8–16GB VRAM, especially in quantized formats (e.g., GGUF or AWQ). Phi-14B requires at least 24GB VRAM for quantized inference, and 40GB+ (like A100) for full-precision (FP16/FP32) inference.

Which inference engines support Phi Service?



Ollama (for GGUF format; great for local quantized models)

vLLM (for AWQ/FP16/FP32 models; optimized for throughput and batching)

Transformers + TGI (for REST API deployments)

llama.cpp (for edge or lightweight environments)

Can I run Phi Service on CPU?



Technically yes, especially the Phi-2.7B in INT4 format using llama.cpp. However, performance will be very slow without GPU acceleration.

Are there quantized versions of Phi Service?



Yes. Most Phi models (including Phi-3 and Phi-14B) are available in GGUF (INT4/INT8) and AWQ (Weight-only quantization) formats, reducing memory usage while preserving reasonable performance.

What are the recommended GPUs?



For Phi-2.7B / 3.8B: RTX 3060, 4060 Ti, A4000 (with 8–16GB VRAM). For Phi-14B: RTX 4090, A100 (24–40GB VRAM depending on precision level)

Keywords:

Phi hosting, Phi model, Phi 14B hosting, Phi 3.8B Ollama, Phi 2.7B vLLM, self-hosted Phi, Phi GPU, deploy phi models, phi, ollama phi, AWQ, GGUF, phi4 reasoning