Inference is the process of using a trained AI model to produce results in real time.
In operational terms, inference is the live production workload running inside the data centre. It is the activity that generates revenue from AI infrastructure.
Training builds the model.
Inference runs the service.
Inference runs continuously and consumes the majority of compute resources.
Most AI data centres operate with:
80–95% inference workloads
5–20% training workloads
Inference determines:
Revenue generation
Response latency and user experience
Overall infrastructure utilisation
Acceleration refers to specialised hardware or software designed to increase the speed and efficiency of compute workloads.
Accelerators perform complex calculations significantly faster than standard processors.
Acceleration reduces:
Processing time per request
Energy consumed per unit of work
Cost per inference
Inference is the workload.
Acceleration is the technology that enables the workload to run efficiently.
Inference generates output.
Acceleration improves performance.
An accelerator is a processor designed specifically for high-performance computing tasks such as artificial intelligence, machine learning, and data analytics.
Common accelerator types include:
GPUs (graphics processing units)
TPUs (tensor processing units)
FPGAs (field-programmable gate arrays)
Custom AI ASICs
GPU utilisation is the percentage of time a GPU is actively processing workloads.
Example:
100 GPUs installed
80 GPUs running workloads
GPU utilisation: 80%
High utilisation indicates efficient infrastructure usage and strong revenue performance.
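The same calculation, as a minimal Python sketch (the function and variable names are illustrative, not from any specific monitoring tool):

    def gpu_utilisation(active_gpus: int, installed_gpus: int) -> float:
        """Percentage of installed GPUs actively processing workloads."""
        return 100.0 * active_gpus / installed_gpus

    # The example above: 80 of 100 installed GPUs are running workloads.
    print(gpu_utilisation(80, 100))  # 80.0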
Latency is the time required for a system to process a request and return a response.
Latency is measured in milliseconds.
Lower latency results in:
Faster responses to user requests
Better user experience
Higher service quality
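As a hedged illustration, latency can be measured around any request in Python; handle_request here is a hypothetical stand-in for a real inference call:

    import time

    def handle_request() -> None:
        """Hypothetical stand-in for a real inference call."""
        time.sleep(0.05)

    start = time.perf_counter()
    handle_request()
    latency_ms = (time.perf_counter() - start) * 1000
    print(f"latency: {latency_ms:.1f} ms")  # roughly 50 ms for this placeholder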
Throughput is the total amount of work processed by a system within a specific time period.
Examples:
Requests served per second
Tokens generated per second
Images processed per minute
Higher throughput indicates higher system capacity and productivity.
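Throughput is simply work divided by time. A minimal sketch with illustrative numbers:

    requests_completed = 120_000  # illustrative: requests served in the window
    window_seconds = 60
    throughput = requests_completed / window_seconds
    print(f"{throughput:.0f} requests/second")  # 2000 requests/second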
Rack density is the amount of electrical power consumed by equipment installed in a single rack.
Measured in kilowatts (kW) per rack.
Typical ranges:
Traditional data centre: 5–10 kW per rack
AI data centre: 30–80 kW per rack
Next-generation AI infrastructure: 100–150 kW per rack
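Rack density follows directly from the equipment installed. A sketch, assuming an illustrative configuration of 8 multi-GPU servers per rack:

    servers_per_rack = 8   # assumption: dense AI servers in one rack
    kw_per_server = 10.0   # assumption: one multi-GPU server draws ~10 kW
    rack_density_kw = servers_per_rack * kw_per_server
    print(f"{rack_density_kw:.0f} kW per rack")  # 80 kW, the top of the AI range above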
Performance per watt measures the amount of computing work delivered for each unit of electricity consumed.
This metric is critical because electricity is the largest operating cost in modern data centres.
Higher performance per watt results in:
Lower electricity cost per unit of work
Higher profit margins
More compute output within a fixed power budget
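The metric itself is a simple ratio. A hedged sketch comparing two illustrative accelerators (figures are assumptions, not vendor specifications):

    def perf_per_watt(throughput: float, power_watts: float) -> float:
        """Computing work delivered per watt of electricity consumed."""
        return throughput / power_watts

    old_gen = perf_per_watt(throughput=10_000, power_watts=400)  # 25.0
    new_gen = perf_per_watt(throughput=30_000, power_watts=700)  # about 42.9
    print(f"improvement: {new_gen / old_gen:.1f}x")              # 1.7x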
Revenue per rack is determined by three primary factors:
Inference throughput × Utilisation rate × Service pricing
Supporting factors include:
Rack density
Performance per watt
Cooling efficiency
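A hedged worked example of the formula above (all throughput figures, rates, and prices are illustrative assumptions):

    # Revenue per rack = inference throughput x utilisation rate x service pricing
    tokens_per_second = 50_000        # assumption: rack-level inference throughput
    utilisation = 0.80                # assumption: share of capacity monetised
    price_per_million_tokens = 2.00   # assumption: service pricing in USD

    seconds_per_month = 30 * 24 * 3600
    tokens_sold = tokens_per_second * utilisation * seconds_per_month
    revenue = tokens_sold / 1_000_000 * price_per_million_tokens
    print(f"${revenue:,.0f} per rack per month")  # $207,360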
An inference cluster is a group of servers dedicated to running AI workloads in production.
Typical components include:
Accelerator servers
High-speed networking
Shared storage
Load balancing and orchestration software
Inference clusters operate continuously and support real-time applications.
Electricity is the largest operating cost.
Typical operating cost distribution:
Power: 40–60%
Cooling: 20–30%
Hardware depreciation: 15–25%
Operations and staffing: 5–10%
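A hedged sketch applying approximate midpoints of these ranges (normalised to 100%) to an illustrative monthly budget:

    monthly_opex = 1_000_000  # assumption: illustrative monthly operating budget (USD)
    # Approximate midpoints of the ranges above, normalised to sum to 100%.
    shares = {"Power": 0.50, "Cooling": 0.25,
              "Hardware depreciation": 0.20, "Operations and staffing": 0.05}
    for item, share in shares.items():
        print(f"{item}: ${monthly_opex * share:,.0f}")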
Accelerators generate significant heat during operation.
Effective cooling prevents:
Overheating and thermal throttling
Reduced hardware lifespan
Unplanned downtime
Modern AI facilities commonly use:
Direct-to-chip liquid cooling
Rear-door heat exchangers
Immersion cooling
An inference-optimised data centre is designed to support continuous, high-volume production workloads.
Key characteristics include:
High rack density
High-performance networking
Efficient cooling
Consistently high utilisation
Scaling refers to increasing infrastructure capacity to support additional workloads.
Two scaling methods:
Vertical scaling: increasing the performance of a single system.
Horizontal scaling: adding more systems to the environment.
Modern AI data centres rely primarily on horizontal scaling.
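A minimal sketch of why horizontal scaling dominates: capacity grows linearly as identical systems are added (the per-server figure is an assumption):

    server_throughput = 2_000  # assumption: requests/second per identical server

    def horizontal_capacity(servers: int) -> int:
        """Total capacity grows linearly as identical systems are added."""
        return servers * server_throughput

    for n in (10, 20, 40):
        print(f"{n} servers -> {horizontal_capacity(n):,} requests/second")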
Capacity is the maximum infrastructure capability.
Utilisation is the percentage of that capacity currently in use.
Example:
Installed capacity: 100 GPUs
Active usage: 70 GPUs
Utilisation: 70%
SLA stands for Service Level Agreement.
An SLA defines the performance and reliability commitments of a service provider.
Typical SLA metrics include:
Uptime percentage
Latency targets
Incident response times
Standard enterprise SLA target: 99.99% uptime
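What an uptime percentage means in practice can be computed directly; a short sketch:

    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    for uptime in (0.999, 0.9999):
        downtime = minutes_per_year * (1 - uptime)
        print(f"{uptime:.2%} uptime allows {downtime:.0f} minutes of downtime per year")
    # 99.90% allows about 526 minutes; 99.99% allows about 53 minutes.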
Networking determines how quickly data moves between systems within the data centre.
High-performance networking enables:
Low-latency communication between servers
High throughput for distributed workloads
Efficient horizontal scaling across the cluster
Common data centre networking technologies include:
High-speed Ethernet (100–800 GbE)
InfiniBand
Consistent infrastructure utilisation is essential.
High utilisation ensures:
Steady revenue from installed hardware
Efficient return on capital investment
Idle infrastructure generates cost but no revenue.
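A hedged sketch of the economics (all costs and rates are illustrative assumptions):

    gpus = 100
    monthly_cost_per_gpu = 1_500            # assumption: power, cooling, depreciation
    monthly_revenue_per_active_gpu = 2_500  # assumption: service revenue per busy GPU

    for utilisation in (0.5, 0.7, 0.9):
        revenue = gpus * utilisation * monthly_revenue_per_active_gpu
        cost = gpus * monthly_cost_per_gpu  # paid whether GPUs are busy or idle
        print(f"{utilisation:.0%} utilisation -> profit ${revenue - cost:,.0f}/month")
    # 50% loses money, 70% is profitable, 90% triples the margin.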
Compute capacity is directly limited by available electrical power.
More power enables:
More accelerators per rack
Higher rack density
Greater total compute output
Power is the primary constraint in modern AI infrastructure.
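A minimal sketch of how a facility power budget caps compute capacity (all figures are illustrative assumptions):

    facility_power_kw = 10_000  # assumption: total power available for IT load
    rack_density_kw = 50        # within the 30-80 kW AI range above
    gpus_per_rack = 32          # assumption

    racks = facility_power_kw // rack_density_kw
    print(f"{racks} racks, about {racks * gpus_per_rack:,} accelerators maximum")
    # 200 racks and roughly 6,400 accelerators at this power budget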
CPU workloads handle general-purpose computing tasks.
Examples:
Web serving and APIs
Databases
Orchestration and general application logic
Accelerator workloads handle high-performance computing tasks.
Examples:
AI inference
Model training
Large-scale data analytics
A high-performance AI data centre delivers:
High throughput
Low latency
High utilisation
Strong performance per watt
The operational objective is to maximise compute output while minimising energy consumption and downtime.
This is achieved through:
Hardware acceleration
Consistently high utilisation
Efficient cooling and power management
Dense, scalable infrastructure
The core principle is:
Inference drives revenue.
Acceleration drives efficiency.
Utilisation drives profitability.