Write scalar.
Execute parallel.
Ark is a compute-native language that treats tensors, hardware cost, and distributed state as first-class primitives. You write clear intent—the compiler produces hyper-optimized kernels and deterministic grid dispatch.
__global__ void matrixMul(float *A, float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    if (row < N && col < N) {
        for (int i = 0; i < N; i++) {
            sum += A[row * N + i] * B[i * N + col];
        }
        C[row * N + col] = sum;
    }
}
// + 100 lines of brittle memory management...

// Matrix Multiplication
fn[gpu] matmul(a: Tensor<f32, 2>, b: Tensor<f32, 2>) -> Tensor<f32, 2> {
    // Ark natively handles tiling, shared memory layout,
    // and deterministic kernel dispatch.
    return a @ b;
}

Forget malloc and cudaMemcpy.
High-performance kernels shouldn’t require hand-rolled pointer arithmetic, manual barrier synchronization, and endless launch tuning. Ark keeps the source logic mathematically pure, while the compiler and runtime handle optimal memory placement and hardware scheduling.
- Zero-cost tensor abstractions (no manual launch params)
- Compile-time verification of multidimensional shapes
- Deterministic kernels across heterogeneous GPUs
- Placement hints are explicit, readable, and enforceable
let y = matmul(a, b) @runtime preset("prod");
let y = matmul(a, b) @runtime { target: "gpu:0" };
Implicit parallelism
Write code as if it runs sequentially. The Ark compiler analyzes dependencies and mathematically proves how to parallelize operations across thousands of GPU cores.
- Dependency analysis
- Auto-tiling + loop fusion
- Predictable hardware scheduling
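As an illustration of the points above, here is a sketch in hypothetical Ark syntax (the loop form, `mut`, and `len()` are assumptions not shown elsewhere on this page): a plain sequential loop with no thread indices or launch parameters, which dependency analysis can prove safe to spread across cores.

```
fn[gpu] saxpy(a: f32, x: Tensor<f32, 1>, y: Tensor<f32, 1>) -> Tensor<f32, 1> {
    // Written as if sequential: no blockIdx, no threadIdx, no barriers.
    let mut out = y;
    for i in 0..x.len() {
        // Each iteration touches only index i, so the dependency
        // analysis proves iterations independent and parallelizes them.
        out[i] = a * x[i] + y[i];
    }
    return out;
}
```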
Resource-aware
The type system tracks exact VRAM footprints and hardware constraints. If your requested tensor allocations exceed the target's capacity, compilation fails instantly.
- VRAM usage estimation
- Constraint propagation
- Fail-fast deployment grids
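A sketch of the fail-fast behavior described above, in hypothetical Ark syntax (the tensor sizes and the diagnostic wording are illustrative, not real compiler output):

```
fn[gpu] scores(q: Tensor<f32, 2>, k: Tensor<f32, 2>) -> Tensor<f32, 2> {
    return q @ k;  // materializes a [131072, 131072] f32 matrix: ~64 GiB
}

let s = scores(q, k) @runtime { target: "gpu:0" };
// Rejected at compile time, before anything reaches the grid:
//   kernel `scores` requires ~64 GiB of VRAM; target gpu:0 provides 24 GiB
```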
Hardware-agnostic
Target Nvidia, AMD, and edge devices from one pure codebase. Ark serves as a universal intermediate representation with deterministic, bit-exact lowering.
- Multi-backend lowering
- Stable Abstract IR
- Portable cryptographic artifacts
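Concretely, retargeting reuses the `@runtime` placement syntax shown earlier; only the target string changes. (The backend names below are illustrative assumptions, not a documented target list.)

```
// One pure source, three lowerings — bit-exact results on each.
let y_nvidia = matmul(a, b) @runtime { target: "cuda:0" };
let y_amd    = matmul(a, b) @runtime { target: "rocm:0" };
let y_edge   = matmul(a, b) @runtime { target: "edge:npu0" };
```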
A compiler built for correctness,
then ruthless optimization.
Ready to port your kernels?
Install the compiler, run through the interactive language tour, and deploy your first mathematically proven kernel to the grid.