
# Kernel Creation

Tinygrad lazily builds up a graph of Tensor operations. The Tensor graph includes a mix of:

- Buffer and Assignment Ops: `BUFFER`, `BUFFER_VIEW`, `COPY`, `ASSIGN`
- Movement Ops: `RESHAPE`, `EXPAND`, `PERMUTE`, `PAD`, `SHRINK`, `FLIP`
- Compute Ops: `ADD`, `MUL`, `REDUCE_AXIS`, ...

`Tensor.kernelize` creates the kernels and buffers needed to realize the output Tensor(s).
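To make the idea of a lazily built graph concrete, here is a minimal pure-Python sketch. It is a toy model, not tinygrad's implementation: the `Node` class and the `buffer`, `mul`, `add` and `evaluate` helpers are hypothetical names invented for illustration. Building the expression only creates nodes, and values are computed when `evaluate` is called.

```python
from dataclasses import dataclass

# Toy stand-in for a lazy graph node. This is NOT tinygrad's UOp class.
@dataclass(frozen=True)
class Node:
    op: str              # e.g. "BUFFER", "MUL", "ADD"
    src: tuple = ()      # parent nodes
    arg: object = None   # payload, e.g. the data held by a BUFFER

def buffer(data): return Node("BUFFER", arg=tuple(data))
def mul(x, y): return Node("MUL", (x, y))
def add(x, y): return Node("ADD", (x, y))

def evaluate(node):
    """Walk the graph and compute values only when asked."""
    if node.op == "BUFFER":
        return list(node.arg)
    lhs, rhs = (evaluate(s) for s in node.src)
    if node.op == "MUL":
        return [p * q for p, q in zip(lhs, rhs)]
    if node.op == "ADD":
        return [p + q for p, q in zip(lhs, rhs)]
    raise ValueError(f"unknown op {node.op}")

a, b, c = buffer([1]), buffer([2]), buffer([3])
out = add(mul(a, b), c)   # only graph nodes are created here, nothing is computed
result = evaluate(out)    # computation happens on demand
```

In this toy, `out` is just an `ADD` node whose sources are a `MUL` node and a `BUFFER` node until `evaluate` walks it.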

## Kernelize flow

Let's see how a multiply-add Tensor graph becomes a single fused elementwise kernel.

```py
from tinygrad import Tensor

# initialize 3 input buffers on the device
a = Tensor([1]).realize()
b = Tensor([2]).realize()
c = Tensor([3]).realize()

# create the Tensor graph
mul = a*b
out = mul+c

print(mul) # <Tensor <UOp METAL (1,) int (<Ops.MUL: 48>, None)> on METAL with grad None>
print(out) # <Tensor <UOp METAL (1,) int (<Ops.ADD: 52>, None)> on METAL with grad None>

out.kernelize()

print(mul) # <Tensor <UOp METAL (1,) int (<Ops.MUL: 48>, None)> on METAL with grad None>
print(out) # <Tensor <UOp METAL (1,) int (<Ops.ASSIGN: 66>, None)> on METAL with grad None>
```

The multiply Tensor's UOp is unchanged because the multiply is fused into the kernel that computes `out`. The output Tensor's UOp becomes a new ASSIGN UOp:

```py
print(out.uop)
```

The first source is the output BUFFER:

```
UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
  UOp(Ops.DEVICE, dtypes.void, arg='METAL', src=()),
  UOp(Ops.UNIQUE, dtypes.void, arg=6, src=()),))
```

And the second source is the KERNEL and its 4 buffer edges (output_buffer, a, b, c):

```
UOp(Ops.KERNEL, dtypes.void, arg=<Kernel 12 SINK(<Ops.STORE: 45>,) (__add__, __mul__)>, src=(
  UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
    x1:=UOp(Ops.DEVICE, dtypes.void, arg='METAL', src=()),
    UOp(Ops.UNIQUE, dtypes.void, arg=6, src=()),)),
  UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
     x1,
    UOp(Ops.UNIQUE, dtypes.void, arg=1, src=()),)),
  UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
     x1,
    UOp(Ops.UNIQUE, dtypes.void, arg=3, src=()),)),
  UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
     x1,
    UOp(Ops.UNIQUE, dtypes.void, arg=5, src=()),)),))
```

KERNEL describes the compute AST, metadata and memory dependencies.

BUFFER holds a reference to the device memory where the output will be stored.
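The ASSIGN(BUFFER, KERNEL) shape described above can be mimicked with a small pure-Python sketch. It is a toy model under simplifying assumptions, not tinygrad's scheduler: `Node`, `new_buffer` and this `kernelize` function are hypothetical, and the toy fuses every compute op under the root into one kernel.

```python
from dataclasses import dataclass
from itertools import count

# Toy graph node. This is NOT tinygrad's UOp class.
@dataclass(frozen=True)
class Node:
    op: str
    src: tuple = ()
    arg: object = None

_ids = count()

def new_buffer():
    """Create a BUFFER node with a unique id (loosely like the UNIQUE source)."""
    return Node("BUFFER", arg=next(_ids))

def kernelize(root):
    """Toy version: fuse all compute ops under `root` into one KERNEL node."""
    if root.op in ("BUFFER", "ASSIGN"):
        return root  # nothing left to kernelize
    inputs, fused_ops = [], []

    def walk(n):
        if n.op == "BUFFER":
            if n not in inputs:
                inputs.append(n)       # input buffers become kernel edges
        else:
            for s in n.src:
                walk(s)
            fused_ops.append(n.op)     # compute ops are fused into the kernel

    walk(root)
    out_buf = new_buffer()
    kernel = Node("KERNEL", (out_buf, *inputs), arg=tuple(fused_ops))
    return Node("ASSIGN", (out_buf, kernel))

a, b, c = new_buffer(), new_buffer(), new_buffer()
out = Node("ADD", (Node("MUL", (a, b)), c))
assigned = kernelize(out)
output_buffer, kernel = assigned.src
```

Here `kernel.src` has four entries (the output buffer followed by `a`, `b`, `c`), mirroring the four buffer edges in the printed KERNEL above, and `kernel.arg` records the fused ops.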

Once a Tensor is kernelized, all children will LOAD its BUFFER, instead of fusing it:

```py
child = out+2
child.kernelize()
print(child.uop.src[1].arg.ast)
```

```
UOp(Ops.SINK, dtypes.void, arg=None, src=(
  UOp(Ops.STORE, dtypes.void, arg=None, src=(
    UOp(Ops.DEFINE_GLOBAL, dtypes.int.ptr(1), arg=0, src=()),
    x2:=UOp(Ops.VIEW, dtypes.void, arg=ShapeTracker(views=(View(shape=(1,), strides=(0,), offset=0, mask=None, contiguous=True),)), src=()),
    UOp(Ops.ADD, dtypes.int, arg=None, src=(
      UOp(Ops.LOAD, dtypes.int, arg=None, src=(
        UOp(Ops.DEFINE_GLOBAL, dtypes.int.ptr(1), arg=1, src=()),
         x2,)),
      UOp(Ops.CONST, dtypes.int, arg=2, src=(
         x2,)),)),)),))
```
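The difference between fusing a parent and loading its buffer can be sketched with a toy AST printer. This is an illustrative model only: `Node` and `ast` are hypothetical names, and the strings it produces are a simplification of the UOp output shown above.

```python
from dataclasses import dataclass

# Toy graph node. This is NOT tinygrad's UOp class.
@dataclass(frozen=True)
class Node:
    op: str
    src: tuple = ()
    arg: object = None

def ast(node):
    """Render the toy compute AST that a kernel for `node` would contain."""
    if node.op == "BUFFER":
        return f"LOAD(buf{node.arg})"
    if node.op == "ASSIGN":
        return ast(node.src[0])  # kernelized parent: read its output buffer
    if node.op == "CONST":
        return f"CONST({node.arg})"
    return f"{node.op}({', '.join(ast(s) for s in node.src)})"

a, b, c = (Node("BUFFER", arg=i) for i in range(3))
out = Node("ADD", (Node("MUL", (a, b)), c))
two = Node("CONST", arg=2)

# Parent not kernelized: its ops are fused into the child's kernel.
fused_child = ast(Node("ADD", (out, two)))

# Parent kernelized: its graph is ASSIGN(output BUFFER, KERNEL), so the child loads it.
out_kernelized = Node("ASSIGN", (Node("BUFFER", arg=3), Node("KERNEL")))
loaded_child = ast(Node("ADD", (out_kernelized, two)))
```

In the toy, `fused_child` still contains the `MUL` and inner `ADD`, while `loaded_child` contains a single `LOAD` of the parent's output buffer plus the constant, matching the shape of the AST printed above.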

`Tensor.realize` will execute the kernels and write outputs to memory:

```py
Tensor.realize(out)
print(out)        # <Tensor <UOp METAL (1,) int (<Ops.BUFFER: 23>, <buf real:True device:METAL size:1 dtype:dtypes.int offset:0>)> on METAL with grad None>
print(out.item()) # 5
```

<hr />

**Summary**

- The large Tensor graph is built from a mix of data, compute and movement Ops.

- `Tensor.kernelize` splits the Tensor graph into data (BUFFER) and compute (KERNEL), and links their dependencies with ASSIGN.

- `Tensor.realize` executes KERNELs on device and replaces the Tensor graph with just a BUFFER.

- `Tensor.kernelize` can be called multiple times on a Tensor. This lets you build the kernel fusion layout of a large Tensor graph incrementally, without having to call `realize` or `schedule`.