2025-06-13 15:59:36 +09:00

3.6 KiB


Kernel Creation

Tinygrad lazily builds up a graph of Tensor operations. The Tensor graph includes a mix of:

Buffer and Assignment Ops: BUFFER, BUFFER_VIEW, COPY, ASSIGN
Movement Ops: RESHAPE, EXPAND, PERMUTE, PAD, SHRINK, FLIP
Compute Ops: ADD, MUL, REDUCE_AXIS, ...
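
To make the idea of a lazily built graph concrete, here is a minimal sketch in plain Python. It does not use tinygrad: the `Node` class and the `buffer` and `ops_in_order` helpers are hypothetical names chosen for illustration, and the real UOp graph carries far more information (dtype, device, shape).

```python
from dataclasses import dataclass

@dataclass(frozen=True, eq=False)
class Node:
    """One node of a toy lazy graph (hypothetical, not tinygrad's UOp)."""
    op: str              # e.g. "BUFFER", "MUL", "ADD"
    src: tuple = ()      # parent nodes this op reads from
    arg: object = None   # payload, e.g. the value held by a BUFFER

    # Operators only record a new node; nothing is computed here.
    def __mul__(self, other): return Node("MUL", (self, other))
    def __add__(self, other): return Node("ADD", (self, other))

def buffer(value):
    """Create a leaf node standing in for data already on the device."""
    return Node("BUFFER", arg=value)

def ops_in_order(root):
    """List op names in dependency order, visiting each node once."""
    order, seen = [], set()
    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for s in n.src:
            visit(s)
        order.append(n.op)
    visit(root)
    return order

a, b, c = buffer(1), buffer(2), buffer(3)
out = a * b + c            # builds MUL and ADD nodes, runs no arithmetic
print(ops_in_order(out))   # ['BUFFER', 'BUFFER', 'MUL', 'BUFFER', 'ADD']
```

The point of the sketch is only that `a * b + c` returns a description of work rather than a result, which is what lets a later pass decide how to group that work into kernels.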

Tensor.kernelize creates the kernels and buffers needed to realize the output Tensor(s).

Kernelize flow

Let's see how a multiply-add Tensor graph becomes a single fused elementwise kernel.

from tinygrad import Tensor

# initialize 3 input buffers on the device
a = Tensor([1]).realize()
b = Tensor([2]).realize()
c = Tensor([3]).realize()

# create the Tensor graph
mul = a*b
out = mul+c

print(mul) # <Tensor <UOp METAL (1,) int (<Ops.MUL: 48>, None)> on METAL with grad None>
print(out) # <Tensor <UOp METAL (1,) int (<Ops.ADD: 52>, None)> on METAL with grad None>

out.kernelize()

print(mul) # <Tensor <UOp METAL (1,) int (<Ops.MUL: 48>, None)> on METAL with grad None>
print(out) # <Tensor <UOp METAL (1,) int (<Ops.ASSIGN: 66>, None)> on METAL with grad None>

The multiply Tensor stays the same because it is fused. The output Tensor's UOp becomes a new ASSIGN UOp:

print(out.lazydata)

The first source is the output BUFFER:

UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
  UOp(Ops.DEVICE, dtypes.void, arg='METAL', src=()),
  UOp(Ops.UNIQUE, dtypes.void, arg=6, src=()),))

And the second source is the KERNEL and its 4 buffer edges (output_buffer, a, b, c):

UOp(Ops.KERNEL, dtypes.void, arg=<Kernel 12 SINK(<Ops.STORE: 45>,) (__add__, __mul__)>, src=(
  UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
    x1:=UOp(Ops.DEVICE, dtypes.void, arg='METAL', src=()),
    UOp(Ops.UNIQUE, dtypes.void, arg=6, src=()),)),
  UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
     x1,
    UOp(Ops.UNIQUE, dtypes.void, arg=1, src=()),)),
  UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
     x1,
    UOp(Ops.UNIQUE, dtypes.void, arg=3, src=()),)),
  UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
     x1,
    UOp(Ops.UNIQUE, dtypes.void, arg=5, src=()),)),))

KERNEL describes the compute AST, metadata and memory dependencies.

BUFFER holds a reference to the device memory where the output will be stored.
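
The ASSIGN(BUFFER, KERNEL) shape shown above can be imitated with a toy graph. This is a sketch under simplifying assumptions, not tinygrad's scheduler: `Node`, `new_buffer` and this `kernelize` function are hypothetical, and the fusion rule here is simply "fuse every compute op down to the nearest BUFFER".

```python
import itertools

_unique = itertools.count()

class Node:
    """Toy graph node; a stand-in for a UOp, not tinygrad's real class."""
    def __init__(self, op, src=(), arg=None):
        self.op, self.src, self.arg = op, tuple(src), arg
    def __mul__(self, other): return Node("MUL", (self, other))
    def __add__(self, other): return Node("ADD", (self, other))

def new_buffer():
    # Each buffer gets its own id, loosely mirroring the UNIQUE source above.
    return Node("BUFFER", arg=next(_unique))

def kernelize(root):
    """Fuse every compute op under `root` into one KERNEL node (toy rule)."""
    inputs, fused, seen = [], [], set()
    def walk(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        if n.op == "BUFFER":
            inputs.append(n)          # becomes a buffer edge of the kernel
            return
        for s in n.src:
            walk(s)
        fused.append(n.op)            # compute op folded into the kernel
    walk(root)
    out_buf = new_buffer()
    kernel = Node("KERNEL", (out_buf, *inputs), arg=tuple(fused))
    return Node("ASSIGN", (out_buf, kernel))

a, b, c = new_buffer(), new_buffer(), new_buffer()
out = kernelize(a * b + c)
print(out.op, [s.op for s in out.src])      # ASSIGN ['BUFFER', 'KERNEL']
print(len(out.src[1].src), out.src[1].arg)  # 4 ('MUL', 'ADD')
```

As in the dump above, the toy KERNEL ends up with four buffer edges (the output buffer followed by `a`, `b` and `c`), and the output buffer is the same object in both the ASSIGN and the KERNEL sources.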

Once a Tensor is kernelized, all children will LOAD its BUFFER, instead of fusing it:

child = out+2
child.kernelize()
print(child.lazydata.src[1].arg.ast)

UOp(Ops.SINK, dtypes.void, arg=None, src=(
  UOp(Ops.STORE, dtypes.void, arg=None, src=(
    UOp(Ops.DEFINE_GLOBAL, dtypes.int.ptr(1), arg=0, src=()),
    x2:=UOp(Ops.VIEW, dtypes.void, arg=ShapeTracker(views=(View(shape=(1,), strides=(0,), offset=0, mask=None, contiguous=True),)), src=()),
    UOp(Ops.ADD, dtypes.int, arg=None, src=(
      UOp(Ops.LOAD, dtypes.int, arg=None, src=(
        UOp(Ops.DEFINE_GLOBAL, dtypes.int.ptr(1), arg=1, src=()),
         x2,)),
      UOp(Ops.CONST, dtypes.int, arg=2, src=(
         x2,)),)),)),))
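
The effect of that boundary can be sketched with a toy model. The code below is an illustration only: `Node` and this `kernelize` are hypothetical, and the rule "stop at any ASSIGN and read its output buffer" is a simplification of what the text describes.

```python
class Node:
    """Toy graph node (hypothetical stand-in for a UOp)."""
    def __init__(self, op, src=(), arg=None):
        self.op, self.src, self.arg = op, tuple(src), arg
    def _binary(self, op, other):
        if not isinstance(other, Node):
            other = Node("CONST", arg=other)   # wrap Python scalars
        return Node(op, (self, other))
    def __mul__(self, other): return self._binary("MUL", other)
    def __add__(self, other): return self._binary("ADD", other)

def kernelize(root):
    """Wrap the compute ops under `root` in ASSIGN(BUFFER, KERNEL)."""
    loads, fused = [], []
    def walk(n):
        if n.op == "ASSIGN":
            n = n.src[0]              # already kernelized: use its output BUFFER
        if n.op == "BUFFER":
            if n not in loads:
                loads.append(n)       # the kernel will load from this buffer
            return
        for s in n.src:
            walk(s)
        if n.op != "CONST":
            fused.append(n.op)
    walk(root)
    out_buf = Node("BUFFER")
    kernel = Node("KERNEL", (out_buf, *loads), arg=tuple(fused))
    return Node("ASSIGN", (out_buf, kernel))

a, b, c = Node("BUFFER"), Node("BUFFER"), Node("BUFFER")

# Case 1: the parent is kernelized first, so the child only loads its buffer.
out = kernelize(a * b + c)
child = kernelize(out + 2)
print(child.src[1].arg, len(child.src[1].src))   # ('ADD',) 2

# Case 2: the parent is never kernelized, so everything fuses into one kernel.
fused_child = kernelize((a * b + c) + 2)
print(fused_child.src[1].arg, len(fused_child.src[1].src))  # ('MUL', 'ADD', 'ADD') 4
```

In case 1 the child kernel contains a single ADD and reads one input buffer, which corresponds to the one LOAD in the AST above. In case 2 the multiply and both adds land in the same kernel.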

Tensor.realize will execute the kernels and write outputs to memory:

Tensor.realize(out)
print(out)        # <Tensor <UOp METAL (1,) int (<Ops.BUFFER: 23>, <buf real:True device:METAL size:1 dtype:dtypes.int offset:0>)> on METAL with grad None>
print(out.item()) # 5
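
What "execute the kernels and write outputs to memory" means can be shown with a small stand-alone model. The `Buffer`, `Kernel` and `Assign` classes and the `realize` function below are hypothetical simplifications: a kernel here is just a Python function, and "device memory" is an attribute.

```python
class Buffer:
    """Toy device buffer: `data` stays None until a kernel writes to it."""
    def __init__(self, data=None):
        self.data = data

class Kernel:
    """Toy fused kernel: a function plus the nodes it reads from."""
    def __init__(self, fn, inputs):
        self.fn, self.inputs = fn, inputs

class Assign:
    """Links an output buffer to the kernel that will fill it."""
    def __init__(self, buf, kernel):
        self.buf, self.kernel = buf, kernel

def realize(node):
    """Run any pending kernels and return the plain output Buffer."""
    if isinstance(node, Buffer):
        return node                    # already in memory, nothing to run
    args = [realize(i).data for i in node.kernel.inputs]
    node.buf.data = node.kernel.fn(*args)
    return node.buf

a, b, c = Buffer(1), Buffer(2), Buffer(3)
out = Assign(Buffer(), Kernel(lambda x, y, z: x * y + z, [a, b, c]))
child = Assign(Buffer(), Kernel(lambda o: o + 2, [out]))

res = realize(child)                   # runs out's kernel first, then child's
print(out.buf.data, res.data)          # 5 7
```

Realizing `child` forces its dependency `out` to run first, and the value returned is a plain buffer rather than a graph, in the same spirit as the `Ops.BUFFER` result printed above.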

Summary

The large Tensor graph is built from a mix of data, compute and movement Ops.
Tensor.kernelize splits the Tensor graph into data (BUFFER), compute (KERNEL) and links dependencies with ASSIGN.
Tensor.realize executes KERNELs on device and replaces the Tensor graph with just a BUFFER.
Kernelize can be called multiple times on a Tensor. This allows for incrementally building the kernel fusion layout of a large Tensor graph, without having to call realize or schedule.
