
* fix.. speed_limit error... * draw tpms settings. * fix.. traffic light stopping only.. * fix.. waze cam * fix.. waze... * add setting (Enable comma connect ) * auto detect LFA2 * fix.. cruisespeed1 * vff2 driving model. * fix.. * agnos 12.3 * fix.. * ff * ff * test * ff * fix.. drawTurnInfo.. * Update drive_helpers.py * fix.. support eng voice eng sounds fix settings... english fix.. mph.. fix.. roadlimit speed bug.. * new vff model.. 250608 * fix soundd.. * fix safe exit speed.. * fix.. sounds. * fix.. radar timeStep.. * KerryGold model * Update drive_helpers.py * fix.. model. * fix.. * fix.. * Revert "fix.." This reverts commit b09ec459afb855c533d47fd7e8a1a6b1a09466e7. * Revert "fix.." This reverts commit 290bec6b83a4554ca232d531a911edccf94a2156. * fix esim * add more acc table. 10kph * kg update.. * fix cruisebutton mode3 * test atc..cond. * fix.. canfd * fix.. angle control limit
3.6 KiB
Kernel Creation
Tinygrad lazily builds up a graph of Tensor operations. The Tensor graph includes a mix of:
- Buffer and Assignment Ops:
BUFFER
,BUFFER_VIEW
,COPY
,ASSIGN
- Movement Ops:
RESHAPE
,EXPAND
,PERMUTE
,PAD
,SHRINK
,FLIP
- Compute Ops:
ADD
,MUL
,REDUCE_AXIS
, ...
Tensor.kernelize
creates the kernels and buffers needed to realize the output Tensor(s).
Kernelize flow
Let's see how a multiply add Tensor graph becomes a fused elementwise kernel.
# initialize 3 input buffers on the device
a = Tensor([1]).realize()
b = Tensor([2]).realize()
c = Tensor([3]).realize()
# create the Tensor graph
mul = a*b
out = mul+c
print(mul) # <Tensor <UOp METAL (1,) int (<Ops.MUL: 48>, None)> on METAL with grad None>
print(out) # <Tensor <UOp METAL (1,) int (<Ops.ADD: 52>, None)> on METAL with grad None>
out.kernelize()
print(mul) # <Tensor <UOp METAL (1,) int (<Ops.MUL: 48>, None)> on METAL with grad None>
print(out) # <Tensor <UOp METAL (1,) int (<Ops.ASSIGN: 66>, None)> on METAL with grad None>
The multiply Tensor stays the same because it is fused. The output Tensor's UOp becomes a new ASSIGN UOp:
print(out.lazydata)
The first source is the output BUFFER:
UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
UOp(Ops.DEVICE, dtypes.void, arg='METAL', src=()),
UOp(Ops.UNIQUE, dtypes.void, arg=6, src=()),))
And the second source is the KERNEL and its 4 buffer edges (output_buffer, a, b, c):
UOp(Ops.KERNEL, dtypes.void, arg=<Kernel 12 SINK(<Ops.STORE: 45>,) (__add__, __mul__)>, src=(
UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
x1:=UOp(Ops.DEVICE, dtypes.void, arg='METAL', src=()),
UOp(Ops.UNIQUE, dtypes.void, arg=6, src=()),)),
UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
x1,
UOp(Ops.UNIQUE, dtypes.void, arg=1, src=()),)),
UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
x1,
UOp(Ops.UNIQUE, dtypes.void, arg=3, src=()),)),
UOp(Ops.BUFFER, dtypes.int, arg=1, src=(
x1,
UOp(Ops.UNIQUE, dtypes.void, arg=5, src=()),)),))
KERNEL describes the compute AST, metadata and memory dependencies.
BUFFER holds a reference to the device memory where the output will be stored.
Once a Tensor is kernelized, all children will LOAD its BUFFER, instead of fusing it:
child = out+2
child.kernelize()
print(child.lazydata.src[1].arg.ast)
UOp(Ops.SINK, dtypes.void, arg=None, src=(
UOp(Ops.STORE, dtypes.void, arg=None, src=(
UOp(Ops.DEFINE_GLOBAL, dtypes.int.ptr(1), arg=0, src=()),
x2:=UOp(Ops.VIEW, dtypes.void, arg=ShapeTracker(views=(View(shape=(1,), strides=(0,), offset=0, mask=None, contiguous=True),)), src=()),
UOp(Ops.ADD, dtypes.int, arg=None, src=(
UOp(Ops.LOAD, dtypes.int, arg=None, src=(
UOp(Ops.DEFINE_GLOBAL, dtypes.int.ptr(1), arg=1, src=()),
x2,)),
UOp(Ops.CONST, dtypes.int, arg=2, src=(
x2,)),)),)),))
Tensor.realize
will execute the kernels and write outputs to memory:
Tensor.realize(out)
print(out) # <Tensor <UOp METAL (1,) int (<Ops.BUFFER: 23>, <buf real:True device:METAL size:1 dtype:dtypes.int offset:0>)> on METAL with grad None>
print(out.item()) # 5
Summary
-
The large Tensor graph is built from a mix of data, compute and movement Ops.
-
Tensor.kernelize
splits the Tensor graph into data (BUFFER), compute (KERNEL) and links dependencies with ASSIGN. -
Tensor.realize
executes KERNELs on device and replaces the Tensor graph with just a BUFFER. -
Kernelize can be called multiple times on a Tensor. This allows for incrementally building the kernel fusion layout of a large Tensor graph, without having to call
realize
orschedule
.