Developer

The tinygrad framework has four pieces:

  • a PyTorch-like frontend.
  • a scheduler which breaks the compute into kernels.
  • a lowering engine which converts ASTs into code that can run on the accelerator.
  • an execution engine which can run that code.

Frontend¤

Everything in Tensor is syntactic sugar around function.py, where the forward and backward passes are implemented for the different mlops. There are about 25 of them, implemented using about 20 basic ops. Those basic ops go on to construct a graph of:

LazyBuffer ¤

LazyBuffer(
    device: str,
    st: ShapeTracker,
    dtype: DType,
    op: Optional[Op] = None,
    arg: Any = None,
    srcs: Tuple[LazyBuffer, ...] = (),
    base: Optional[LazyBuffer] = None,
)

The LazyBuffer graph specifies the compute in terms of low-level tinygrad ops. Not all LazyBuffers will actually be realized. There are two types of LazyBuffer, base and view: a base contains compute into a contiguous buffer, and a view is a view of a base (specified by a ShapeTracker). Inputs to a base can be either bases or views; the input to a view can only be a single base.
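
The base/view rule above can be sketched with a toy class. This is only an illustration: `ToyLazyBuffer` and its `reshape` helper are hypothetical names, a plain shape tuple stands in for the ShapeTracker, and none of this is tinygrad's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class ToyLazyBuffer:
    """Hypothetical stand-in for LazyBuffer, not the tinygrad class."""
    shape: Tuple[int, ...]                   # stands in for the ShapeTracker
    op: Optional[str] = None                 # compute op, only meaningful on a base
    srcs: Tuple["ToyLazyBuffer", ...] = ()   # inputs to a base: bases or views
    base: Optional["ToyLazyBuffer"] = None   # set only on a view

    def __post_init__(self):
        if self.base is not None:
            # A view has exactly one input, and that input must be a base.
            assert self.base.is_base, "a view must point at a base, not another view"
            assert self.op is None and not self.srcs, "a view carries no compute"

    @property
    def is_base(self) -> bool:
        return self.base is None

    def reshape(self, shape: Tuple[int, ...]) -> "ToyLazyBuffer":
        # A view of a view is re-rooted onto the underlying base.
        root = self if self.is_base else self.base
        return ToyLazyBuffer(shape, base=root)

a = ToyLazyBuffer((2, 3), op="LOAD")             # base
v = a.reshape((3, 2))                            # view of a
vv = v.reshape((6,))                             # still a view of a, not of v
c = ToyLazyBuffer((3, 2), op="ADD", srcs=(v, v)) # base whose inputs are views
```

In this sketch, chaining views never builds a view-of-a-view chain: `vv.base` is `a`, which keeps the "single base" invariant trivially true.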

Scheduling¤

The scheduler converts the graph of LazyBuffers into a list of ScheduleItem. One ScheduleItem is one kernel on the GPU, and the scheduler is responsible for breaking the large compute graph into subgraphs that can fit in a kernel. ast specifies what compute to run, and bufs specifies what buffers to run it on.

ScheduleItem dataclass ¤

ScheduleItem(
    ast: Tuple[LazyOp, ...], bufs: Tuple[Buffer, ...]
)

inputs property ¤

inputs: Tuple[Buffer, ...]

Read only buffers in the schedule.

outputs property ¤

outputs: Tuple[Buffer, ...]

Read/write or write only buffers in the schedule.
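
As a rough sketch of how inputs and outputs could relate to ast and bufs, the toy class below assumes the written buffers come first in bufs, one per AST entry, with the read-only buffers after them. That ordering is an assumption made for illustration, and `ToyScheduleItem` is a hypothetical name; strings stand in for LazyOp and Buffer.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ToyScheduleItem:
    ast: Tuple[str, ...]   # one entry per kernel output (strings stand in for LazyOp)
    bufs: Tuple[str, ...]  # buffer names stand in for Buffer objects

    @property
    def outputs(self) -> Tuple[str, ...]:
        # ASSUMPTION: the first len(ast) buffers are the ones the kernel writes.
        return self.bufs[:len(self.ast)]

    @property
    def inputs(self) -> Tuple[str, ...]:
        # ASSUMPTION: everything after the outputs is read-only.
        return self.bufs[len(self.ast):]

si = ToyScheduleItem(
    ast=("STORE(out, ADD(LOAD(a), LOAD(b)))",),
    bufs=("out", "a", "b"),
)
print(si.outputs)
print(si.inputs)
```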

Lowering¤

The code in realize lowers each ScheduleItem to an ExecItem with:

lower_schedule ¤

lower_schedule(
    schedule: List[ScheduleItem],
) -> Generator[ExecItem, None, None]
Source code in tinygrad/engine/realize.py
def lower_schedule(schedule:List[ScheduleItem]) -> Generator[ExecItem, None, None]:
  while len(schedule): yield lower_schedule_item(schedule.pop(0))
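
Two properties of this generator are easy to miss: items are lowered lazily, one per `next()`, and the input list is consumed as it goes. The sketch below mirrors the loop with a hypothetical stand-in, `toy_lower_schedule_item`, in place of the real lowering, and strings in place of ScheduleItem and ExecItem.

```python
from typing import Generator, List

def toy_lower_schedule_item(item: str) -> str:
    # Hypothetical stand-in: the real function does codegen and compilation.
    return f"exec({item})"

def toy_lower_schedule(schedule: List[str]) -> Generator[str, None, None]:
    # Same shape as lower_schedule above: pop from the front, lower, yield.
    while len(schedule): yield toy_lower_schedule_item(schedule.pop(0))

schedule = ["k0", "k1", "k2"]
gen = toy_lower_schedule(schedule)

first = next(gen)                       # only k0 has been lowered so far
remaining_after_first = list(schedule)  # k1 and k2 are still waiting
rest = list(gen)                        # drain the generator
```

After the generator is drained, `schedule` is empty, so a caller that needs the schedule again would have to keep its own copy.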

There's a ton of complexity hidden behind this; see the codegen/ directory.

First we lower the AST to UOps, a linear list of the compute to be run. This is where the BEAM search happens.

Then we render the UOps into code with a Renderer, then we compile the code to binary with a Compiler.
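
The three stages (AST to linear list, linear list to source, source to something executable) can be illustrated with a deliberately tiny toy. Everything here is hypothetical: the nested-tuple AST, the `linearize`, `render`, and `compile_kernel` helpers, and the choice to render Python source are assumptions for the sketch. It works on scalars rather than buffers and does not model BEAM search.

```python
from typing import Callable, Dict, List, Tuple

ToyUOp = Tuple[str, str, tuple]  # (dest name, op, args)

def linearize(ast: tuple, uops: List[ToyUOp]) -> str:
    """Flatten a nested-tuple AST into a linear list; returns the dest name."""
    op = ast[0]
    if op in ("LOAD", "CONST"):
        args = (ast[1],)
    else:
        # Sources are emitted first, so every op only refers to earlier entries.
        args = tuple(linearize(src, uops) for src in ast[1:])
    dest = f"v{len(uops)}"
    uops.append((dest, op, args))
    return dest

def render(uops: List[ToyUOp]) -> str:
    """Turn the linear list into Python source (a stand-in for a Renderer)."""
    lines = ["def kernel(bufs):"]
    for dest, op, args in uops:
        if op == "LOAD": rhs = f"bufs[{args[0]!r}]"
        elif op == "CONST": rhs = repr(args[0])
        elif op == "ADD": rhs = f"{args[0]} + {args[1]}"
        elif op == "MUL": rhs = f"{args[0]} * {args[1]}"
        else: raise ValueError(f"unknown op {op}")
        lines.append(f"    {dest} = {rhs}")
    lines.append(f"    return {uops[-1][0]}")
    return "\n".join(lines)

def compile_kernel(src: str) -> Callable[[Dict[str, int]], int]:
    """Stand-in for a Compiler: Python's builtin compile() plus exec()."""
    namespace: dict = {}
    exec(compile(src, "<toy kernel>", "exec"), namespace)
    return namespace["kernel"]

# a + b * 2
ast = ("ADD", ("LOAD", "a"), ("MUL", ("LOAD", "b"), ("CONST", 2)))
uops: List[ToyUOp] = []
linearize(ast, uops)
src = render(uops)
kernel = compile_kernel(src)
result = kernel({"a": 3, "b": 4})
```

Keeping the linear list separate from the rendered source is what lets one list be rendered by different backends; the toy has only one, but `render` is the only function that would need to change.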

Execution¤

Lowering produces an ExecItem, which has a run method:

ExecItem dataclass ¤

ExecItem(prg: Runner, bufs: List[Optional[Buffer]])

bufs instance-attribute ¤

bufs: List[Optional[Buffer]]

prg instance-attribute ¤

prg: Runner

run ¤

run(
    var_vals: Optional[Dict[Variable, int]] = None,
    wait=False,
    jit=False,
    do_update_stats=True,
) -> Optional[float]
Source code in tinygrad/engine/realize.py
def run(self, var_vals:Optional[Dict[Variable, int]]=None, wait=False, jit=False, do_update_stats=True) -> Optional[float]:
  bufs = [cast(Buffer, x) for x in self.bufs] if jit else [cast(Buffer, x).ensure_allocated() for x in self.bufs]
  et = self.prg(bufs, var_vals if var_vals is not None else {}, wait=wait or DEBUG >= 2)
  if do_update_stats:
    GlobalCounters.kernel_count += 1
    GlobalCounters.global_ops += (op_estimate:=sym_infer(self.prg.op_estimate, var_vals))
    GlobalCounters.global_mem += (mem_estimate:=sym_infer(self.prg.mem_estimate, var_vals))
    if et is not None: GlobalCounters.time_sum_s += et
    if DEBUG >= 2:
      ptm = (colored(f"{et*1e3:9.2f}ms", "yellow") if et > 0.01 else f"{et*1e6:9.2f}us") if et is not None else ""
      print(f"{colored(f'*** {self.prg.dname[:7]:7s} {GlobalCounters.kernel_count:4d}', 'magenta' if jit else ('green' if self.prg.first_run else None))} {self.prg.display_name+' '*(38-ansilen(self.prg.display_name))} arg {len(self.bufs):3d} mem {GlobalCounters.mem_used/1e9:5.2f} GB " +  # noqa: E501
            (str() if et is None else f"tm {ptm}/{GlobalCounters.time_sum_s*1e3:9.2f}ms ({op_estimate/((et or 1e-20)*1e9):8.2f} GFLOPS, {mem_estimate/((et or 1e-20)*1e9):7.2f} GB/s)"))  # noqa: E501
    self.prg.first_run = False
  return et
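
Stripped of the debug printing, the control flow of run is: call the program on the buffers, optionally get an elapsed time back, and update global counters. The sketch below reproduces only that skeleton. `ToyExecItem`, `ToyCounters`, and `toy_add_kernel` are hypothetical names, Python lists stand in for Buffer, and the runner is assumed to return None unless asked to wait.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional

class ToyCounters:
    """Hypothetical stand-in for GlobalCounters (only two fields modeled)."""
    kernel_count = 0
    time_sum_s = 0.0

@dataclass
class ToyExecItem:
    prg: Callable[..., Optional[float]]  # stands in for Runner
    bufs: List[list]                     # Python lists stand in for Buffer

    def run(self, wait: bool = False, do_update_stats: bool = True) -> Optional[float]:
        et = self.prg(self.bufs, wait=wait)
        if do_update_stats:
            ToyCounters.kernel_count += 1
            # Time is only accumulated when the runner actually reported one.
            if et is not None: ToyCounters.time_sum_s += et
        return et

def toy_add_kernel(bufs: List[list], wait: bool = False) -> Optional[float]:
    out, a, b = bufs
    st = time.perf_counter()
    out[:] = [x + y for x, y in zip(a, b)]
    # ASSUMPTION: elapsed time is only reported when the caller waits.
    return time.perf_counter() - st if wait else None

out = [0, 0, 0]
ei = ToyExecItem(toy_add_kernel, [out, [1, 2, 3], [10, 20, 30]])
untimed = ei.run()           # returns None, still counted as a kernel
timed = ei.run(wait=True)    # returns the elapsed time in seconds
```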

A list of ExecItems can be condensed into a single ExecItem with the Graph API (rename to Queue?).