clash-prelude-1.8.1: Clash: a functional hardware description language - Prelude library
Copyright(C) 2013-2016 University of Twente
2016-2017 Myrtle Software Ltd
2017 Google Inc.
2021-2023 QBayLogic B.V.
2022 Google Inc.
LicenseBSD2 (see the file LICENSE)
MaintainerQBayLogic B.V. <devops@qbaylogic.com>
Safe HaskellTrustworthy
LanguageHaskell2010
Extensions
  • Cpp
  • MonoLocalBinds
  • TemplateHaskellQuotes
  • QuasiQuotes
  • ScopedTypeVariables
  • BangPatterns
  • ViewPatterns
  • GADTs
  • GADTSyntax
  • DataKinds
  • InstanceSigs
  • StandaloneDeriving
  • DeriveDataTypeable
  • DeriveFunctor
  • DeriveTraversable
  • DeriveFoldable
  • DeriveGeneric
  • DefaultSignatures
  • DeriveAnyClass
  • DeriveLift
  • DerivingStrategies
  • MagicHash
  • KindSignatures
  • PostfixOperators
  • TupleSections
  • TypeOperators
  • ExplicitNamespaces
  • ExplicitForAll
  • BinaryLiterals
  • TypeApplications

Clash.Explicit.BlockRam

Description

Block RAM primitives

Using RAMs

We will show a rather elaborate example on how you can, and why you might want to use block RAMs. We will build a "small" CPU + Memory + Program ROM where we will slowly evolve to using block RAMs. Note that the code is not meant as a de-facto standard on how to do CPU design in Clash.

We start with the definition of the Instructions, Register names and machine codes:

{-# LANGUAGE RecordWildCards, TupleSections, DeriveAnyClass #-}

module CPU where

import Clash.Explicit.Prelude

type InstrAddr = Unsigned 8
type MemAddr   = Unsigned 5
type Value     = Signed 8

data Instruction
  = Compute Operator Reg Reg Reg
  | Branch Reg Value
  | Jump Value
  | Load MemAddr Reg
  | Store Reg MemAddr
  | Nop
  deriving (Eq, Show, Generic, NFDataX)

data Reg
  = Zero
  | PC
  | RegA
  | RegB
  | RegC
  | RegD
  | RegE
  deriving (Eq, Show, Enum, Generic, NFDataX)

data Operator = Add | Sub | Incr | Imm | CmpGt
  deriving (Eq, Show, Generic, NFDataX)

data MachCode
  = MachCode
  { inputX  :: Reg
  , inputY  :: Reg
  , result  :: Reg
  , aluCode :: Operator
  , ldReg   :: Reg
  , rdAddr  :: MemAddr
  , wrAddrM :: Maybe MemAddr
  , jmpM    :: Maybe Value
  }

nullCode =
  MachCode
    { inputX = Zero
    , inputY = Zero
    , result = Zero
    , aluCode = Imm
    , ldReg = Zero
    , rdAddr = 0
    , wrAddrM = Nothing
    , jmpM = Nothing
    }

Next we define the CPU and its ALU:

cpu
  :: Vec 7 Value          -- ^ Register bank
  -> (Value,Instruction)  -- ^ (Memory output, Current instruction)
  -> ( Vec 7 Value
     , (MemAddr, Maybe (MemAddr,Value), InstrAddr)
     )
cpu regbank (memOut, instr) =
  (regbank', (rdAddr, (,aluOut) <$> wrAddrM, bitCoerce ipntr))
 where
  -- Current instruction pointer
  ipntr = regbank !! PC

  -- Decoder
  (MachCode {..}) = case instr of
    Compute op rx ry res -> nullCode {inputX=rx,inputY=ry,result=res,aluCode=op}
    Branch cr a          -> nullCode {inputX=cr,jmpM=Just a}
    Jump a               -> nullCode {aluCode=Incr,jmpM=Just a}
    Load a r             -> nullCode {ldReg=r,rdAddr=a}
    Store r a            -> nullCode {inputX=r,wrAddrM=Just a}
    Nop                  -> nullCode

  -- ALU
  regX   = regbank !! inputX
  regY   = regbank !! inputY
  aluOut = alu aluCode regX regY

  -- next instruction
  nextPC =
    case jmpM of
      Just a | aluOut /= 0 -> ipntr + a
      _                    -> ipntr + 1

  -- update registers
  regbank' = replace Zero   0
           $ replace PC     nextPC
           $ replace result aluOut
           $ replace ldReg  memOut
           $ regbank

alu Add   x y = x + y
alu Sub   x y = x - y
alu Incr  x _ = x + 1
alu Imm   x _ = x
alu CmpGt x y = if x > y then 1 else 0

We initially create a memory out of simple registers:

dataMem
  :: KnownDomain dom
  => Clock dom
  -> Reset dom
  -> Enable dom
  -> Signal dom MemAddr
  -- ^ Read address
  -> Signal dom (Maybe (MemAddr,Value))
  -- ^ (write address, data in)
  -> Signal dom Value
  -- ^ data out
dataMem clk rst en rd wrM =
  mealy clk rst en dataMemT (replicate d32 0) (bundle (rd,wrM))
 where
  dataMemT mem (rd,wrM) = (mem',dout)
    where
      dout = mem !! rd
      mem' =
        case wrM of
          Just (wr,din) -> replace wr din mem
          _             -> mem

And then connect everything:

system
  :: ( KnownDomain dom
     , KnownNat n )
  => Vec n Instruction
  -> Clock dom
  -> Reset dom
  -> Enable dom
  -> Signal dom Value
system instrs clk rst en = memOut
 where
  memOut = dataMem clk rst en rdAddr dout
  (rdAddr,dout,ipntr) = mealyB clk rst en cpu (replicate d7 0) (memOut,instr)
  instr  = asyncRom instrs <$> ipntr

Create a simple program that calculates the GCD of 4 and 6:

-- Compute GCD of 4 and 6
prog = -- 0 := 4
       Compute Incr Zero RegA RegA :>
       replicate d3 (Compute Incr RegA Zero RegA) ++
       Store RegA 0 :>
       -- 1 := 6
       Compute Incr Zero RegA RegA :>
       replicate d5 (Compute Incr RegA Zero RegA) ++
       Store RegA 1 :>
       -- A := 4
       Load 0 RegA :>
       -- B := 6
       Load 1 RegB :>
       -- start
       Compute CmpGt RegA RegB RegC :>
       Branch RegC 4 :>
       Compute CmpGt RegB RegA RegC :>
       Branch RegC 4 :>
       Jump 5 :>
       -- (a > b)
       Compute Sub RegA RegB RegA :>
       Jump (-6) :>
       -- (b > a)
       Compute Sub RegB RegA RegB :>
       Jump (-8) :>
       -- end
       Store RegA 2 :>
       Load 2 RegC :>
       Nil

And test our system:

>>> sampleN 32 $ system prog systemClockGen resetGen enableGen
[0,0,0,0,0,0,4,4,4,4,4,4,4,4,6,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,2]

to see that our system indeed calculates that the GCD of 6 and 4 is 2.

Improvement 1: using asyncRam

As you can see, it's fairly straightforward to build a memory using registers and read (!!) and write (replace) logic. This might however not result in the most efficient hardware structure, especially when building an ASIC.

Instead it is preferable to use the asyncRam function which has the potential to be translated to a more efficient structure:

system2
  :: ( KnownDomain dom
     , KnownNat n )
  => Vec n Instruction
  -> Clock dom
  -> Reset dom
  -> Enable dom
  -> Signal dom Value
system2 instrs clk rst en = memOut
 where
  memOut = asyncRam clk clk en d32 rdAddr dout
  (rdAddr,dout,ipntr) = mealyB clk rst en cpu (replicate d7 0) (memOut,instr)
  instr  = asyncRom instrs <$> ipntr

Again, we can simulate our system and see that it works. This time however, we need to disregard the first few output samples, because the initial content of an asyncRam is undefined, and consequently, the first few output samples are also undefined. We use the utility function printX to conveniently filter out the undefinedness and replace it with the string "undefined" in the first few leading outputs.

>>> printX $ sampleN 32 $ system2 prog systemClockGen resetGen enableGen
[undefined,undefined,undefined,undefined,undefined,undefined,4,4,4,4,4,4,4,4,6,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,2]

Improvement 2: using blockRam

Finally we get to using blockRam. On FPGAs, asyncRam will be implemented in terms of LUTs, and therefore take up logic resources. FPGAs also have large(r) memory structures called block RAMs, which are preferred, especially as the memories we need for our application get bigger. The blockRam function will be translated to such a block RAM.

One important aspect of block RAMs is that they have a synchronous read port, meaning unlike an asyncRam, the result of a read command given at time t is output at time t + 1.

For us that means we need to change the design of our CPU. Right now, upon a load instruction we generate a read address for the memory, and the value at that read address is immediately available to be put in the register bank. We will be using a block RAM, so the value is delayed until the next cycle. Thus, we will also need to delay the register address to which the memory address is loaded:

cpu2
  :: (Vec 7 Value,Reg)    -- ^ (Register bank, Load reg addr)
  -> (Value,Instruction)  -- ^ (Memory output, Current instruction)
  -> ( (Vec 7 Value, Reg)
     , (MemAddr, Maybe (MemAddr,Value), InstrAddr)
     )
cpu2 (regbank, ldRegD) (memOut, instr) =
  ((regbank', ldRegD'), (rdAddr, (,aluOut) <$> wrAddrM, bitCoerce ipntr))
 where
  -- Current instruction pointer
  ipntr = regbank !! PC

  -- Decoder
  (MachCode {..}) = case instr of
    Compute op rx ry res -> nullCode {inputX=rx,inputY=ry,result=res,aluCode=op}
    Branch cr a          -> nullCode {inputX=cr,jmpM=Just a}
    Jump a               -> nullCode {aluCode=Incr,jmpM=Just a}
    Load a r             -> nullCode {ldReg=r,rdAddr=a}
    Store r a            -> nullCode {inputX=r,wrAddrM=Just a}
    Nop                  -> nullCode

  -- ALU
  regX   = regbank !! inputX
  regY   = regbank !! inputY
  aluOut = alu aluCode regX regY

  -- next instruction
  nextPC =
    case jmpM of
      Just a | aluOut /= 0 -> ipntr + a
      _                    -> ipntr + 1

  -- update registers
  ldRegD'  = ldReg  -- Delay the ldReg by 1 cycle
  regbank' = replace Zero   0
           $ replace PC     nextPC
           $ replace result aluOut
           $ replace ldRegD memOut
           $ regbank

We can now finally instantiate our system with a blockRam:

system3
  :: ( KnownDomain dom
     , KnownNat n )
  => Vec n Instruction
  -> Clock dom
  -> Reset dom
  -> Enable dom
  -> Signal dom Value
system3 instrs clk rst en = memOut
 where
  memOut = blockRam clk en (replicate d32 0) rdAddr dout
  (rdAddr,dout,ipntr) = mealyB clk rst en cpu2 ((replicate d7 0),Zero) (memOut,instr)
  instr  = asyncRom instrs <$> ipntr

We are, however, not done. We will also need to update our program. The reason being that values that we try to load in our registers won't be loaded into the register until the next cycle. This is a problem when the next instruction immediately depends on this memory value. In our example, this was only the case when we loaded the value 6, which was stored at address 1, into RegB. Our updated program is thus:

prog2 = -- 0 := 4
       Compute Incr Zero RegA RegA :>
       replicate d3 (Compute Incr RegA Zero RegA) ++
       Store RegA 0 :>
       -- 1 := 6
       Compute Incr Zero RegA RegA :>
       replicate d5 (Compute Incr RegA Zero RegA) ++
       Store RegA 1 :>
       -- A := 4
       Load 0 RegA :>
       -- B := 6
       Load 1 RegB :>
       Nop :> -- Extra NOP
       -- start
       Compute CmpGt RegA RegB RegC :>
       Branch RegC 4 :>
       Compute CmpGt RegB RegA RegC :>
       Branch RegC 4 :>
       Jump 5 :>
       -- (a > b)
       Compute Sub RegA RegB RegA :>
       Jump (-6) :>
       -- (b > a)
       Compute Sub RegB RegA RegB :>
       Jump (-8) :>
       -- end
       Store RegA 2 :>
       Load 2 RegC :>
       Nil

When we simulate our system we see that it works. This time again, we need to disregard the first sample, because the initial output of a blockRam is undefined. We use the utility function printX to conveniently filter out the undefinedness and replace it with the string "undefined".

>>> printX $ sampleN 34 $ system3 prog2 systemClockGen resetGen enableGen
[undefined,0,0,0,0,0,0,4,4,4,4,4,4,4,4,6,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,2]

This concludes the short introduction to using blockRam.

Synopsis

Block RAM synchronized to an arbitrary clock

blockRam Source #

Arguments

:: (KnownDomain dom, HasCallStack, NFDataX a, Enum addr, NFDataX addr) 
=> Clock dom

Clock to synchronize to

-> Enable dom

Enable line

-> Vec n a

Initial content of the BRAM, also determines the size, n, of the BRAM

NB: MUST be a constant

-> Signal dom addr

Read address r

-> Signal dom (Maybe (addr, a))

(write address w, value to write)

-> Signal dom a

Value of the BRAM at address r from the previous clock cycle

Create a block RAM with space for n elements

  • NB: Read value is delayed by 1 cycle
  • NB: Initial output value is undefined, reading it will throw an XException

See also:

  • See Clash.Explicit.BlockRam for more information on how to use a block RAM.
  • Use the adapter readNew for obtaining write-before-read semantics like this: readNew clk rst en (blockRam clk inits) rd wrM.
  • A large Vec for the initial content may be too inefficient, depending on how it is constructed. See blockRamFile and blockRamBlob for different approaches that scale well.

Example

Expand
bram40
  :: Clock  dom
  -> Enable  dom
  -> Signal dom (Unsigned 6)
  -> Signal dom (Maybe (Unsigned 6, Bit))
  -> Signal dom Bit
bram40 clk en = blockRam clk en (replicate d40 1)

blockRamPow2 Source #

Arguments

:: (KnownDomain dom, HasCallStack, NFDataX a, KnownNat n) 
=> Clock dom

Clock to synchronize to

-> Enable dom

Enable line

-> Vec (2 ^ n) a

Initial content of the BRAM

NB: MUST be a constant

-> Signal dom (Unsigned n)

Read address r

-> Signal dom (Maybe (Unsigned n, a))

(Write address w, value to write)

-> Signal dom a

Value of the BRAM at address r from the previous clock cycle

Create a block RAM with space for 2^n elements

  • NB: Read value is delayed by 1 cycle
  • NB: Initial output value is undefined, reading it will throw an XException

See also:

Example

Expand
bram32
  :: Clock dom
  -> Enable dom
  -> Signal dom (Unsigned 5)
  -> Signal dom (Maybe (Unsigned 5, Bit))
  -> Signal dom Bit
bram32 clk en = blockRamPow2 clk en (replicate d32 1)

blockRamU Source #

Arguments

:: forall n dom a r addr. (KnownDomain dom, HasCallStack, NFDataX a, Enum addr, NFDataX addr, 1 <= n) 
=> Clock dom

Clock to synchronize to

-> Reset dom

Reset line. This needs to be asserted for at least n cycles in order for the BRAM to be reset to its initial state.

-> Enable dom

Enable line

-> ResetStrategy r

Whether to clear BRAM on asserted reset (ClearOnReset) or not (NoClearOnReset). The reset needs to be asserted for at least n cycles to clear the BRAM.

-> SNat n

Number of elements in BRAM

-> (Index n -> a)

If applicable (see ResetStrategy argument), reset BRAM using this function

-> Signal dom addr

Read address r

-> Signal dom (Maybe (addr, a))

(write address w, value to write)

-> Signal dom a

Value of the BRAM at address r from the previous clock cycle

A version of blockRam that has no default values set. May be cleared to an arbitrary state using a reset function.

blockRam1 Source #

Arguments

:: forall n dom a r addr. (KnownDomain dom, HasCallStack, NFDataX a, Enum addr, NFDataX addr, 1 <= n) 
=> Clock dom

Clock to synchronize to

-> Reset dom

Reset line. This needs to be asserted for at least n cycles in order for the BRAM to be reset to its initial state.

-> Enable dom

Enable line

-> ResetStrategy r

Whether to clear BRAM on asserted reset (ClearOnReset) or not (NoClearOnReset). The reset needs to be asserted for at least n cycles to clear the BRAM.

-> SNat n

Number of elements in BRAM

-> a

Initial content of the BRAM (replicated n times)

-> Signal dom addr

Read address r

-> Signal dom (Maybe (addr, a))

(write address w, value to write)

-> Signal dom a

Value of the BRAM at address r from the previous clock cycle

A version of blockRam that is initialized with the same value on all memory positions

Read/write conflict resolution

readNew Source #

Arguments

:: (KnownDomain dom, NFDataX a, Eq addr) 
=> Clock dom 
-> Reset dom 
-> Enable dom 
-> (Signal dom addr -> Signal dom (Maybe (addr, a)) -> Signal dom a)

The BRAM component

-> Signal dom addr

Read address r

-> Signal dom (Maybe (addr, a))

(Write address w, value to write)

-> Signal dom a

Value of the BRAM at address r from the previous clock cycle

Create a read-after-write block RAM from a read-before-write one

True dual-port block RAM

A true dual-port block RAM has two fully independent, fully functional access ports: port A and port B. Either port can do both RAM reads and writes. These two ports can even be on distinct clock domains, but the memory itself is shared between the ports. This also makes a true dual-port block RAM suitable as a component in a domain crossing circuit (but it needs additional logic for it to be safe, see e.g. asyncFIFOSynchronizer).

A version with implicit clocks can be found in Clash.Prelude.BlockRam.

trueDualPortBlockRam Source #

Arguments

:: forall nAddrs domA domB a. (HasCallStack, KnownNat nAddrs, KnownDomain domA, KnownDomain domB, NFDataX a) 
=> Clock domA

Clock for port A

-> Clock domB

Clock for port B

-> Signal domA (RamOp nAddrs a)

RAM operation for port A

-> Signal domB (RamOp nAddrs a)

RAM operation for port B

-> (Signal domA a, Signal domB a)

Outputs data on next cycle. When writing, the data written will be echoed. When reading, the read data is returned.

Produces vendor-agnostic HDL that will be inferred as a true dual-port block RAM

Any value that is being written on a particular port is also the value that will be read on that port, i.e. the same-port read/write behavior is: WriteFirst. For mixed-port read/write, when port A writes to the address port B reads from, the output of port B is undefined, and vice versa.

data RamOp n a Source #

Port operation

Constructors

RamRead (Index n)

Read from address

RamWrite (Index n) a

Write data to address

RamNoOp

No operation

Instances

Instances details
Show a => Show (RamOp n a) Source # 
Instance details

Defined in Clash.Explicit.BlockRam

Methods

showsPrec :: Int -> RamOp n a -> ShowS Source #

show :: RamOp n a -> String Source #

showList :: [RamOp n a] -> ShowS Source #

Generic (RamOp n a) Source # 
Instance details

Defined in Clash.Explicit.BlockRam

Associated Types

type Rep (RamOp n a) :: Type -> Type Source #

Methods

from :: RamOp n a -> Rep (RamOp n a) x Source #

to :: Rep (RamOp n a) x -> RamOp n a Source #

NFDataX a => NFDataX (RamOp n a) Source # 
Instance details

Defined in Clash.Explicit.BlockRam

Methods

deepErrorX :: String -> RamOp n a Source #

hasUndefined :: RamOp n a -> Bool Source #

ensureSpine :: RamOp n a -> RamOp n a Source #

rnfX :: RamOp n a -> () Source #

type Rep (RamOp n a) Source # 
Instance details

Defined in Clash.Explicit.BlockRam

type Rep (RamOp n a) = D1 ('MetaData "RamOp" "Clash.Explicit.BlockRam" "clash-prelude-1.8.1-inplace" 'False) (C1 ('MetaCons "RamRead" 'PrefixI 'False) (S1 ('MetaSel ('Nothing :: Maybe Symbol) 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 (Index n))) :+: (C1 ('MetaCons "RamWrite" 'PrefixI 'False) (S1 ('MetaSel ('Nothing :: Maybe Symbol) 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 (Index n)) :*: S1 ('MetaSel ('Nothing :: Maybe Symbol) 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 a)) :+: C1 ('MetaCons "RamNoOp" 'PrefixI 'False) (U1 :: Type -> Type)))

Internal

blockRam# Source #

Arguments

:: forall dom a n. (KnownDomain dom, HasCallStack, NFDataX a) 
=> Clock dom

Clock to synchronize to

-> Enable dom

Enable line

-> Vec n a

Initial content of the BRAM, also determines the size, n, of the BRAM

NB: MUST be a constant

-> Signal dom Int

Read address r

-> Signal dom Bool

Write enable

-> Signal dom Int

Write address w

-> Signal dom a

Value to write (at address w)

-> Signal dom a

Value of the BRAM at address r from the previous clock cycle

blockRAM primitive

blockRamU# Source #

Arguments

:: forall n dom a. (KnownDomain dom, HasCallStack, NFDataX a) 
=> Clock dom

Clock to synchronize to

-> Enable dom

Enable line

-> SNat n

Number of elements in BRAM

-> Signal dom Int

Read address r

-> Signal dom Bool

Write enable

-> Signal dom Int

Write address w

-> Signal dom a

Value to write (at address w)

-> Signal dom a

Value of the BRAM at address r from the previous clock cycle

blockRAMU primitive

blockRam1# Source #

Arguments

:: forall n dom a. (KnownDomain dom, HasCallStack, NFDataX a) 
=> Clock dom

Clock to synchronize to

-> Enable dom

Enable line

-> SNat n

Number of elements in BRAM

-> a

Initial content of the BRAM (replicated n times)

-> Signal dom Int

Read address r

-> Signal dom Bool

Write enable

-> Signal dom Int

Write address w

-> Signal dom a

Value to write (at address w)

-> Signal dom a

Value of the BRAM at address r from the previous clock cycle

blockRAM1 primitive

trueDualPortBlockRam# Source #

Arguments

:: forall nAddrs domA domB a. (HasCallStack, KnownNat nAddrs, KnownDomain domA, KnownDomain domB, NFDataX a) 
=> Clock domA 
-> Signal domA Bool

Enable

-> Signal domA Bool

Write enable

-> Signal domA (Index nAddrs)

Address

-> Signal domA a

Write data

-> Clock domB 
-> Signal domB Bool

Enable

-> Signal domB Bool

Write enable

-> Signal domB (Index nAddrs)

Address

-> Signal domB a

Write data

-> (Signal domA a, Signal domB a) 

Primitive for trueDualPortBlockRam