Abstract—There is a semantic gap between the hardware definition languages used to design and implement hardware and the languages and logics used to formally specify and verify them. Bridging this gap—i.e., constructing formal models from existing hardware artifacts—can be costly, time-consuming, and error prone—and yet utterly necessary if formal verification is to proceed. This work demonstrates that this gap can be collapsed by starting in a pure functional language that is also a hardware description language, and that equational style verifications may be performed directly on the source text of a hardware design, thereby significantly lowering the verification cost for reconfigurable designs. When combined with an efficient compiler, this methodology achieves both good performance and low cost verification.

I. INTRODUCTION

Reconfigurable computing emphasizes a “mix and match” approach to system construction, frequently involving specially tailored “one off” components. Formal methods can provide high confidence that systems obey critical properties (e.g., safety and security), but, by reputation, they can also involve a substantial investment of time and effort. Formal methods may, therefore, seem somewhat antithetical to reconfigurable computing. Can it make economic sense to invest the resources for formal methods on potentially “one off” reconfigurable systems?

The proposed methodology aims to make hardware verification cost effective for reconfigurable designs via a functional programming language that also serves as a hardware description language. The principal hypothesis of this research is that following this methodology can significantly reduce the effort of verifying hardware designs, thereby making formal verification cost effective for reconfigurable computing. The functional language—ReWire [1]—plays a dual rôle for both hardware description and formal specification. We support this hypothesis with a demonstration of the approach in which the stream cipher Salsa20 [2] is implemented efficiently in ReWire and verified using equational reasoning on the implementation source code.

In the functional programming community, equational reasoning about programs frequently goes by the moniker “Bird-Wadler style” (so named for the influential textbook [3]). Functional programmers reason about source programs in an equational style, by replacing equals for equals, making simplifications, induction and coinduction, etc. Equational reasoning is commonly used to justify, among other things, source-to-source transformations and program correctness. This is precisely what we use Bird-Wadler reasoning for in this paper, although, in ReWire, programs are hardware descriptions.

This research demonstrates that formal methods and reconfigurable systems are not antithetical to one another at all. The contributions of this paper are as follows. (1) We describe a methodology for developing high assurance, reconfigurable systems leveraging pure functional languages and equational reasoning. A standard practice in functional programming—Bird-Wadler reasoning—is repurposed to hardware design with this methodology. (2) We introduce an extension to ReWire called Connect Logic, which consists of domain specific language abstractions for hardware devices that support a mixture of functional and structural design styles. (3) Encapsulation of a pipelining structuring technique in Connect Logic is exhibited along with (4) several performant implementations of the Salsa20 stream cipher based on it.

Reconfigurable Salsa20 without ReWire: Consider the following experiment. A hardware designer decides to implement the Salsa20 cipher in hardware. There are a number of good reasons to do so, not the least of which is that reconfigurable hardware can increase the possible throughput compared to a software implementation. The hardware designer uses a tried and true hardware definition language (HDL) like VHDL or Verilog. The implementation path is straightforward—she implements Bernstein’s defining equations [2] in terms of the HDL and performs her usual development process involving synthesis, simulation, and testing.

This first implementation is one step removed from Bernstein’s high-level specification, and, furthermore, is expressed in a language without a formal semantics. So, how does she prove that the first implementation is correct? It becomes clear to the hardware engineer that the first implementation does not suffice: even implemented in the most optimized fashion, it contains too many gates for most FPGAs. So, the hardware engineer produces a second implementation structured in an explicitly pipelined form resulting in a circuit that fits on her FPGA.

Is she all done? Not if formal proof is required that the second implementation is correct. The second implementation is two steps removed from Bernstein’s high-level specification and it is written in a language without a formal semantics. To verify its correctness, where does she even start? She could attempt to verify the implementation by encoding it in the logic of a theorem prover, but observe that this involves yet another translation, and one which is not straightforward. With this approach, how can we be sure that her logical specification faithfully relates Bernstein’s high-level specification to a VHDL implementation?

**Bird-Wadler Provably Correct Development:** To illustrate the formal methodology we advocate for reconfigurable computing, consider first this classic example (p.131, [3]) of Bird-Wadler style equational reasoning in Fig. 1. On the left is the usual recursive definition of the Fibonacci function. It serves as a reference specification defining the meaning of the Fibonacci function, but it has terrible \(O(2^n)\) performance. The other version of the Fibonacci function on the right is in an optimized, “accumulator-passing style” form with \(O(n)\) performance.

The hallmark of Bird-Wadler development is that there is a reference specification (e.g., \(\text{fib}\)) and one or more transformations from it (e.g., into \(\text{fib2}\)) that give rise to an equational verification (e.g., the Fib theorem in Fig. 1). This verification justifies using the optimized version (i.e., replacing \(\text{fib}(n)\) with \(\text{fst}(\text{fib2}(n))\)).
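Fig. 1 is not reproduced in this text, so the following is only a minimal sketch of the kind of development described above. It assumes the standard tupling definition, in which `fib2 n` returns the pair `(fib n, fib (n + 1))`, and the base cases `fib 0 = 0` and `fib 1 = 1`; both are assumptions rather than a transcription of the figure.

```haskell
-- Reference specification: the usual doubly recursive definition
-- (exponential time).
fib :: Integer -> Integer
fib 0 = 0
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)

-- Optimized form (assumed definition): fib2 n = (fib n, fib (n + 1)),
-- computed with a linear number of additions.
fib2 :: Integer -> (Integer, Integer)
fib2 0 = (0, 1)
fib2 n = let (a, b) = fib2 (n - 1) in (b, a + b)

-- Spot-check the equation  fib n = fst (fib2 n)  on small arguments.
-- This is a test, not a proof; the proof proceeds by induction on n.
main :: IO ()
main = print (all (\n -> fib n == fst (fib2 n)) [0 .. 20])
```

Running the program prints `True`. The equational proof itself would unfold the definitions of `fib` and `fib2` in the inductive step, in the replace-equals-for-equals style discussed in the text.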

**Provably Correct Development of Salsa20 with ReWire:** It is precisely the Bird-Wadler style of development that ReWire enables for reconfigurable computing. Fig. 4 presents the hash function from the Salsa20 stream cipher [2] represented in a Haskell-like syntax. We discuss this figure in some detail as well as explain the requisite Haskell syntax in subsequent sections. It suffices to say that Fig. 4 contains a functional program defining the Salsa20 hash function that also serves as the high-level reference specification in the Bird-Wadler development presented in our case study. To render it into a synthesizable form, we add some Connect Logic annotations to produce the ReWire code in Fig. 5. The ReWire compiler can now synthesize a circuit for Salsa20. This new implementation can now be measured in two ways: against standard performance metrics as in Table I or by verifying that it produces the same answers as the reference specification (Theorem 1).

The key difficulty was to go from informal prose vendor documentation, with its often-tantalising ambiguity, to a fully rigorous definition (mechanised in HOL) that one can be reasonably confident is an accurate reflection of the vendor architectures (Intel 64 and IA-32, and AMD64).”

Cryptol [6] is a domain-specific language for specifying, verifying and implementing cryptographic algorithms. Given a cryptographic algorithm, one can specify it in Cryptol, run a number of automatic and semi-automatic proof tools over the specification, and ultimately generate C code implementing the algorithm itself. The current open source version of Cryptol (v.2) does not generate hardware implementations, although a previous proprietary version (v.1) did. ReWire, by contrast, is a subset of Haskell compilable to VHDL and is not restricted to cryptographic algorithms. Salsa20 has been specified in Cryptol v.2, but no effort has been made to port this specification back to Cryptol v.1 and synthesize it.

The usual standards for evaluating hardware architectures and design flows are performance-based metrics (e.g., time and space performance, power usage, etc.). Within the context of mission critical systems, formal analysis and verification are required evaluation modes as well. The Common Criteria for Information Technology Security Evaluation (a.k.a. Common Criteria or CC) is an international standard (ISO/IEC 15408) for computer security certification and the US Federal government mandates following the CC requirements for mission critical systems. The CC sets seven evaluation assurance levels (EAL). The most stringent such level is EAL7, which requires “extensive formal analysis” for applications in “extremely high risk situations and/or where the high value of the assets justifies the higher costs” ensuing from formal verification [7]. For reconfigurable computing to be applied in the space of mission critical systems, cost effective formal methods techniques must be developed. The current research is a step in this direction.

Previous work demonstrated the construction and verification of a secure many-core system in ReWire [1]. The present work, in contrast, demonstrates the expression of a common hardware design pattern (stall-free pipelining) in ReWire and its verification. The emphasis in the former was on the design and implementation of the ReWire language, while the current work focuses on ReWire as a vehicle for hardware verification.
A. Pure Functional Languages & Equational Verification

1) Primer on Haskell/ReWire Syntax: For the sake of being as self-contained as possible, this section presents a quick overview of Haskell—and, hence, ReWire—syntax necessary to understand this paper.

Haskell [8] is a strongly-typed, purely functional language. A Haskell program consists of a number of function and datatype declarations. The type of a function from type \( a \) to type \( b \) is written \( a \rightarrow b \). The type for a tuple with first and second components \( a \) and \( b \), resp., is written \((a, b)\). The fact that a Haskell expression \( e \) has type \( a \) is written \( e :: a \). Haskell has a built-in list type constructor: \([a]\) is the type of all lists of elements of type \( a \). Because of Haskell’s lazy evaluation strategy, lists can have an infinite number of elements; such lists are also called *streams*.

Below are a number of function declarations. The simplest function is the identity function, which takes its argument and simply returns it. Given two functions, \( f \) and \( g \), their composition is written \( f \circ g \). Function application is written either \( g(x) \) or by simple juxtaposition, \( g\ x \). The function map takes two arguments, a function \( f \) and a list \( l \), and applies \( f \) to each element of \( l \), thereby creating a new list. The function drop takes a non-negative integer \( n \) and a list \( l \), and returns the list missing the first \( n \) elements from \( l \). Cons \((:)\) takes an item \( x_0 \) and a list of items and returns a new list with \( x_0 \) on the front. N.b., it is important to distinguish \((::)\), “has type”, from list cons \((:)\).

\[
\begin{align*}
\text{id} x & = x \\
(f \circ g) x & = f(g(x)) \\
\text{map} f [x_0, x_1, \ldots] & = [f(x_0), f(x_1), \ldots] \\
\text{drop}\ n\ [x_0, \ldots, x_{n-1}, x_n, \ldots] & = [x_n, \ldots] \\
x_0 : [x_1, \ldots, x_n] & = [x_0, \ldots, x_n] \\
\text{nth} [x_0, \ldots, x_n] & = x_n \\
\text{fst} (a, b) & = a \\
\text{snd} (a, b) & = b
\end{align*}
\]

We note without proof that, for two non-negative integers, \( n \) and \( m \), it holds that:

\[
\text{drop}\ (n + m)\ l = \text{drop}\ n\ (\text{drop}\ m\ l) \tag{\dagger}
\]
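The functions above are available in Haskell's standard Prelude, so the law (†) can be spot-checked directly. The sketch below tests it on finite prefixes of an infinite stream; the name `dropLaw` and the choice of test ranges are illustrative assumptions, and a test of this kind is of course no substitute for the inductive proof.

```haskell
-- Check  drop (n + m) l == drop n (drop m l)  on a finite prefix
-- of an infinite stream, for given non-negative n and m.
dropLaw :: Int -> Int -> Bool
dropLaw n m = take 5 (drop (n + m) l) == take 5 (drop n (drop m l))
  where
    l = [0 ..] :: [Integer]   -- an infinite list (a "stream")

main :: IO ()
main = do
  print (map (+ 1) [1, 2, 3 :: Int])    -- map applies a function to each element
  print ((negate . (* 2)) (5 :: Int))   -- composition: negate ((* 2) 5)
  print (and [dropLaw n m | n <- [0 .. 10], m <- [0 .. 10]])
```

The three lines printed are `[2,3,4]`, `-10`, and `True`. Lazy evaluation is what makes `take 5` of an infinite stream terminate.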

In Haskell/ReWire, we can introduce new datatypes with the *data* keyword. In the following declarations, Quad and Hex are *type constructors* that, given any type \( a \), construct new types, Quad \( a \) and Hex \( a \), resp. To construct a value of a datatype, apply a *data constructor*; the data constructors below are \( Q \) and \( H \). For example, a value \( Q 1 2 3 4 \) is of type Quad Int; we write this type declaration as \( Q 1 2 3 4 :: \text{Quad Int} \). A Bit is either High or Low.

\[
\begin{align*}
data \; \text{Quad} \; a & = Q \; a \; a \; a \; a \\
data \; \text{Hex} \; a & = H \; a \; a \; a \; a \; a \; a \\
data \; \text{Bit} & = \text{High} \mid \text{Low}
\end{align*}
\]

ReWire has built-in types for words. A 32-bit (128-bit) word belongs to the type \( \text{W32} \) \((\text{W128})\). For example, a value \( Q \; w1 \; w2 \; w3 \; w4 \) of type Quad W32 is nothing more than four 32-bit words.

2) Purity and Equational Verification: Haskell (and, hence, ReWire) is a pure language, which is a critical foundation for equational reasoning. Purity means that the type of a Haskell program faithfully represents its value and behavior. If a Haskell function has type \( \text{Int} \rightarrow \text{Int} \), then the function takes an Int as input and produces an Int as output. Furthermore, we can conclude that the function possesses no side effects whatsoever because, in Haskell, side effects are reflected accurately in the types. The expression \( \text{print}\; \text{"Hello World"} \), for instance, prints Hello World to the prompt and, therefore, \( \text{print}\; \text{"Hello World"} :: \text{IO}\; () \); it produces the nil value, \( () \), which is tagged in its type with IO, meaning it performs input/output in some form.
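The distinction between pure and effectful types can be seen in a few lines of ordinary Haskell. The names `double` and `greet` below are hypothetical and chosen only for illustration.

```haskell
-- A pure function: the type Int -> Int promises that there are no side effects.
double :: Int -> Int
double x = x + x

-- An effectful action: the IO in the type records that it performs input/output.
greet :: IO ()
greet = print "Hello World"

main :: IO ()
main = do
  -- Because double is pure, any occurrence of (double 3) may be replaced
  -- by any other expression equal to it without changing the result.
  print (double 3 + double 3 == 2 * double 3)
  greet
```

The first line printed is `True`. No such substitution argument is available for `greet`, since running it twice is observably different from running it once, and its type says so.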

To prove an equation, \( e = e' \), one starts from \( e \) and “replaces equals for equals” until \( e' \) is reached. In symbols, this proof is \( e = e_1 = e_2 = \cdots = e_n = e' \) in which each step is justified by a known equation \( x = y \), as in “replace \( x \) in \( e_i \) by \( y \) to obtain \( e_{i+1} \)”. Purity supports this style of reasoning because all Haskell expressions are side-effect free, and so they cannot interact unpredictably with the expressions in which they are substituted.
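As a small worked instance of this proof style, the equation \( (\text{id} \circ f)\ x = f\ x \) follows in two steps from the defining equations of composition and identity given in the primer:

\[
\begin{aligned}
(\text{id} \circ f)\ x &= \text{id}\ (f\ x) && \text{(definition of } \circ \text{)} \\
&= f\ x && \text{(definition of id)}
\end{aligned}
\]

Each step replaces the left-hand side of a known equation with its right-hand side, which is the pattern \( e = e_1 = e' \) described above.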

B. Extending ReWire with Connect Logic

This section presents the ReWire operators for the compositional construction of devices from other devices. We refer to these particular operators as “Connect Logic”. Connect Logic enables two or more existing devices to be composed in parallel and connected together. Connect Logic supports a compositional style of hardware design akin to structural VHDL. Formulating the design of a hardware device may be accomplished as in previous work [1] (i.e., without Connect Logic), or, existing devices may be composed with Connect Logic operations into bigger devices.

There is a type constructor \( \text{Dev} \) for synchronous devices in ReWire. There are three basic architectural constructors that Connect Logic adds to the ReWire language. The first, \( \text{iter} \), constructs a synchronous device from a pure function from inputs to outputs. The second, \texttt{<\&>}, composes two devices in parallel. The third, \( \text{refold} \), is a recursion operator that is used to interconnect devices and/or express feedback loops (i.e., feed back device outputs to inputs).
1) Types for Devices: There is one basic unit of Connect Logic, devices, for which we introduce the following type: \texttt{Dev} \texttt{i o} for any types \texttt{i} and \texttt{o}. A term of type, \texttt{Dev} \texttt{i o}, represents a clocked computation that, for each clock cycle, takes an input of type \texttt{i}, produces an output of type \texttt{o}, and may possess internal storage. We eschew the formal definition of \texttt{Dev} as it is unnecessary to understanding Connect Logic and its uses. Device \texttt{d} is clocked, as illustrated in the inset figure. The clock is represented by the underlying structure of \texttt{Dev} \texttt{i o}, rather than as an explicit parameter. A device is created in Connect Logic by either iterating a function or through composition of existing devices. We introduce operators for constructing devices and composing them into larger, interconnected devices. All Connect Logic operations are \texttt{constructors} for \texttt{Dev}, meaning that they are functions producing \texttt{Dev} \texttt{i o} values for some \texttt{i} and \texttt{o} types.

2) Iteration: The most basic Connect Logic constructor, \texttt{iter}, iterates a pure function of type \texttt{i} \to \texttt{o}, producing an output corresponding to the input at each clock cycle. The Haskell definition of \texttt{iter} is as follows:

\begin{verbatim}
iter :: (i -> o) -> o -> Dev i o
iter f o = do i <- signal o
              iter f (f i)
\end{verbatim}

Fig. 2(a) illustrates the device created with the \texttt{iter} operation. The type declaration above means that \texttt{iter} is a device constructor that takes a function from inputs \texttt{i} to outputs \texttt{o} and an initial output value and constructs a corresponding device. The device (\texttt{iter f o}) will, at the first clock cycle, return output \texttt{o} and, in the next clock cycle after consuming an input \texttt{i}, will produce a new output, (\texttt{f i}). This pattern repeats recursively ad infinitum. The (\texttt{signal o}) operator outputs its argument \texttt{o} and returns the next input. The definition of the (\texttt{iter f o}) constructor above may be read as (1) output \texttt{o} (i.e., \texttt{signal o}), (2) receive the next input (i.e., \texttt{i <- signal o}), and then (3) repeat the pattern with new “initial” output (\texttt{f i}).
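The behavior just described can be approximated in plain Haskell. The sketch below is a toy model and not ReWire's definition: it assumes a hypothetical datatype `Dev` that pairs the current output with a function awaiting the next input, and a simplified `feed` that runs a device over a list of inputs.

```haskell
-- Toy model (an assumption, not ReWire's Dev): a device holds its current
-- output and a continuation that consumes the next input.
data Dev i o = Dev o (i -> Dev i o)

-- iter f o emits o first; after consuming input i it emits (f i), and so on.
iter :: (i -> o) -> o -> Dev i o
iter f o = Dev o (\i -> iter f (f i))

-- Run a device on a list of inputs, collecting one output per clock cycle.
feed :: [i] -> Dev i o -> [o]
feed []       (Dev o _) = [o]
feed (i : is) (Dev o k) = o : feed is (k i)

main :: IO ()
main = print (take 4 (feed [1 ..] (iter (* 2) 0)))
```

This prints `[0,2,4,6]`: the initial output `0`, followed by the doubled inputs `1`, `2`, and `3`, each appearing one cycle after it was consumed.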

3) Parallelism: Parallelism is expressed with the device constructor, \texttt{<\&>}, that composes two existing devices, \texttt{d1} and \texttt{d2}, into a single device, \texttt{d1 <\&> d2}, in which both devices operate in parallel and in isolation from one another. N.b., we are assuming, here and elsewhere, that both arguments \texttt{d1} and \texttt{d2} are non-terminating. The type declaration of \texttt{<\&>} is:

\begin{verbatim}
(<&>) :: Dev i_1 o_1 ->
         Dev i_2 o_2 ->
         Dev (i_1,i_2) (o_1,o_2)
\end{verbatim}

We omit its Haskell definition as doing so would require an unnecessary excursion into Haskell’s syntax and semantics. Fig. 2(b) presents a pictorial version of \texttt{d1 <\&> d2}. The type signature of \texttt{<\&>} means that the input and output types of constructed device \texttt{d1 <\&> d2} are pairs of the inputs and outputs of \texttt{d1} and \texttt{d2}, resp. Both subdevices \texttt{d1} and \texttt{d2} are isolated from one another in \texttt{d1 <\&> d2}; i.e., there is no intercommunication or shared state between them. Such interaction may be added explicitly using the \texttt{refold} operator described below. The parallelism operator may be generalized to arbitrary numbers of devices (i.e., beyond two), but, for lack of space, we only present the simplest case.
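Although the ReWire definition is omitted, the intended behavior of parallel composition can be illustrated in a toy model. The sketch below assumes a hypothetical datatype `Dev` (current output plus a continuation awaiting the next input) and is not ReWire's implementation; it only mirrors the type signature given above.

```haskell
-- Toy model (an assumption, not ReWire's Dev).
data Dev i o = Dev o (i -> Dev i o)

iter :: (i -> o) -> o -> Dev i o
iter f o = Dev o (\i -> iter f (f i))

-- Parallel composition: each side sees only its own component of the
-- paired input, so the two devices cannot influence one another.
(<&>) :: Dev i1 o1 -> Dev i2 o2 -> Dev (i1, i2) (o1, o2)
Dev o1 k1 <&> Dev o2 k2 = Dev (o1, o2) (\(i1, i2) -> k1 i1 <&> k2 i2)

-- Run a device on a list of inputs, collecting one output per clock cycle.
feed :: [i] -> Dev i o -> [o]
feed []       (Dev o _) = [o]
feed (i : is) (Dev o k) = o : feed is (k i)

main :: IO ()
main = print (feed [(1, 'a'), (2, 'b')] (iter (+ 1) 0 <&> iter succ 'z'))
```

This prints `[(0,'z'),(2,'b'),(3,'c')]`: the left component evolves exactly as `iter (+ 1) 0` would on its own, and likewise for the right component.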

4) Interdevice Communication & Feedback: Interconnections between devices are made using another device-level operator, \texttt{refold}. The \texttt{refold} operator can be used to connect sub-devices within its third argument and to hide internal connections as well. The use of \texttt{refold} is illustrated in Fig. 2(c). Given a device \texttt{d :: Dev i_1 o_1} and two pure functions, \texttt{out :: o_1 -> o_2} and \texttt{conn :: o_1 -> i_2 -> i_1}, \texttt{refold out conn d} is a new device with the following behavior. Given an external input \texttt{i'} and the current output \texttt{o} of the internal device \texttt{d}, the new input to \texttt{d} is \texttt{conn o i'} and the new external output is \texttt{out o}. The type of \texttt{refold} is:

\begin{verbatim}
refold :: (o_1 -> o_2) ->
        (o_1 -> i_2 -> i_1) ->
        Dev i_1 o_1 ->
        Dev i_2 o_2
\end{verbatim}
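To make the role of the two function arguments concrete, here is a sketch of `refold` in a toy model. The datatype `Dev` and the definition of `refold` below are assumptions chosen to match the type signature above, not ReWire's implementation. The example feeds a device's own output back into its input to build a running-sum accumulator.

```haskell
-- Toy model (an assumption, not ReWire's Dev).
data Dev i o = Dev o (i -> Dev i o)

iter :: (i -> o) -> o -> Dev i o
iter f o = Dev o (\i -> iter f (f i))

-- out reshapes the visible output; conn combines the inner device's current
-- output with the external input to form the inner device's next input.
refold :: (o1 -> o2) -> (o1 -> i2 -> i1) -> Dev i1 o1 -> Dev i2 o2
refold out conn (Dev o k) =
  Dev (out o) (\i -> refold out conn (k (conn o i)))

-- Run a device on a list of inputs, collecting one output per clock cycle.
feed :: [i] -> Dev i o -> [o]
feed []       (Dev o _) = [o]
feed (i : is) (Dev o k) = o : feed is (k i)

-- Feedback loop: the register's output is added to each external input.
main :: IO ()
main = print (feed [1, 2, 3, 4] (refold id (+) (iter id 0)))
```

This prints `[0,1,3,6,10]`. The inner device `iter id 0` is a plain register; all of the accumulator's behavior comes from the routing function passed to `refold`.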

5) Defining a Pipeline: The form of pipeline we consider is a simple one, namely stall-free pipelines, in which the output from a stage flows directly into the input of the next stage. It is possible to define more complex pipelines (e.g., instruction pipelines that stall, etc.) with Connect Logic, but we leave that subject for a follow-on publication.

Stall-free pipelines (henceforth simply “pipelines”) have the flavor of functional composition, and the architectural combinators of ReWire allow the formalization of this intuition. For functions, \texttt{f_j}, of appropriate type, the composition, \( f_n \circ \cdots \circ f_2 \circ f_1 \), resembles a pipeline. Of course, this ignores the timing aspect of a pipeline. In ReWire, we can express this pipeline, along with its timing, as the following:

\[ \text{iter } f_1\ o_1 \rightsquigarrow \cdots \rightsquigarrow \text{iter } f_n\ o_n \]

where \texttt{f_j :: a_j -> a_{j+1}} are pure functions from input of type \texttt{a_j} to output of type \texttt{a_{j+1}} and each \texttt{o_j :: a_{j+1}} is the initial output value produced by pipeline stage \texttt{iter f_j o_j}. The \(\rightsquigarrow\) combinator chains the stages together, connecting the output of the $j$th stage to the input of the $(j+1)$th stage. The combinators for pipelining, etc., are defined below.

Note that $\rightsquigarrow$ is not syntactic sugar for function composition. For example, while it is true that $\text{id} \circ f = f$, it is also the case that $\text{iter id o1} \rightsquigarrow \text{iter f o2} \neq \text{iter f o2}$. The LHS of this inequality is a two stage pipeline while the RHS is a one stage pipeline. The outputs both pipelines produce will be related, of course.

Given two devices, $d_1$ and $d_2$, the ReWire code for connecting them in pipelined sequence is below. This construction is illustrated in Fig. 2(d). The two devices are first placed unconnected in parallel (i.e., $d_1\ \texttt{<\&>}\ d_2 :: \text{Dev}\ (a, b)\ (b, c)$) and, in this context, both devices operate in isolation. The combined device consumes a single input of type $(a, b)$ and produces a single output of type $(b, c)$. The output type for $(d_1 \rightsquigarrow d_2)$ is $c$; i.e., the second component of the output tuple of $d_1\ \texttt{<\&>}\ d_2$. The external input (of type $a$) to $(d_1 \rightsquigarrow d_2)$ is passed to the subdevice $d_1$ and the output of $d_1$ to the input of $d_2$; thus the routing function pipe is as defined below:

\[
\begin{aligned}
d_1 \rightsquigarrow d_2 &= \text{refold}\ \text{snd}\ \text{pipe}\ (d_1\ \texttt{<\&>}\ d_2) \\
\text{pipe}\ (o_1, o_2)\ i &= (i, o_1)
\end{aligned}
\]

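The pipelining construction described above can be exercised in a toy model. Everything in the sketch below is an assumption made for illustration: the datatype `Dev`, the definitions of `<&>` and `refold`, and the ASCII spelling `~>` for the pipelining combinator. The routing function `pipe` follows the prose: the external input goes to the first stage and the first stage's output goes to the second.

```haskell
-- Toy model (an assumption, not ReWire's Dev).
data Dev i o = Dev o (i -> Dev i o)

iter :: (i -> o) -> o -> Dev i o
iter f o = Dev o (\i -> iter f (f i))

(<&>) :: Dev i1 o1 -> Dev i2 o2 -> Dev (i1, i2) (o1, o2)
Dev o1 k1 <&> Dev o2 k2 = Dev (o1, o2) (\(i1, i2) -> k1 i1 <&> k2 i2)

refold :: (o1 -> o2) -> (o1 -> i2 -> i1) -> Dev i1 o1 -> Dev i2 o2
refold out conn (Dev o k) =
  Dev (out o) (\i -> refold out conn (k (conn o i)))

-- Pipelining (hypothetical ASCII name): run both stages in parallel, expose
-- only the second stage's output, and route stage-one output into stage two.
(~>) :: Dev a b -> Dev b c -> Dev a c
d1 ~> d2 = refold snd pipe (d1 <&> d2)
  where
    pipe (o1, _) i = (i, o1)

feed :: [i] -> Dev i o -> [o]
feed []       (Dev o _) = [o]
feed (i : is) (Dev o k) = o : feed is (k i)

main :: IO ()
main = do
  print (take 5 (feed [1 ..] (iter (+ 1) 0 ~> iter (* 2) 0)))  -- [0,0,4,6,8]
  print (take 5 (feed [1 ..] (iter id 0 ~> iter (* 2) 0)))     -- [0,0,2,4,6]
  print (take 5 (feed [1 ..] (iter (* 2) 0)))                  -- [0,2,4,6,8]
```

The last two lines illustrate the remark that the combinator is not function composition: prefixing an identity stage yields the same values as the one-stage device, but delayed by one clock cycle.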
B. Salsa20 Iterative Implementation

A diagrammatic view of the circuit produced is found in Fig. 8(a). Synthesis estimates of resource usage and FMax for s1s20dev are in Table I.

There is one functional unit performing the doubleround operation. It operates for ten cycles to produce an answer. When the inputs to the device s1s20dev are \([(\text{High}, n), (\text{Low}, n_0), \ldots, (\text{Low}, n_9), \ldots]\), then, on the cycle with input \((\text{Low}, n_9)\), the output will be salsa20 \(n\). The High bit signifies that the device should start hashing \(n\). The \((\text{Low}, n')\) input signifies that \(n'\) should be ignored and that the iteration should continue.

C. Pipelining Salsa20

The numbers for the iterative device are reasonable, but the structure of the cipher algorithm would indicate that there is room for improvement. There is an apparent performance gap with this approach: nine cycles of the device do not yield useful output. Pipelining our base components together gives us a way to keep our performance characteristics with respect to clock speed roughly the same while enabling our device to be productive on every clock cycle. We do so by placing ten different passthru \((k)\) dblrd devices in sequence, connecting their inputs and outputs together to obtain pipe10 in Fig. 6.

A twenty stage pipeline may be created by increasing the granularity of each stage. Now, instead of staging each doubleround as before, each component columnround and rowround is staged (see Fig. 7).

V. EVALUATING PROVABLY CORRECT SALSA20 DEVICES

This section evaluates the devices created in the previous section according to two modes: performance and verification. The devices synthesized by the ReWire compiler exhibit performance comparable to a previously published, hand-optimized design [11]. We sketch the verification of a general theorem that characterizes the correctness of the pipelining transformation applied in Section IV-C.

In this section, we sketch the verification of the pipelining transformation defined in Section IV. There is a function of the following type that serves to run a device on a stream of inputs: feed :: [i] -> Dev i o -> [o]. For a stream of inputs is :: [i] and a device d :: Dev i o, (feed is d) is the stream of outputs created by running the device d on is. N.b., feed preserves the order of outputs with respect to inputs; i.e., if \(i\) is the \(n^{th}\) input in is, then the \((n + 1)^{st}\) item in (feed is d) was produced by d on \(i\). We omit the definition of feed.
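As an illustration of this indexing convention, and assuming that \(\text{iter}\) emits its initial output on the first cycle as described in the section on iteration, feeding a stream to a one-stage device would give:

\[
\text{feed}\ [i_1, i_2, i_3, \ldots]\ (\text{iter}\ f\ o) = [o,\ f\ i_1,\ f\ i_2,\ f\ i_3,\ \ldots]
\]

Here the \(n^{th}\) input \(i_n\) accounts for the \((n+1)^{st}\) output \(f\ i_n\), with the initial value \(o\) occupying the first position.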

A. Performance

We evaluated the performance of the VHDL generated from our high level specifications by synthesizing it using Xilinx ISE targeting a Kintex 7 FPGA (xc7k160t-3fbg676). The synthesis results detailed in Table I show an increase in throughput and resource utilization as we pipeline that is in line with intuitive expectations. The 10-stage pipeline and the iterative implementation are the same design core replicated tenfold. We observe a nearly tenfold increase of flip-flop usage and a notable increase in LUT usage (likely impacted by optimizations in the synthesis tools). In the 20-stage pipeline, we divide our basic unit into separate rowRound and columnRound pipeline stages. This introduces some additional LUT usage, but doubles flip-flop (slice) usage because the number of stages in the pipeline is doubled. The maximum frequency of the 20-stage pipeline increases by a factor of approximately 1.7, which indicates a doubling effect from doubling the pipeline with a moderate amount of overhead. These numbers demonstrate that our approach is competitive with similar work in the area of synthesizing Salsa20 [11] on modern FPGAs.
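The throughput column of Table I appears consistent with a simple calculation, sketched here under two assumptions that are not stated in this section: each completed hash yields one 512-bit Salsa20 output block, and the iterative device completes one block every \(c = 10\) cycles while the pipelined devices complete one block per cycle (\(c = 1\)).

\[
T = \frac{F_{\max} \times 512\ \text{bit}}{c}
\qquad
\begin{aligned}
\text{Iterative:} &\quad 99.4\ \text{MHz} \times 512 / 10 \approx 5.1\ \text{Gbit/s} \\
\text{10 Stage:} &\quad 97.5\ \text{MHz} \times 512 / 1 \approx 49.9\ \text{Gbit/s} \\
\text{20 Stage:} &\quad 167.4\ \text{MHz} \times 512 / 1 \approx 85.7\ \text{Gbit/s}
\end{aligned}
\]

Under these assumptions, the tenfold throughput gain of the 10-stage pipeline comes entirely from producing a result on every cycle, while the further gain of the 20-stage pipeline comes from the higher clock frequency.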
B. Testing the Iterative Salsa20 Device Automatically

We used the QuickCheck tool [12] to test the putative correctness of the relationship between the reference specification salsa20 and the iterative ReWire definition s1s20dev (from Figs. 4 and 5, resp.). Below is a Bool-valued function, \texttt{test}, that takes a W128 nonce \texttt{n} as input and evaluates an equation. Note that the value of input stream \texttt{is} is of the form

\[ [(\text{High}, n), (\text{Low}, \text{undefined}), (\text{Low}, \text{undefined}), \cdots] \]

where \texttt{undefined} is a special “don’t care” constant built into Haskell.

\begin{verbatim}
test :: W128 -> Bool
test n = reference == iterative
  where
    reference = salsa20 n
    iterative = nth (feed is s1s20dev)
    is        = (High, n) : repeat (Low, undefined)
\end{verbatim}

QuickCheck can generate random inputs to test \texttt{test} and, if \texttt{test} returns \texttt{True} for each input, then QuickCheck reports that the tests were passed; below is a transcript of running QuickCheck on this correctness condition for \texttt{s1s20dev}.

\begin{verbatim}
GHCi, version 7.10.1: http://www.haskell.org/ghc/  :? for help
[1 of 1] Compiling Salsa20          ( Salsa20.hs, interpreted )
Ok, modules loaded: Salsa20.
*Salsa20> quickCheck test
+++ OK, passed 100 tests.
\end{verbatim}

The correctness condition is neatly summed up in the following theorem (stated without proof):

**Theorem 1** (Correctness of Iterative Salsa20). For all nonces \(n, n_0, \ldots, n_9\) :: W128, assume input stream \texttt{is} has the form

\[ [(\text{High}, n), (\text{Low}, n_0), \cdots, (\text{Low}, n_9), \cdots] \]

Then, the following equation holds: \texttt{salsa20 n = nth (feed is s1s20dev)}.

C. Verification of Pipelining

1) Lemmas: This section states the Lemmas used in proving the correctness of pipelining (Theorem 2 below). Each lemma is left unproven, although we describe the intuitive meaning of each.

Lemma 1 says that the pipelining operator is associative. The associativity of \(\rightsquigarrow\) allows for “parentheses to be dropped”: i.e., \( f \rightsquigarrow g \rightsquigarrow h \) can stand for either the right- or left-hand side of the equation in the lemma.

**Lemma 1** (Associativity). The \(\rightsquigarrow\) operation is associative.

\[ f \rightsquigarrow (g \rightsquigarrow h) = (f \rightsquigarrow g) \rightsquigarrow h \]

<table>
<thead>
<tr>
<th></th>
<th>LUTs</th>
<th>Slices</th>
<th>Fmax (MHz)</th>
<th>T (Gbit/s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Iterative</td>
<td>3459</td>
<td>651</td>
<td>99.4</td>
<td>5.1</td>
</tr>
<tr>
<td>10 Stage</td>
<td>22840</td>
<td>6019</td>
<td>97.5</td>
<td>49.9</td>
</tr>
<tr>
<td>20 Stage</td>
<td>25519</td>
<td>12309</td>
<td>167.4</td>
<td>85.7</td>
</tr>
</tbody>
</table>

**TABLE I**: Resource usage, Fmax, and throughput (T) of the Salsa20 algorithm as implemented and compiled in ReWire.
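The throughput column of Table I appears consistent with a simple calculation, sketched below. The assumptions are ours and are not stated in the table: each Salsa20 block is 512 bits (the core function produces 64-byte outputs), the pipelined designs complete one block per clock cycle, and the iterative design completes one block every 10 cycles. The names \texttt{blockBits}, \texttt{throughput}, \texttt{round1}, and \texttt{table} are hypothetical.

```haskell
-- Assumed block size in bits (64-byte Salsa20 output block).
blockBits :: Double
blockBits = 512

-- Throughput in Gbit/s from clock frequency in MHz and the assumed
-- number of clock cycles needed per output block.
throughput :: Double -> Double -> Double
throughput fmaxMHz cyclesPerBlock = fmaxMHz * blockBits / cyclesPerBlock / 1000

-- Round to one decimal place, as in the table.
round1 :: Double -> Double
round1 x = fromIntegral (round (x * 10) :: Integer) / 10

-- Recomputed T column under the assumptions above.
table :: [(String, Double)]
table =
  [ ("Iterative", round1 (throughput 99.4 10))
  , ("10 Stage",  round1 (throughput 97.5 1))
  , ("20 Stage",  round1 (throughput 167.4 1))
  ]
```

Under these assumptions the recomputed values are 5.1, 49.9, and 85.7 Gbit/s, matching the T column.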

Lemma 2 relates stages in a pipeline of devices created with \texttt{iter}. The left-hand side below performs \( f \) and \( g \) in successive stages. The right-hand side performs both \( f \) and \( g \) in the first stage and the identity function in the second. N.B., the right-hand side is not identical to \( \text{iter } (g \circ f)\ (g\ o_2) \), because the former has two stages while the latter has one.

**Lemma 2.** Let \( g :: b \rightarrow c, f :: a \rightarrow b, o_1 :: c, o_2 :: b. \) Then, we have:

\[
\begin{aligned}
& \text{iter } f\ o_2 \rightsquigarrow \text{iter } g\ o_1 \\
={}& \text{iter } (g \circ f)\ (g\ o_2) \rightsquigarrow \text{iter } \text{id}\ o_1
\end{aligned}
\]

Lemma 3 relates \texttt{feed} and \( \rightsquigarrow \) on infinite streams. It gives a condition under which a pipeline may be shortened by one stage.

**Lemma 3.** Let \( l \) be an infinite stream and \( \varphi :: \text{Dev } o \). Then:

\[ \text{feed } l\ (\varphi \rightsquigarrow \text{iter } \text{id}\ o) = o : \text{feed } l\ \varphi \]
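The three lemmas can be sanity-checked on finite prefixes in a simplified model. The sketch below is an assumption-laden illustration, not ReWire's semantics: it identifies a device with the stream function it induces, writes the pipelining operator as the hypothetical ASCII operator \texttt{\textasciitilde>}, and assumes that \texttt{iter f o} emits \texttt{o} first and then \texttt{f} applied to each input one cycle later.

```haskell
-- Assumed denotational model: a device is the stream function it induces.
newtype Dev i o = Dev ([i] -> [o])

feed :: [i] -> Dev i o -> [o]
feed l (Dev d) = d l

-- iter f o: emit the initial output o, then f of each input, one cycle late.
iter :: (i -> o) -> o -> Dev i o
iter f o = Dev (\l -> o : map f l)

-- Pipelining: the left device's outputs drive the right device.
infixr 5 ~>
(~>) :: Dev a b -> Dev b c -> Dev a c
Dev d1 ~> Dev d2 = Dev (d2 . d1)

-- Devices are compared observationally on their first k outputs.
agree :: Eq o => Int -> [i] -> Dev i o -> Dev i o -> Bool
agree k l d e = take k (feed l d) == take k (feed l e)

-- Lemma 1: associativity of pipelining.
lemma1 :: [Int] -> Bool
lemma1 l = agree 50 l (f ~> (g ~> h)) ((f ~> g) ~> h)
  where
    f = iter (+ 1) 0
    g = iter (* 2) 7
    h = iter (subtract 3) 9

-- Lemma 2: fusing two stages into one stage followed by an identity stage.
lemma2 :: [Int] -> Bool
lemma2 l = agree 50 l (iter f o2 ~> iter g o1) (iter (g . f) (g o2) ~> iter id o1)
  where
    f  = (+ 1)
    g  = (* 2)
    o1 = 7
    o2 = 5

-- Lemma 3: a trailing identity stage only delays the stream by one output.
lemma3 :: [Int] -> Bool
lemma3 l = take 50 (feed l (phi ~> iter id o)) == take 50 (o : feed l phi)
  where
    phi = iter (+ 1) 0
    o   = 42
```

Each property takes the input stream as an argument, so it can be applied to an infinite stream such as \texttt{[0..]} or handed to a random tester. Passing such checks gives evidence for the lemmas in this model only; it is not a proof and says nothing about the actual ReWire definitions.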

2) Correctness Theorem: The following theorem says that feeding an n-stage pipeline a stream of inputs is the same as mapping a composite function across those inputs, as long as the first \( n \) outputs are ignored.

**Theorem 2** (Correctness of Pipelining). Assuming that \( f = f_1 \circ \cdots \circ f_n \) and that \( l \) is an infinite stream, then:

\[ \text{map } f\ l = \text{drop } n\ (\text{feed } l\ (\text{iter } f_n\ o_n \rightsquigarrow \cdots \rightsquigarrow \text{iter } f_1\ o_1)) \]

\[ \square \]

**Proof:**

First, define: \( F_0 = \text{id} \) and \( F_{i+1} = F_i \circ f_{i+1} \). Observe that, by Lemmas 1 and 2 (\( n-1 \) times),

\[
\begin{aligned}
& \text{iter } f_n\ o_n \rightsquigarrow \cdots \rightsquigarrow \text{iter } f_1\ o_1 \\
={}& \text{iter } F_n\ (F_{n-1}\ o_n) \rightsquigarrow \text{iter } \text{id}\ (F_{n-2}\ o_{n-1}) \rightsquigarrow \cdots \rightsquigarrow \text{iter } \text{id}\ (F_0\ o_1) \\
& \qquad \{ f = F_n,\ F_0 = \text{id} \} \\
={}& \text{iter } f\ (F_{n-1}\ o_n) \rightsquigarrow \text{iter } \text{id}\ (F_{n-2}\ o_{n-1}) \rightsquigarrow \cdots \rightsquigarrow \text{iter } \text{id}\ o_1
\end{aligned}
\]

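
To see how a pipeline rewritten in this way relates to \( \text{map } f\ l \), the case \( n = 2 \) can be worked by hand. This is only a sketch: besides Lemmas 2 and 3, it assumes one fact that is not stated in the text above, namely that \( \text{feed } l\ (\text{iter } f\ o) = o : \text{map } f\ l \) for an infinite stream \( l \).

\[
\begin{aligned}
& \text{feed } l\ (\text{iter } f_2\ o_2 \rightsquigarrow \text{iter } f_1\ o_1) \\
={}& \text{feed } l\ (\text{iter } (f_1 \circ f_2)\ (f_1\ o_2) \rightsquigarrow \text{iter } \text{id}\ o_1) && \{ \text{Lemma 2} \} \\
={}& o_1 : \text{feed } l\ (\text{iter } f\ (f_1\ o_2)) && \{ \text{Lemma 3},\ f = f_1 \circ f_2 \} \\
={}& o_1 : f_1\ o_2 : \text{map } f\ l && \{ \text{assumption on iter} \}
\end{aligned}
\]

Dropping the first two elements leaves \( \text{map } f\ l \), which agrees with the statement of Theorem 2 for \( n = 2 \).
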
VI. SUMMARY AND CONCLUSIONS

This paper considered the provably correct development of several reconfigurable designs and implementations of the Salsa20 stream cipher. The vehicle for this development is the ReWire language. ReWire is a sublanguage of the pure functional language Haskell and, as such, possesses a rigorous semantics that supports formal verification. Functional languages are generally quite expressive, and, consequently, the Salsa20 specifications in ReWire were quickly produced, concise, comprehensible, and elegant. Connect Logic, a previously unpublished part of ReWire, supports a structural style of development in a functional HDL. Connect Logic was key to rapidly prototyping Salsa20 in ReWire, especially in the introduction of pipelining optimizations to the specifications.

It is commonplace for hardware engineers to “think in diagrams”. Any circuit or device specification will include a diagram depicting the high-level structure of the device. This diagrammatic abstraction is used as an informal guide for comprehending the design. But how do we express such structural notions in a functional language-based HDL like ReWire? To this end, we introduced an extension to ReWire called Connect Logic that encapsulates the diagrammatic style directly in the syntax of ReWire. This paper defines Connect Logic and illustrates its use with a case study of the construction of an efficient, pipelined hardware design and implementation of the Salsa20 stream cipher. Furthermore, and more to the point, we verify the correctness of this device through equational reasoning on the ReWire source text.

New language abstractions are not typically cost free. There is usually some performance trade-off, and language implementers attempt to minimize such overheads. Furthermore, new abstractions tend to be more useful in some situations than in others. The Salsa20 cipher was chosen as a test for ReWire to evaluate (1) how well cryptographic algorithms can be expressed in ReWire and (2) what performance trade-off, if any, arises relative to carefully hand-optimized implementations. The performance of the synthesized ReWire devices (shown in Table I) was quite good. Although no published numbers on hand-optimized implementations of Salsa20 afford direct comparison with our results, the achieved performance was in line with the only relevant publication in the area [11]. Question (1) concerns what is, admittedly, more of an aesthetic issue than a measurable quantity. Still, it is safe to say that the Salsa20 specifications in ReWire would be readily comprehensible to those with experience in functional programming.

More importantly, a clear advantage of the ReWire methodology is that the artifacts we produced were verified in the manner of ordinary functional programs directly on the text of the design. This is a point worth emphasizing: verification of ReWire programs takes place on the program itself. Because VHDL has no mathematical semantics, artifacts produced in VHDL (or in Verilog for that matter) would require an additional step in which the formal specification of the device would be encoded by hand in the logic of a theorem prover [4]. This hand-encoding is fraught with the potential for error as well as being quite time-consuming.

REFERENCES