# Performance Analysis of High Speed Radix-4 Booth Encoders in CMOS Technology

Nader Sharifi Gharabaghlo<sup>1\*</sup>, Tohid Moradi Khaneshan<sup>1</sup>

1- Urumi Graduate Institute, Urmia, West Azarbaijan, Iran. Email: m.n.sharifi@urumi.ac.ir (Corresponding author) Email: t.moradi@urmia.ac.ir

Received: May 2018

Revised: September 2018

Accepted: January 2019

# **ABSTRACT:**

This review paper analyzes the performance of published circuit-level realizations of radix-4 Booth encoder/decoders. Starting from a brief overview of the Booth algorithm, the conventional truth table is discussed. Subsequently, the modifications that led to circuit-level implementations are presented, along with a complete comparative analysis of the selected works. Simulations using HSPICE for TSMC 0.18µm CMOS technology and a 1.8V power supply have been performed to compare these works. With the required optimizations applied to the mentioned works, it can be deduced that a delay of 1.5 XOR gate levels is reachable for the radix-4 Booth encoding scheme while the output waveforms remain free of glitches. The optimized version of the Booth encoder has been embedded in a 16×16 bit parallel multiplier whose latency, measured from post-layout simulations, is 1992ps, which demonstrates the high potential of the chosen radix-4 Booth encoding scheme for use in high speed parallel multipliers.

KEYWORDS: Parallel Multiplier, Booth Encoder, Radix-4, High Speed.

# 1. INTRODUCTION

Although the history of binary numbers and their mathematical operations spans more than a century [1], the hardware implementation of such systems using Integrated Circuits (ICs) does not go back more than 30 years. Among the operations on binary numbers, multiplication has drawn a lot of interest because of its importance in today's communication systems, DSP cores and computers [2]. Nowadays, parallel multipliers play a significant role in such high speed systems and, in most cases, the critical path delay that determines the speed of the whole architecture belongs to the multiplier [3].

The most important factor in the speed enhancement of parallel multipliers is the multiplication algorithm employed in the system [4]. Depending on the size of the numbers to be multiplied, different algorithms have been employed. Some of the important ones are the Wallace tree, the Dadda tree and the Booth encoding scheme, whose mathematical origins date from the middle of the 20<sup>th</sup> century. In 1951 Andrew Booth introduced his algorithm for multiplication of two signed numbers [5]. After that, Chris Wallace [6] and Luigi Dadda [7] published their procedures in 1964 and 1965, respectively. Although there were other algorithms for fast multiplication, these three became popular because of their simplicity of hardware implementation, and most modern multipliers are based on one of them.

To design a parallel multiplier, three main building blocks must be cascaded [8] as follows:

- 1) Partial Product Generation Block (PPGB).
- 2) Partial Product Reduction Tree (PPRT).
- 3) Final Adder.

Considering speed performance, a comparative analysis demonstrates that the Wallace algorithm performs better than the Dadda structure [9]. On the other hand, neither of these methods can outperform the Booth algorithm in the design of the first stage. As a result, the first block is mainly implemented by means of a Booth encoding system, and Fig. 1 illustrates the general architecture of such a multiplier. For the second stage, the Wallace tree and 4-2 compressors are the best-known candidates, and for the final summation stage the Carry Select Adder (CSA) has usually been employed, along with other addition procedures.

The demand for high speed signal processing systems on one hand and the necessity for power reduction in such systems on the other constitute the most important challenges in the design of parallel multipliers [10]. Although work on high speed multipliers started in the early 90s [11], their design still has its own problems, and many system optimizations have been carried out to overcome these difficulties.



Fig. 1. General architecture of a Booth Multiplier.

One of the important issues for the speed of a parallel multiplier is the first stage, where the partial products are generated. Although many structures for hardware implementation of the Booth encoder/decoder have been reported in the literature [12-17], research is still in progress to achieve an optimum architecture that can operate at high frequencies while its power and active area consumption are both low.

In this paper, the best reported works are analyzed with regard to the truth tables they use for Partial Product Generation (PPG), so that their advantages and drawbacks can be evaluated by the reader. For better comparison, all of the selected circuits are redesigned and simulated by HSPICE using TSMC 0.18µm CMOS technology and a 1.8V power supply. As a final step, the optimized Booth encoder/decoder, which has a delay of 1.5 XOR gate levels, is embedded in a 16×16 bit parallel multiplier and the latency from inputs to outputs is measured, which demonstrates that the proposed Booth encoding scheme can be widely employed in high speed multiplication systems.

The paper is organized as follows. In section 2 the radix-4 Booth encoding is briefly discussed, while section 3 analyzes the best reported works and compares them based on simulation results. Section 4 presents the optimized architecture for the Booth encoder, followed by the design of the $16 \times 16$ bit multiplier, and finally the conclusions are summarized in section 5.

#### Vol. 13, No. 3, September 2019

# 2. RADIX-4 BOOTH ALGORITHM

Booth's multiplication algorithm, introduced by Andrew Donald Booth in 1951, is the procedure by which two signed binary numbers in two's complement notation are multiplied [5]. In simple words, it recodes the multiplier Y by examining adjacent pairs of its bits. To start the recoding process, a notation must be defined that covers the digit values 1, 0 and -1. This is known as the Signed Digit (SD) encoding system, where 1 and 0 remain unchanged while -1 is written as $\bar{1}$ [18].

Considering *Y* as multiplier and *X* as multiplicand, if one desires to demonstrate the multiplication routine of  $X \times Y$  using Booth algorithm, *Y* must be rewritten in SD encoding system just as follows [19]:

$$Y = -y_{n-1}2^{n-1} + \sum_{i=0}^{n-2} y_i 2^i$$
(1)

Where *n* indicates the number of bits representing *Y* and $y_{n-1}$ denotes the sign bit. Rearranging (1) gives:

$$Y = \sum_{i=0}^{n-1} (y_{i-1} - y_i) 2^i$$
(2)

In which $y_{-1} = 0$. A close look at (2) reveals that the multiplier Y is recoded into the scale factors (-1, 0, 1), which justifies the use of the SD system. To produce these factors, the multiplier bits are examined in overlapping groups of 2 bits and, with the help of the truth table in Table 1, the encoding is carried out.

| Y <sub>i</sub> | $Y_{i-1}$ | Partial Product |
|----------------|-----------|-----------------|
| 0              | 0         | 0×Multiplicand  |
| 0              | 1         | 1×Multiplicand  |
| 1              | 0         | -1×Multiplicand |
| 1              | 1         | 0×Multiplicand  |

 Table 1. Radix-2 Booth encoding truth table.
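The recoding summarized in Table 1 can be checked numerically. The following is a minimal sketch, not taken from the cited works; the function names are hypothetical and the multiplier is assumed to be an *n*-bit two's complement integer. Each digit is formed as $y_{i-1} - y_i$, which reproduces the rows of Table 1, and the weighted sum of the digits recovers the original value.

```python
def to_bits(value, n):
    """Return the n-bit two's complement bits of value, LSB first."""
    return [(value >> i) & 1 for i in range(n)]


def booth_radix2_digits(value, n):
    """Recode an n-bit two's complement multiplier into digits in {-1, 0, 1}.

    Digit i is y_{i-1} - y_i with y_{-1} = 0, which matches Table 1:
    (y_i, y_{i-1}) = (0, 1) gives +1 and (1, 0) gives -1.
    """
    digits = []
    prev = 0  # y_{-1} = 0
    for bit in to_bits(value, n):
        digits.append(prev - bit)
        prev = bit
    return digits


def digits_to_value(digits):
    """Evaluate sum(d_i * 2**i) over the recoded digits."""
    return sum(d * 2 ** i for i, d in enumerate(digits))
```

For instance, `booth_radix2_digits(6, 4)` recodes 0110 into the digits `[0, -1, 0, 1]` (LSB first), that is, 8 - 2 = 6.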

This process, known as radix-2 Booth encoding, employs 3 types of operations to generate partial products. Based on Table 1, these operations are:

- 1) For 00 and 11, the partial product is 0 times the multiplicand.
- 2) For 01, the partial product is the multiplicand itself.
- 3) For 10, the multiplicand must be negated (two's complemented).

Although PPG appears to become very simple, this algorithm has two main disadvantages which severely limit its use in parallel multipliers, and careful consideration is needed to make it applicable to hardware implementation of multipliers [20]. These drawbacks are:

- 1) For isolated 1s, the algorithm becomes inefficient.
- 2) The number of adding and subtracting operations is variable.
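Both drawbacks can be seen by counting the nonzero recoded digits, each of which costs one addition or subtraction. The sketch below is illustrative only; the helper names and the two 8-bit test patterns are assumptions chosen for the example.

```python
def booth_radix2_digits(value, n):
    """Radix-2 Booth digits y_{i-1} - y_i (LSB first), with y_{-1} = 0."""
    digits, prev = [], 0
    for i in range(n):
        bit = (value >> i) & 1
        digits.append(prev - bit)
        prev = bit
    return digits


def nonzero_operations(value, n):
    """Number of add/subtract steps implied by the radix-2 recoding."""
    return sum(1 for d in booth_radix2_digits(value, n) if d != 0)


ISOLATED_ONES = 0b01010101  # every 1 is isolated
SINGLE_RUN = 0b00111100     # one run of four 1s

ops_isolated = nonzero_operations(ISOLATED_ONES, 8)
ops_run = nonzero_operations(SINGLE_RUN, 8)
```

Both patterns contain four 1s, yet the alternating pattern needs 8 operations (more than plain shift-and-add) while the single run needs only 2, so the operation count depends on the data.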

To resolve these drawbacks, system designers proposed radix-4 Booth encoding, which is known as the Modified Booth Encoding (MBE) scheme and is widely used in parallel multiplier design. Starting from (1) and assuming an even *n*, in the radix-4 system Y can be rewritten as:

$$Y = \sum_{i=0}^{n/2-1} (y_{2i-1} + y_{2i} - 2y_{2i+1}) 2^{2i}$$
(3)

Following the method described for the radix-2 system, the multiplier Y is recoded into the scale factors (-2, -1, 0, 1, 2). Denoting the corresponding partial products as (-2X, -X, 0, X, 2X), the multiplier bits are examined in overlapping groups of 3 bits and, with the help of the truth table shown in Table 2, the encoding is carried out [17].

**Table 2.** Radix-4 Booth encoding truth table.

| $Y_{i+1}$ | Y <sub>i</sub> | $Y_{i-1}$ | Partial Product |
|-----------|----------------|-----------|-----------------|
| 0         | 0              | 0         | 0×Multiplicand  |
| 0         | 0              | 1         | 1×Multiplicand  |
| 0         | 1              | 0         | 1×Multiplicand  |
| 0         | 1              | 1         | 2×Multiplicand  |
| 1         | 0              | 0         | -2×Multiplicand |
| 1         | 0              | 1         | -1×Multiplicand |
| 1         | 1              | 0         | -1×Multiplicand |
| 1         | 1              | 1         | -0×Multiplicand |
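Table 2 can be exercised with a short script. This is a sketch under stated assumptions: the three-bit window is advanced two bit positions per digit, so digit *i* looks at bits $2i+1$, $2i$ and $2i-1$ and carries weight $4^i$; *n* is assumed even and the function names are hypothetical.

```python
def booth_radix4_digits(value, n):
    """Recode an n-bit (n even) two's complement multiplier into radix-4 digits.

    Digit i is y_{2i-1} + y_{2i} - 2*y_{2i+1} with y_{-1} = 0, so every
    digit lies in {-2, -1, 0, 1, 2} as listed in Table 2.
    """
    if n % 2:
        raise ValueError("n must be even")
    y = [(value >> k) & 1 for k in range(n)]
    digits = []
    for i in range(n // 2):
        low = y[2 * i - 1] if i > 0 else 0
        digits.append(low + y[2 * i] - 2 * y[2 * i + 1])
    return digits


def radix4_value(digits):
    """Evaluate sum(d_i * 4**i) over the recoded digits."""
    return sum(d * 4 ** i for i, d in enumerate(digits))
```

For example, `booth_radix4_digits(7, 4)` yields `[-1, 2]`, that is, 2·4 - 1 = 7, using two digits where the radix-2 recoding uses four.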

To encode, 4 types of operations are needed to generate the partial products, as summarized below:

- 1) For scale factor 0, the multiplicand is zeroed.
- 2) For scale factor 1, the multiplicand is transferred directly to the output.
- 3) For scale factor 2, the multiplicand bits are shifted one position to the left.
- 4) For negative scale factors, all of the selected bits are inverted (two's complement negation additionally requires adding 1 at the least significant position).
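The four operations listed above can be put together into a behavioral model of radix-4 partial product generation. The sketch below works on Python integers, so negation is done arithmetically, whereas hardware inverts the bits and adds 1; the function names are hypothetical and the multiplier `y` is assumed to fit in an even number `n` of two's complement bits.

```python
def radix4_partial_products(x, y, n):
    """Return the n/2 weighted partial products of x*y for an n-bit signed y."""
    products = []
    prev = 0  # y_{-1} = 0
    for i in range(n // 2):
        y_mid = (y >> (2 * i)) & 1
        y_hi = (y >> (2 * i + 1)) & 1
        digit = prev + y_mid - 2 * y_hi
        prev = y_hi  # becomes y_{2i-1} of the next group
        # Operations 1 to 3: zero, pass through, or shift left by one.
        magnitude = {0: 0, 1: x, 2: x << 1}[abs(digit)]
        # Operation 4: negate for negative scale factors.
        partial = -magnitude if digit < 0 else magnitude
        products.append(partial << (2 * i))
    return products


def booth_multiply(x, y, n):
    """Multiply by summing the radix-4 Booth partial products."""
    return sum(radix4_partial_products(x, y, n))
```

An 8-bit multiplier produces only four partial products here, which is the halving of partial product rows that makes the scheme attractive.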

The main advantage of the MBE scheme over the radix-2 system, in addition to solving the mentioned drawbacks, is the reduction of the number of partial product rows to half, which underlines the efficiency of this system and draws the attention of circuit designers to its use in parallel multipliers.

This simplification process can be continued to obtain the radix-8 Booth encoding scheme, and several works have been reported on hardware implementation of this system [21-23]. However, as the complexity of such a system is much higher than that of MBE architectures, it is still at an early stage and is out of the scope of this article.

# **3. HARDWARE IMPLEMENTATION OF MBE SYSTEM**

Based on the general truth table of the MBE scheme defined in the previous section, hardware implementation of Booth multipliers started at the beginning of the 1990s. Although a great speed enhancement was achieved by this method, the need for frequencies above 1 GHz made designers reinvestigate the radix-4 Booth algorithm for possible optimizations.

Taking the XOR gate as the standard unit for calculating propagation latency, and using the gates designed in [24], the latency obtained for earlier versions of Booth encoding circuits exceeds 5 XOR gate levels. Also, the presence of glitches in the output waveforms pushed designers to insert buffers in the low latency paths to equalize all routes from inputs to outputs, at the expense of higher power dissipation.

The year 2000 marked the beginning of a new era in the design of high performance MBE architectures, which was achieved by modifications applied to the general truth table (Table 2). Most of the works reported over the following 10 years achieved a delay of 4 XOR gate levels, until in [15] and [17] the designers reported latencies of less than 3 XOR gate levels.

As our objective in this work is the analysis of best reported works to propose an optimized version of Booth encoding structure, four of the best reported works have been presented here with their design considerations. The advantages along with the drawbacks of these architectures are carefully being studied and their performance will be compared using simulation results.

| $y_{2i+1}$ | $y_{2i}$ | $y_{2i-1}$ | $y_i'$ | $X_{sel}$ | $2X_{sel}$ | NEG |
|------------|----------|------------|--------|-----------|------------|-----|
| 0          | 0        | 0          | 0      | 0         | 0          | 0   |
| 0          | 0        | 1          | 1      | 1         | 0          | 0   |
| 0          | 1        | 0          | 1      | 1         | 0          | 0   |
| 0          | 1        | 1          | 2      | 0         | 1          | 0   |
| 1          | 0        | 0          | -2     | 0         | 1          | 1   |
| 1          | 0        | 1          | -1     | 1         | 0          | 1   |
| 1          | 1        | 0          | -1     | 1         | 0          | 1   |
| 1          | 1        | 1          | 0      | 0         | 0          | 1   |

Fig. 2. The truth table reported in [14].

In [14] an extended truth table for MBE system was presented which is shown in Fig. 2. Based on this truth table, two new variables were defined and the sign bit parameter was denoted by NEG. By means of these variables, two separate circuits have been proposed in which the former circuit is used for encoding and the latter one performs the operation of decoding. The whole structure is shown in Fig. 3 and can achieve the latency of 4 XOR gates.
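The control signals of Fig. 2 can be expressed as Boolean functions of the three multiplier bits and checked against the table. The expressions below are derived here from the table rows and are not claimed to be the gate-level circuit of [14]; the decoder expression is the common select-then-invert form, and the addition of NEG at the least significant bit is an assumption used to complete the two's complement negation.

```python
FIG2_ROWS = [
    # (y_{2i+1}, y_{2i}, y_{2i-1}, digit, X_sel, 2X_sel, NEG)
    (0, 0, 0, 0, 0, 0, 0),
    (0, 0, 1, 1, 1, 0, 0),
    (0, 1, 0, 1, 1, 0, 0),
    (0, 1, 1, 2, 0, 1, 0),
    (1, 0, 0, -2, 0, 1, 1),
    (1, 0, 1, -1, 1, 0, 1),
    (1, 1, 0, -1, 1, 0, 1),
    (1, 1, 1, 0, 0, 0, 1),
]


def encode(y_hi, y_mid, y_lo):
    """Encoder: derive (X_sel, 2X_sel, NEG) from one three-bit group."""
    x_sel = y_mid ^ y_lo
    two_x_sel = (y_hi ^ y_mid) & (1 - x_sel)
    neg = y_hi
    return x_sel, two_x_sel, neg


def decode_bit(x_j, x_jm1, x_sel, two_x_sel, neg):
    """Decoder: select x_j (for X) or x_{j-1} (for 2X), then invert if NEG."""
    return ((x_sel & x_j) | (two_x_sel & x_jm1)) ^ neg


def partial_product(x, y_hi, y_mid, y_lo, width):
    """Assemble one width-bit partial product row (modulo 2**width)."""
    x_sel, two_x_sel, neg = encode(y_hi, y_mid, y_lo)
    mask = (1 << width) - 1
    x_bits = x & mask  # two's complement bits of x, sign-extended to width
    row = 0
    for j in range(width):
        x_j = (x_bits >> j) & 1
        x_jm1 = (x_bits >> (j - 1)) & 1 if j > 0 else 0
        row |= decode_bit(x_j, x_jm1, x_sel, two_x_sel, neg) << j
    return (row + neg) & mask  # NEG added as a carry-in at the LSB
```

Note that for the input 111 the row is all ones with NEG = 1, which sums to zero once the carry-in is added; this is why the table can assign NEG = 1 to a zero-valued digit.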



**Fig. 3.** Designed architecture in [14] (a) Booth encoder (b) Booth decoder.

A closer look at the system of Fig. 3 shows that the gate level delay of the decoder circuit is 3 XOR logic gates, which is a high value and degrades the performance of that part. The main drawback of this system, however, is the separation of the encoding and decoding stages, which increases the total gate count and the active area consumed on chip. Meanwhile, the latencies for different inputs are not equal and, as a result, the output waveforms will contain glitches. It must be mentioned that the critical path is the path between $y_{2i-1}$ and the partial product.

In this system, if the *NEG* input can be carried to the first stage of the decoder circuit, the latency of the system can be reduced considerably.

In [15] and [16], a new truth table for MBE scheme was introduced where three variables along with the sign bit parameter were used for hardware implementation of this system. The utilized truth table in [15] is shown in Fig. 4 and again, the design objective was put on implementation of encoder and decoder circuits separately.

With the help of Fig. 4, the designed structure for the MBE architecture is illustrated in Fig. 5. An interesting feature of this circuitry is the absence of the 2X parameter, with Z substituted for it in the generation of partial products.

| $b_{2i+1}$ | $b_{2i}$ | $b_{2i-1}$ | $d_i$   | Neg | X | 2X | Z |
|------------|----------|------------|---------|-----|---|----|---|
| 0          | 0        | 0          | 0       | ×   | 0 | 0  | 0 |
| 0          | 0        | 1          | 1       | 0   | 1 | 0  | 1 |
| 0          | 1        | 0          | 1       | 0   | 1 | 0  | 1 |
| 0          | 1        | 1          | 2       | 0   | 0 | 1  | 1 |
| 1          | 0        | 0          | -2      | 1   | 0 | 1  | 1 |
| 1          | 0        | 1          | -1      | 1   | 1 | 0  | 1 |
| 1          | 1        | 0          | -1      | 1   | 1 | 0  | 1 |
| 1          | 1        | 1          | 0       | ×   | 0 | 0  | 0 |

Fig. 4. Proposed truth table of [15].
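The table of Fig. 4 shows why Z can stand in for 2X: in every row 2X equals Z AND NOT X, and Z is 0 exactly when the digit is zero, which is also where Neg is a don't-care. The sketch below checks this against the table. The decoder expression is only one plausible form consistent with the table (a select by X, an inversion by Neg, and a gating by Z); it is an assumption and not necessarily the circuit of Fig. 5.

```python
FIG4_ROWS = [
    # (b_{2i+1}, b_{2i}, b_{2i-1}, d_i, Neg, X, 2X, Z); None marks a don't-care
    (0, 0, 0, 0, None, 0, 0, 0),
    (0, 0, 1, 1, 0, 1, 0, 1),
    (0, 1, 0, 1, 0, 1, 0, 1),
    (0, 1, 1, 2, 0, 0, 1, 1),
    (1, 0, 0, -2, 1, 0, 1, 1),
    (1, 0, 1, -1, 1, 1, 0, 1),
    (1, 1, 0, -1, 1, 1, 0, 1),
    (1, 1, 1, 0, None, 0, 0, 0),
]


def encode_fig4(b_hi, b_mid, b_lo):
    """Derive (Neg, X, Z); Z is 0 only when all three bits are equal."""
    neg = b_hi
    x = b_mid ^ b_lo
    z = x | (b_hi ^ b_mid)
    return neg, x, z


def decode_bit_fig4(x_j, x_jm1, neg, x, z):
    """Assumed decoder: pick x_j or x_{j-1} by X, invert by Neg, gate by Z."""
    selected = x_j if x else x_jm1
    return z & (selected ^ neg)
```

Because the output is gated by Z, the value of Neg has no effect on zero-valued digits, which is consistent with the × entries in the first and last rows.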


**Fig. 5.** Designed architecture in [15] (a) Booth encoder (b) Booth decoder.

The main advantage of this structure is its good speed performance compared to its counterparts, as the latency of the critical path has been reduced to almost 2 XOR logic gates. Also, because the paths from inputs to outputs are uniform, the output waveforms are free of glitches, while the active area consumed by one PPG cell is arguably low.

Following the same procedure discussed for [15], Fig. 6 shows the truth table reported in [16] and as it is obvious, except for the case where all of the three bits of the multiplier *B* are 1 (state 111),  $X1_a$  and  $X1_b$  have opposite logic values.

| $b_{i+1}$ | $b_i$ | $b_{i-1}$                      | value | X1_a | X2_b | Z | Neg |
|-----------|-------|--------------------------------|-------|------|------|---|-----|
| 0         | 0     | 0                              | 0     | 1    | 0    | 1 | 0   |
| 0         | 0     | 1                              | 1     | 0    | 1    | 1 | 0   |
| 0         | 1     | 0                              | 1     | 0    | 1    | 0 | 0   |
| 0         | 1     | 1                              | 2     | 1    | 0    | 0 | 0   |
| 1         | 0     | 0                              | -2    | 1    | 0    | 0 | 1   |
| 1         | 0     | 1                              | -1    | 0    | 1    | 0 | 1   |
| 1         | 1     | 0                              | -1    | 0    | 1    | 1 | 1   |
| 1         | 1     | 1                              | 0     | 1    | 0    | 1 | 1   |

Fig. 6. Introduced truth table in [16].

By means of proposed relations and following the same target for separate implementation of encoder and decoder circuits, the designed structure which is illustrated in Fig. 7, achieves gate level delay equal to 3 XOR logic gates which is a good speed enhancement.



**Fig. 7.** Designed architecture in [16] (a) Booth encoder (b) Booth decoder.

The main disadvantage of this structure is nonuniformity of the paths from inputs to the outputs because of three input OR gate in decoder circuit. To equalize the latency in all paths, the two input OR gate must be redesigned with different transistor sizes which will considerably degrade the speed performance of whole system.

A brief comparison shows that, despite the similarities in the truth tables used, the architecture of [16] cannot compete with the MBE circuitry of [15] shown in Fig. 5 in terms of speed, while the total gate count is also higher for the architecture of Fig. 7.

In spite of the differences between the structures designed in [14], [15] and [16], all of them, along with most of the previous works, followed the same approach in designing their Booth encoding circuits. They all focused on separate circuits for the Booth encoder and Booth decoder for PPG and, as a result, none of them could achieve latencies of less than 2 XOR logic gates. In [17], however, the designers introduced a new truth table, shown in Fig. 8, and their emphasis was on merging the encoding and decoding stages to improve the speed of the system and reduce the total transistor count for lower active area consumption on chip.

| State | $b_{2i+1}$ | $b_{2i}$ | $b_{2i-1}$ | $d_i$ | X | N | L | Neg |
|-------|------------|----------|------------|-------|---|---|---|-----|
| III   | 0          | 0        | 0          | 0     | 0 | 1 | 0 | 0   |
| I     | 0          | 0        | 1          | 1     | 1 | 0 | 0 | 0   |
| I     | 0          | 1        | 0          | 1     | 1 | 0 | 0 | 0   |
| II    | 0          | 1        | 1          | 2     | 0 | 0 | 1 | 0   |
| III   | 1          | 0        | 0          | -2    | 0 | 1 | 0 | 1   |
| I     | 1          | 0        | 1          | -1    | 1 | 0 | 0 | 1   |
| I     | 1          | 1        | 0          | -1    | 1 | 0 | 0 | 1   |
| II    | 1          | 1        | 1          | 0     | 0 | 0 | 1 | 1   |

Fig. 8. Designed truth table in [17].

As Fig. 8 illustrates, three parameters denoted X, N and L are defined along with the sign bit extension parameter *Neg* to cover the three possible states and simplify the truth table. By means of this table, a merged circuit level implementation becomes possible, which considerably enhances the speed. The proposed architecture shown in Fig. 9 achieves a latency of one XOR logic gate plus one transistor, which is the lowest value reported in the literature and outperforms the other designs in terms of speed.
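A useful property of the table in Fig. 8 is that X, N and L are one-hot: exactly one of them is 1 in every row, so each can enable a single path to the output. The sketch below checks this and gives per-state output expressions derived here from the $d_i$ column; these expressions and the function names are assumptions for illustration and are not the transistor-level circuit of [17].

```python
FIG8_ROWS = [
    # (b_{2i+1}, b_{2i}, b_{2i-1}, d_i, X, N, L, Neg)
    (0, 0, 0, 0, 0, 1, 0, 0),
    (0, 0, 1, 1, 1, 0, 0, 0),
    (0, 1, 0, 1, 1, 0, 0, 0),
    (0, 1, 1, 2, 0, 0, 1, 0),
    (1, 0, 0, -2, 0, 1, 0, 1),
    (1, 0, 1, -1, 1, 0, 0, 1),
    (1, 1, 0, -1, 1, 0, 0, 1),
    (1, 1, 1, 0, 0, 0, 1, 1),
]


def encode_fig8(b_hi, b_mid, b_lo):
    """Derive the one-hot state signals (X, N, L) and Neg."""
    x = b_mid ^ b_lo               # state I: digit is +1 or -1
    n = (1 - b_mid) & (1 - b_lo)   # state III: digit is 0 or -2
    l = b_mid & b_lo               # state II: digit is +2 or 0
    return x, n, l, b_hi


def pp_bit(x_j, x_jm1, x, n, l, neg):
    """Partial-product bit selected by whichever of X, N, L is active."""
    if x:
        return x_j ^ neg            # X or inverted X
    if n:
        return neg & (1 - x_jm1)    # 0, or inverted 2X when Neg = 1
    return (1 - neg) & x_jm1        # 2X, or 0 when Neg = 1
```

With these expressions the rows for 000 and 111 come out as all zeros, so the +1 correction for negation is only needed when Neg = 1 and L = 0.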



One important point which must be considered for the circuit of Fig. 9 is the logic level of the output partial product. In CMOS technologies the NMOS transistor is a good conductor of logic 0, while the PMOS transistor passes logic 1 well. Considering this point, for some input states the output partial product in Fig. 9 may differ from the desired output value by one MOS threshold voltage, which necessitates the use of buffers at the output node for full range recovery and must be taken into account in parallel multiplier design.



Fig. 10. Comparison between different MBE structures simulated with TSMC 0.18µm CMOS technology.

For better comparison between these works, post-layout simulations using HSPICE for TSMC standard 0.18µm CMOS technology at a 1.8V power supply were performed for the redesigned architectures of [14], [15], [16] and [17]. Each circuit was loaded with buffers to provide a more realistic simulation environment.

The propagation delay was measured from the point where the earliest transition reaches 50% of  $V_{dd}$ , to 50%  $V_{dd}$  of the latest output signal.
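The measurement rule above can be applied to exported waveform samples. The following is a minimal sketch, assuming each waveform is available as a pair of equal-length lists of time points and voltages (an assumption; simulator output formats differ) and using linear interpolation between samples. The function names are hypothetical.

```python
def crossing_time(times, volts, level):
    """Return the time of the first crossing of level (linear interpolation)."""
    for k in range(len(times) - 1):
        v0, v1 = volts[k], volts[k + 1]
        if v0 == level:
            return times[k]
        if (v0 - level) * (v1 - level) < 0:
            t0, t1 = times[k], times[k + 1]
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    if volts and volts[-1] == level:
        return times[-1]
    raise ValueError("waveform never crosses the requested level")


def propagation_delay(inputs, outputs, vdd=1.8):
    """Earliest input 50% crossing to latest output 50% crossing.

    inputs and outputs are lists of (times, volts) pairs.
    """
    half = vdd / 2
    t_start = min(crossing_time(t, v, half) for t, v in inputs)
    t_end = max(crossing_time(t, v, half) for t, v in outputs)
    return t_end - t_start
```

Taking the latest output crossing means that the slowest path sets the reported delay, which is what makes the figure comparable across circuits with unequal path lengths.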

The final result which is shown in Fig. 10, demonstrates that the lowest value of delay belongs to MBE structure of [17] and is 123ps.

Table 3 also demonstrates the comparison between these works based on their circuitry and simulation results obtained by the authors.

It must be mentioned that there are other state of the art works reported in the literature dealing with hardware implementation of the MBE scheme [24-26]. In [24], the design is discussed using CNFET technology, while in [25] the focus is on FPGA implementation. Also, in [26], a multiplier based on approximate error-tolerant computing is proposed, using an algorithm in which the final products are generated with an acceptable percentage of error. But as the emphasis in this work is on CMOS implementation of accurate multipliers, such works are not considered for further analysis.

 Table 3. Comparison between selected MBE architectures.

| Work                   | [14]    | [15]                   | [16]    | [17]                  |
|------------------------|---------|------------------------|---------|-----------------------|
| Technology(µm)         | 0.18    | 0.18                   | 0.18    | 0.18                  |
| Gate Count             | 8       | 6                      | 8       | 4.5                   |
| Gate Level Delay (XOR) | 4 gates | 2 gates + 1 transistor | 3 gates | 1 gate + 1 transistor |
| Glitch Effect          | Yes     | No                     | Yes     | No                    |
| Delay(ps)              | 543     | 228                    | 392     | 123                   |

# 4. THE MULTIPLIER DESIGN

#### 4.1. The Optimized MBE Structure

Considering the points mentioned for MBE architecture of Fig. 9 and in order to achieve full logic level without using any buffers at the output nodes, some modifications are performed to the circuit of Fig. 9 and the optimized version of MBE architecture is shown in Fig. 11.



Fig. 11. Optimized MBE Architecture.

The main differences between the proposed scheme and circuitry of Fig. 9 are as follows:

- 1) In the newly designed structure, the NMOS transistor connected to $V_{dd}$ is replaced with a PMOS transistor, as PMOS is a good conductor of logic 1, and its gate is driven by $\overline{Neg}$.
- 2) In the paths with one threshold voltage drop, complementary transistors are added to form a Transmission Gate (TG) pair for full level recovery.
- 3) The XOR gate is replaced with the two-output XOR/XNOR gate reported in [27], while the latency remains equal for both gates.

Although this structure adds a latency of one transistor to the schematic of Fig. 9, it removes the need for buffers for full level recovery. While the gate level delay becomes 1.5 XOR logic gates for this circuit, post-layout simulation results using HSPICE show a delay of 168ps, which is better than the measured delays reported for [14], [15] and [16].

#### 4.2. The Multiplier

A 16×16 bit parallel multiplier is designed in which the PPGB is implemented by means of the proposed MBE structure shown in Fig. 11. For the PPRT, 4-2 compressors are selected as the building blocks. Many works have been reported in the literature on hardware implementation of 4-2 compressors [27-29]. The lowest latency was reported in [29], where a gate level delay of less than 2 XOR logic gates was achieved. That configuration is employed in the design of the multiplier.
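For readers unfamiliar with the building block, the behavior of a 4-2 compressor can be modeled as two cascaded full adders. This is the textbook functional model and is only an illustrative assumption here; it is not the fast circuit of [29], which realizes the same function with a shorter critical path.

```python
def full_adder(a, b, c):
    """Return (sum, carry) of three input bits."""
    total = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return total, carry


def compressor_4_2(x1, x2, x3, x4, cin):
    """Compress four bits plus a carry-in into (sum, carry, cout).

    The outputs satisfy x1 + x2 + x3 + x4 + cin == sum + 2 * (carry + cout),
    and cout does not depend on cin, so carries do not ripple along a row.
    """
    s1, cout = full_adder(x1, x2, x3)
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout
```

Each compressor level reduces four rows of partial products to two, so the row count halves per level.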



Fig. 12. Structure of the designed multiplier.

For the third stage, a Carry Select Adder (CSA) is chosen for fast summation of the remaining two rows of bits. Since the final result will contain 32 bits, two 16-bit CSAs can be used for the final summation [30]. Since the first four bits are produced inside the PPRT, two 14-bit CSAs are employed. The final structure of the multiplier is shown in Fig. 12.
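The principle of carry-select addition can be sketched behaviorally: each block computes its sum for both possible carry-in values, and the real carry then only selects between the two precomputed results. The model below is an assumption-laden illustration; the function names are hypothetical, and the 14-bit block size merely echoes the text, since the internal partitioning of the adder in [30] is not described here.

```python
def block_add(a, b, carry_in, width):
    """Add two width-bit blocks; return (sum modulo 2**width, carry_out)."""
    total = a + b + carry_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, width, block):
    """Add two unsigned width-bit numbers using carry-select blocks."""
    result, carry, shift = 0, 0, 0
    while shift < width:
        w = min(block, width - shift)
        m = (1 << w) - 1
        a_blk = (a >> shift) & m
        b_blk = (b >> shift) & m
        # Both candidates are computed before the incoming carry is known.
        sum0, c0 = block_add(a_blk, b_blk, 0, w)
        sum1, c1 = block_add(a_blk, b_blk, 1, w)
        # The incoming carry acts only as a multiplexer select.
        s, carry = (sum1, c1) if carry else (sum0, c0)
        result |= s << shift
        shift += w
    return result, carry
```

For a 28-bit addition, `carry_select_add(a, b, 28, 14)` uses two 14-bit blocks, so the delay is roughly one block addition plus one selection rather than a full 28-bit carry ripple.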

Based on post-layout simulations by HSPICE for TSMC standard 0.18µm CMOS technology at a 1.8V power supply, the measured delay for the 4-2 compressor of [29] was 192ps, while the latency obtained for the 28-bit addition was 1320ps.

To measure the total delay of the proposed multiplier, the layout of the whole system was drawn and all parasitics were extracted. Fig. 13 shows the layout of the designed 16×16 bit multiplier, which occupies an active area of 386µm×153µm on chip.



Fig. 13. Layout of the proposed multiplier.

The measured delay for the proposed multiplier is 1992ps, which shows that this system can operate at a frequency of 500MHz. Table 4 shows the design specifications of the proposed multiplier.

| Parameter                | Value       |
|--------------------------|-------------|
| Technology(µm)           | 0.18        |
| Power Supply(V)          | 1.8         |
| Propagation Delay(ps)    | 1992        |
| Layout Size              | 386µm×153µm |
| Operating Frequency(MHz) | 500         |

Table 4. Design Specifications of Designed Multiplier.
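As a quick consistency check on the figures in the specification table above, and assuming that one multiplication must complete within a single clock period, the measured delay and layout dimensions give:

$$f_{max} \approx \frac{1}{t_{d}} = \frac{1}{1992\ \text{ps}} \approx 502\ \text{MHz}$$

$$A = 386\ \mu\text{m} \times 153\ \mu\text{m} = 59058\ \mu\text{m}^2 \approx 0.059\ \text{mm}^2$$

Both values agree with the quoted 500MHz operating frequency and with an active area below 0.06mm<sup>2</sup>.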

# 5. CONCLUSION

In this review paper the best reported works on radix-4 Booth encoder-decoders have been studied. The advantages and drawbacks of these works were examined carefully, and an optimized version for circuit level implementation of the MBE scheme has been proposed which shows a delay of 1.5 XOR gate levels without any glitches at the outputs.

Post-layout simulation results by HSPICE for TSMC standard 0.18µm CMOS technology at a 1.8V power supply show a delay of 168ps for the implemented circuit, which illustrates its high potential for use in high speed parallel multipliers.

To demonstrate this, a 16×16 bit parallel multiplier was implemented by means of the designed architecture. The measured delay from inputs to outputs is less than 2ns, which shows that the multiplier can operate at a frequency of 500MHz, while the active area of the layout does not exceed 0.06mm<sup>2</sup>.

# REFERENCES

- [1] Anton Glaser, "History of Binary and Other Nondecimal Numeration", Tomash Publishers, 1981.
- [2] Ohkubo N., Suzuki M., Shinbo T. et al., "A 4.4 ns CMOS 54×54-b Multiplier Using Pass-Transistor Multiplexer," *IEEE Journal of Solid-State Circuits*, Vol. 30, Issue 3, pp. 251-257, 1995.
- [3] Abu-Khater I.S., Bellaouar A. and Elmasry, M.I., "Circuit Techniques for CMOS Low-Power High-Performance Multipliers," *IEEE Journal of Solid-State Circuits*, Vol. 31, Issue 10, pp. 1535-1546, 1996.
- [4] Behrooz Parhami, "Computer Arithmetic: Algorithms and Hardware Designs", New York, Oxford University Press, 2000.
- [5] Andrew D. Booth, "A Signed Binary Multiplication Technique," The Quarterly Journal of Mechanics and Applied Mathematics, Volume IV, Pt. 2, 1951.
- [6] C. S. Wallace, "A Suggestion for a Fast Multiplier," *IEEE Trans. on Computers*, Vol. 13, pp. 14-17, 1964.
- [7] L. Dadda, "Some Schemes for Parallel Multipliers," Alta Frequenza, Vol. 34, pp. 349-356, 1965.
- [8] Wen-Chang Yeh and Chein-Wei Jen, "High-Speed Booth Encoded Parallel Multiplier Design," *IEEE Transactions on Computers*, Vol. 49, No. 7, July 2000.
- [9] Whitney J. Townsend, Earl E. Swartzlander Jr., Jacob A. Abraham, "A Comparison of Dadda and Wallace Multiplier Delays," *Proceedings of SPIE* -*The International Society for Optical Engineering*, 2003.
- [10] D. Naresh and G. Babu Kande, "High Speed Signed multiplier for Digital Signal Processing Applications," IOSR Journal of Electrical and Electronics Engineering (IOSR-JEEE), Vol. 8, Issue 2, pp. 57-61.
- [11] K. Yano, T. Yamanaka, T. Nishida, M. Saito, K. Shimohigashi, A. Shimizu, "A 3.8-ns CMOS 16\*16b Multiplier Using Complementary Pass-Transistor Logic," *IEEE Journal of Solid-State Circuits*, Vol. 25, Issue: 2, pp. 388-395, Apr 1990.
- [12] Hsin-Lei Lin, Chang R.C. and Ming-Tsai Chan, "Design of a Novel Radix-4 Booth Multiplier," The 2004 IEEE Asia-Pacific Conference on Circuits and Systems, Vol. 2, pp. 837-840, 2004.
- [13] Shiann-Rong Kuang, Jiun-Ping Wang, and Cang-Yuan Guo, "Modified Booth Multipliers with a Regular Partial Product Array," IEEE Transactions on Circuits and Systems—II: Express Briefs, Vol. 56, No. 5, pp. 404-408, May 2009.
- [14] Kyung-Ju Cho, Kwang-Chul Lee, Jin-Gyun Chung, and Keshab K. Parhi, "Design of Low-Error Fixed-Width Modified Booth Multiplier," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, Vol. 12, No. 5, pp. 522-531, May 2004.

- [15] A. Fathi, S. Azizian, R. Fathi, H.G. Tamar, "Low latency, Glitch-Free Booth Encoder-Decoder for High Speed Multipliers," *IEICE Electronics Express*, Vol. 9, No. 16, pp. 1335-1341, 2012.
- [16] Ravindra P. Rajput, M.N. Shanmukha Swamy, "High speed Modified Booth Encoder Multiplier for Signed and Unsigned Numbers," 14th International Conference on Modelling and Simulation, pp. 649-654, 2012.
- [17] A. Fathi, S. Azizian, Kh. Hadidi, A. Khoei, "Ultra High Speed Modified Booth Encoding Architecture for High Speed Parallel Accumulations," *IEICE transactions on electronics*, Vol. 95, No. 4, pp. 706-709, 2012.
- [18] Vincent P. Heuring, Harry F. Jordon, "Computer Systems Design and Architecture", *Pearson Education*, Singapore, 2003.
- [19] Sukhmeet Kaur, Suman, Manpreet Signh Manna, "Implementation of Modified Booth Algorithm (Radix 4) and its Comparison with Booth Algorithm (Radix-2)," Advance in Electronic and Electric Engineering, Vol. 3, No. 6, pp. 683-690, 2013.
- [20] Sakshi Rajput, Priya Sharma, Gitanjali, and Garima, "High Speed and Reduced Power - Radix-2 Booth Multiplier," International Journal of Computational Engineering & Management (IJCEM), Vol. 16, Issue 2, pp. 25-31, March 2013.
- [21] Ramya Muralidharan, and Chip-Hong Chang, "Radix-8 Booth Encoded Modulo 2n-1 Multipliers With Adaptive Delay for High Dynamic Range Residue Number System," IEEE Transactions on Circuits and Systems—I: Regular Papers, Vol. 58, No. 5, pp. 982-993, May 2011.
- [22] Ramya Muralidharan, and Chip-Hong Chang, "Radix-4 and Radix-8 Booth Encoded Multi-Modulus Multipliers," IEEE Transactions on Circuits and Systems—I: Regular Papers, Vol. 60, No. 11, pp. 2940-2952, Nov 2013.
- [23] Honglan Jiang, Jie Han, Fei Qiao, and Fabrizio Lombardi, "Approximate Radix-8 Booth Multipliers for Low-Power and High-Performance Operation," *IEEE Transactions on Computers*, Vol. 65, No. 8, pp. 2638-2644, Aug 2016.
- [24] Darvishi, Mehdi, and Mehdi Bagherizadeh, "A New High-Performance Bridge Structure for 4-to-2 Compressor using CMOS and CNFET Technology," International Journal of Modern Education and Computer Science (IJMECS), .6: 48, 2017.
- [25] Farmani, Ali, and Hossein Balazadeh Bahar, "Hardware Implementation of 128-Bit AES Image Encryption with Low Power Techniques on FPGA to VHDL," Majlesi Journal of Electrical Engineering 6.4, 2012.
- [26] Liu, Weiqiang, et al, "Design of Approximate Radix-4 Booth Multipliers for Error-Tolerant Computing," *IEEE Transactions on Computers*, 2017.

- [27] Chip-Hong Chang, Jiangmin Gu, and Mingyan Zhang, "Ultra Low-Voltage Low-Power CMOS 4-2 and 5-2 Compressors for Fast Arithmetic Circuits," *IEEE Transactions on Circuits and Systems I*, Vol. 51, Issue 10, pp. 1985-1997, 2004.
- [28] Amir Fathi, Sarkis Azizian, Khayrollah Hadidi, Abdollah Khoei and Amin Chegeni, "CMOS Implementation of a Fast 4-2 Compressor for Parallel Accumulations," 2012 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1476-1479, May 2012.

- [29] Amir Fathi, Sarkis Azizian, Khayrollah Hadidi and Abdollah Khoei, "A Novel and Very Fast 4-2 Compressor for High Speed Arithmetic Operations," *IEICE Trans. Electron.*, Vol. E95-C, No. 4, April 2012.
- [30] B. Ramkumar, and Harish M. Kittur, "Low-Power and Area-Efficient Carry Select Adder," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, Vol. 20, No. 2, pp. 371-375, Feb 2012.