# Reducing Inference Latency with Concurrent Architectures for Image Recognition

Ramyad Hadidi<sup>†1</sup>, Jiashen Cao<sup>†1</sup>, Michael S. Ryoo<sup>2</sup>, Hyesoon Kim<sup>1</sup>

Georgia Institute of Technology<sup>1</sup>, Stony Brook University<sup>2</sup>
† Same Contribution

**Abstract.** Satisfying the high computation demand of modern deep learning architectures makes achieving low inference latency challenging. Current approaches to decreasing latency only increase parallelism within a layer. This is because architectures typically capture a single-chain dependency pattern that prevents efficient distribution with higher concurrency (i.e., simultaneous execution of one inference among devices). Such single-chain dependencies are so widespread that they even implicitly bias recent neural architecture search (NAS) studies. In this visionary paper, we draw attention to an entirely new space of NAS that relaxes the single-chain dependency to provide higher concurrency and distribution opportunities. To quantitatively compare these architectures, we propose a score that encapsulates crucial metrics such as communication, concurrency, and load balancing. Additionally, we propose a new generator and transformation block that consistently deliver superior architectures compared to current state-of-the-art methods. Finally, our preliminary results show that these new architectures reduce the inference latency and deserve more attention.

# 1 Introduction & Motivation

Increasingly deeper and wider convolutional/deep neural networks (CNN/DNN) [37, 40, 51] with higher computation demands are continuously attaining higher accuracies. Nevertheless, the high computation and memory demands of these DNNs hinder achieving low inference latency [14]. Although current platforms exploit parallelism, we discover that, since most architectures capture a single-chain dependency pattern [26, 38, 39], shown in Figures 1a & b, we cannot efficiently extend concurrency and distribution beyond the explicit parallelism currently exposed within intra-layer computations (i.e., matrix-matrix multiplications) to reduce the latency of an inference. In other words, distribution and concurrency, if any, are implemented at the data level [17], which only increases the throughput.

The status quo approaches in reducing the inference latency are always applied after an architecture is defined (e.g., reducing parameters with weight pruning [16] or reducing computation with quantization [43]). Additionally, for extremely large architectures, limited model parallelism is applied on final layers (i.e., large fully-connected layers that do not fit in the memory [11–13]). However, since model-parallelism methods do not change the architecture, distributing all



Fig. 1. Sampled Architectures Overview – (a) & (b) Limited concurrency and distribution due to single-chain dependency. (c) Improved concurrent architecture.

layers with such methods adds several synchronization/merging points, incurring high communication overheads (Figure 1a & b). We discover that the single-chain inter-layer dependency pattern, common in all the well-known architectures and even in state-of-the-art neural architecture search (NAS) studies [48], prevents the efficient model distribution for reducing inference latency.

This visionary paper addresses the single-chain data dependency in current architecture designs and endeavors to inspire discussion of new concurrent architectures. To do so, first, we analyze architectures generated by recent unbiased NAS studies [48] and discover that scaling/staging blocks implicitly enforce dependencies. Then, we generate new architectures with prior generators and our new distance-based network generator, using our new probabilistic scaling block. Then, to quantitatively compare the generated architectures, we propose a concurrency score that encapsulates important metrics such as communication, load balancing, and overlapped computations, by reformulating the problem as a hypergraph partitioning problem [4, 27]. Based on the scores and experiments, our generated architectures have higher concurrency and are more efficient for distribution than current architectures, an example of which is shown in Figure 1c. Additionally, as shown in Figure 2, they provide competitive accuracy while delivering high concurrency, directly proportional to inference latency (Figure 8). Our experimental results (on over 1000 samples) show that our architectures achieve 6-7x faster inference time. As an added benefit, the current methods for reducing the inference latency can be applied on top of our generated architectures. Our contributions are as follows:

Addressing Single-Chain Data Dependencies: Our concurrent architectures created by network generators (especially the new distance-based generator) break current biased designs by delivering high concurrency.



Fig. 2. Accuracy vs. Concurrency Score – Randomly sampled concurrent architectures generated with our NAS consistently achieve competitive accuracies with a higher concurrency and distribution opportunities during an inference (Flower-102, §3).

Proposing Representative Concurrency Score: Our problem formulation based on hypergraph theory encapsulates critical metrics to quantitatively compare all architectures for efficient distribution and concurrency.

# 2 Related Work

Computation & Parameter Reduction: Reducing computation and parameters to reduce inference latency is an active research area. These techniques are applied after an architecture is fixed. One common approach is to remove the weak connections with weight pruning [2, 16, 30, 45, 49], in which the close-to-zero weights are pruned away. It has also been shown that moderate pruning with iterative retraining enables superior accuracy [16]. Quantization and low-precision inference [6, 10, 24, 29, 43] change the representation of numbers for faster calculations. Several methods have also been proposed for binarizing the weights [7, 28, 36]. The concurrent architectures can also benefit from these approaches, making them complementary for further reducing inference latency.

Concurrency & Distribution: With increasingly larger architectures and widespread usage of deep learning, distribution has gained attention [8, 11, 20, 32, 42]. Most of the techniques exploit either data or model parallelism [8, 26]. Data parallelism only increases the throughput of the system and does not affect the latency. Model parallelism divides the work of a single inference. However, model parallelism keeps the connections intact. Thus, applying model parallelism on intra-layer computations results in a huge communication overhead for sharing the partial results after each layer due to the existing single-chain dependency. SplitNet [22] focuses on improving the concurrency opportunity within an architecture by explicitly enforcing dataset semantics in the distribution of only the final layers. Each task needs to be handcrafted individually for each dataset by examining the semantics in the dataset. In this paper, we propose concurrent architectures that are generated by NAS by considering all important factors for distribution, which has not been explored by prior work.

Neural Architecture Search: With the growing interest in automating the search for architectures, several studies [3, 31, 37, 41, 48, 50, 51] have proposed new optimization methods. Most of these studies [50, 51] utilize an LSTM controller for generating the architecture. However, as pointed out in [48], the search space in these studies is determined by the implicit assumptions in network generators and sometimes by explicit staging (i.e., downsampling spatially while upsampling channels). Although Xie et al. [48] aimed to remove all the implicit wiring biases from the network generator by using classical random graph generator algorithms, they introduced a scaling/staging bias in the final architecture to deal with a large amount of computation. Such stagings create a merging point after a stage where all the features are collected and downsampled before the next stage. Hence, the generated architecture still carries the single chain of dependency, which limits further concurrency. In contrast, our proposed architectures do not enforce such a dependency because they remove this bias. Moreover, compared to prior work, our target is to reduce inference latency by increasing concurrency, which has not been explored before.

# 3 Concurrent Architectures

Here, we propose concurrent architectures that break the single-chain dependency pattern for enabling concurrent execution of an inference. To improve distribution and concurrency, we aim to search for an architecture that has minimal communication overhead and is load balanced when it is distributed. To do so, the following provides the general problem formulation, while §3.1 and §3.2 describe our implementation details. In §3.3, we extend the representation to quantitatively study distribution and concurrency opportunities, derived by reformulating the problem as a hypergraph partitioning problem.

Overview: The current design of neural architectures is optimized for prediction accuracy and has an implicit bias towards the single-chain approach [48, 50], as we discussed in §1. This bias limits concurrency and distribution for reducing inference latency. In other words, only the computation within a layer is performed in parallel and not the computation within a model. To tackle this challenge, we aim to consider concurrency and distribution during the design stage and test if such architectures provide higher concurrency with good accuracy. To do so, first, we use network generators to create a random graph structure, which represents a potential architecture. Among all generated architectures, we sample (without any optimized search) and evaluate generated architectures with our proposed concurrency score. Then, we transform the graph to a DNN and perform experiments. Our final results show a promising direction worth exploring.

**DAG Representation:** A neural architecture, $\mathcal{N}$, can be represented as a directed acyclic graph (DAG) because the computation flow always goes in one direction without looping. We define a DAG as $\mathcal{G} = (V, E)$ where V and E are sets of vertices and edges, respectively. We define a network generator, f, as a function that constructs a random DAG. f creates the edge set, E, and defines the source and sink vertices for each edge, regardless of the type of the vertices. Although network generators could be deterministic (e.g., a generator implemented with a NAS approach), we are interested in stochastic network generators. The reasons are two-fold. First, the stochastic generator provides a larger search space than the deterministic generator, so it is more likely to remove any bias. Second, since, unlike prior work, we do not use scaling/staging to glue different parts of our NAS-generated network [48] (shown in Figure 1b), stochastic generators provide more options for a potential solution. Note that the generated DAG only represents the dataflow and does not include the weights, which are learned in subsequent steps. §3.1 provides more details about our network generators and how we utilize them to create a DAG.

**DAG to DNN:** Once we have found a promising DAG representation after the concurrency score study, we transform the DAG into an actual DNN. Vertices in the DAG are components (e.g., layers or sub-networks) and edges are connections. During the transformation, we convert each node in the DAG to a block of layers and connect the blocks according to the corresponding edges in the DAG. Each vertex, $V_i$, has several properties such as the type of the layer and its properties (e.g., depth, width, activation size, etc.). In this paper, we use a uniform computation in vertices: ReLU, 3×3 separable convolution [5], and batch normalization [19].

#### 3.1 Network Generators

We use three classical random graph generators as baselines. Additionally, after discovering that state-of-the-art generators do not generate a concurrent architecture, we propose a new graph generator with distance-based heuristics. Below, we describe the generators, identified by how their stochastic nature influences the graph. Note that although the first three generators are based on [48], to generate concurrent architectures, we have removed the introduced staging blocks, which enforce the single-chain dependency in prior work. Thus, all the studied architectures in this work are novel and have never been studied before.

Once we obtain an undirected random graph from the generator, we convert the undirected graph to a DAG by using the depth-first search algorithm. The vertices with a smaller vertex ID are traversed earlier than vertices with a larger ID. As the final step, we add an input vertex to all vertices without predecessors and an output vertex to all vertices without successors. This ensures that we obtain a DAG with a single source and sink.
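The conversion described above can be illustrated with a minimal Python sketch. It is one plausible reading, not the exact procedure used in the experiments: we assume that each undirected edge is oriented from the vertex discovered earlier by the depth-first search to the vertex discovered later, which always yields an acyclic graph. The function name `to_dag` and the vertex labels `"in"` and `"out"` are hypothetical.

```python
def to_dag(n, undirected_edges):
    """Orient an undirected graph on vertices 0..n-1 into a single-source,
    single-sink DAG (assumed rule: earlier DFS discovery -> later discovery)."""
    adj = {v: set() for v in range(n)}
    for u, v in undirected_edges:
        adj[u].add(v)
        adj[v].add(u)

    # Iterative DFS; vertices with smaller IDs are traversed first.
    order = {}
    for root in range(n):
        if root in order:
            continue
        stack = [root]
        while stack:
            v = stack.pop()
            if v in order:
                continue
            order[v] = len(order)
            # Push larger IDs first so that the smallest ID is popped next.
            stack.extend(w for w in sorted(adj[v], reverse=True)
                         if w not in order)

    # Orient every edge along the discovery order (self-loops are dropped).
    dag = set()
    for u, v in undirected_edges:
        if u != v:
            dag.add((u, v) if order[u] < order[v] else (v, u))

    # Attach the input vertex to vertices without predecessors and the
    # output vertex to vertices without successors.
    has_pred = {v for _, v in dag}
    has_succ = {u for u, _ in dag}
    for v in range(n):
        if v not in has_pred:
            dag.add(("in", v))
        if v not in has_succ:
            dag.add((v, "out"))
    return dag


ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
dag_edges = to_dag(4, ring)
```

Because every edge points from a lower to a higher discovery index, no cycle can form, whichever undirected generator produced the graph.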

- (1) Independent Probability: In this group, the probability of adding an edge is independent of other properties. An example is the Erdős and Rényi model (ER) [9], in which an edge exists with a probability of P. Generators with independent probability completely ignore the graph structure and create a connected graph (Figure 3a) that is hard to distribute efficiently.
- (2) Degree Probability: In this group, the probability of adding an edge is defined by the degree of one of its connected vertices. A vertex with a higher degree has a higher probability of accepting a new edge. Figure 3b shows an example of such a generator. The Barabási-Albert model (BA) [1] first adds M disconnected vertices; then, for each new vertex added until the total reaches N, it adds M edges with a linear probability proportional to the degree of each existing vertex (i.e., a total of M(N-M) edges). Generators with degree probability create a tree-structured graph, in which at least one vertex is strongly connected to other vertices. Such a graph structure is hard to distribute since all the vertices are dependent on at least one vertex, if not more.
- (3) Enforced Grouping: In this group, initially, a pre-defined grouping is performed on disconnected vertices and then edges are added based on the groups. Small world graphs [23, 33, 44] are good examples. In one approach (WS) [44], vertices are placed in a ring and each one is connected to K/2 neighbors on both sides. Then, in a clockwise loop over the vertices, the existing edge to the $i_{th}$ neighbor is rewired with a uniform probability of P, repeated K/2 times. As shown in Figure 3c, a graph generated with the WS algorithm tends to form a single-chain structure if P is small. With a larger P, the structure becomes similar to ER.
- (4) Distance Probability: In distance probability (DP), initially, a pre-defined grouping is performed on disconnected vertices, then a distance probability function defines the existence of an edge. We first arrange the vertices in a ring. Then, the probability of adding an edge between two vertices depends on their distance. In other words, closer vertices have a higher probability of getting edges. Distance Metrics: We define the distance d as the smallest number of nodes between two nodes in the ring, plus one. The maximum distance is half of the total number of nodes, which is N/2. We use the distance to re-scale the probability P used in WS, with an exponential re-scaling function:

$$P_{\text{new}} = \alpha P^{\beta d},\tag{1}$$

in which  $\alpha$  and  $\beta$  are constants. The probability quickly decreases as the distance increases. This mechanism naturally creates multiple locally strongly connected graphs, Figure 3d, which can be distributed. However, we still need to examine the distribution and concurrency opportunities, which are presented in §3.3.
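As an illustration of Equation (1), the following minimal Python sketch samples an undirected DP graph on a ring. It is a sketch under stated assumptions rather than our exact generator: the helper names, the default values of `alpha` and `beta`, the clipping of the probability to 1, and the choice to test every vertex pair independently are all hypothetical.

```python
import random


def ring_distance(i, j, n):
    """Distance d on a ring of n vertices: nodes in between, plus one."""
    k = abs(i - j)
    return min(k, n - k)


def dp_edge_probability(p, d, alpha=1.0, beta=1.0):
    """Equation (1): P_new = alpha * P^(beta * d), clipped to at most 1
    (the clipping is an assumption added for safety)."""
    return min(1.0, alpha * p ** (beta * d))


def dp_graph(n, p, alpha=1.0, beta=1.0, seed=0):
    """Sample undirected edges (i, j), i < j, with distance-based probability."""
    rng = random.Random(seed)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            d = ring_distance(i, j, n)
            if rng.random() < dp_edge_probability(p, d, alpha, beta):
                edges.append((i, j))
    return edges


sample_edges = dp_graph(n=12, p=0.5, seed=1)
```

With P = 0.5 and alpha = beta = 1, neighbors at distance 1 are connected with probability 0.5 and vertices at distance 3 with probability 0.125, which is what produces locally dense, globally sparse graphs.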

#### 3.2 Transformations

Transformations are operations applied after the construction of the DAG; their main objective is to create a reasonable architecture. We first



Fig. 3. Network Generators – Four examples of different random graph generators. Note that only (d) produces a good concurrent balanced graph.



Fig. 4. Building Blocks – Building blocks used for conversion from DAG to DNN.

| Dataset              | Baseline | DNNs with Uniform Channels |
|----------------------|----------|----------------------------|
| Cifar-10 (32×32)     | 80.70    | 81.13                      |
| Flower-102 (224×224) | 87.80    | 74.73 (Fails to Scale!)    |

Table 1. Accuracy of Uniform Channels – Mean accuracy of sampled architectures with uniform channels vs. handcrafted baselines without any advanced optimizations (the baselines for Cifar-10 and Flower-102 are vanilla CifarNet and ResNet-50, respectively).

introduce the building blocks, which include a scaling building block that, contrary to previous work, does not enforce a single-chain dependency.

**Building Block:** During the process of transforming a DAG to a DNN, vertices are interpreted as basic building blocks, as shown in Figure 4. Inside a basic building block, Sigmoid activations are applied on the inputs, and the activations are then combined with a learnable weighted sum. The Sigmoid function is used to avoid overflow of the weighted sum. As described before, the **conv** block consists of a ReLU, $3\times3$ separable convolution, and batch normalization.

Redefining Staging: Staging is deemed to be necessary for all NAS generated architectures to reduce the computation and facilitate learning. For staging, after a few layers, usually, the common method is to gather and merge outputs from all transformation vertices, conduct downsampling, and channel upsampling. However, such staging points create a rigid architecture with single-chain dependencies that are hard to distribute and execute concurrently (e.g., [48]). To address the single-chain bottleneck problem caused by staging, the first solution is implementing a uniform channel size for the entire architecture. In other words, all conv blocks share the same filter size. Thus, there would be no need to merge and synchronize at a point during an inference. However, as shown in Table 1, the uniform channel size approach works well on a small image dataset (e.g., Cifar-10), but it fails to achieve good accuracy on a dataset with larger image dimension (e.g., Flower-102).

In this paper, we propose individual staging after any conv block. Because of that, inputs to a conv block could have different dimensions. To tackle this problem, we dynamically add a new scaling block in the process of construction. The scaling block consists of a number of maxpooling layers. The maxpooling layers downsample the dimensions to match the smallest dimension among the inputs.

| Staging/Samples | A     | B     | C     | Overall Mean |
|-----------------|-------|-------|-------|--------------|
| Greedy          | 82.30 | 81.32 | 82.42 | 82.01        |
| Probabilistic   | 82.42 | 86.69 | 84.62 | 84.58        |

**Table 2. Average Accuracy** – Comparison of randomly sampled group of generated architectures with different staging choices (trained on Flower-102).

| Staging/Samples | A    | B    | C    | Overall Mean |
|-----------------|------|------|------|--------------|
| Greedy          | 2.31 | 2.27 | 2.63 | 2.40         |
| Probabilistic   | 3.00 | 3.28 | 3.58 | 3.29         |

**Table 3. Average Accuracy/Parameters Ratio** – Comparison of randomly sampled generated architectures with different staging choices (trained on Flower-102).

We also use  $1\times1$  convolution layers to upsample the channel size to match the highest channel size in the inputs in these scaling blocks. Therefore, we avoid bottlenecks in generated architecture.
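The behavior of the scaling block can be sketched as a small planning routine. This is an illustrative assumption, not our implementation: we assume square feature maps and that each maxpooling layer reduces the spatial size by a fixed factor (`pool_factor`, hypothetically 2); the function name `scaling_plan` and the dictionary keys are also hypothetical.

```python
def scaling_plan(inputs, pool_factor=2):
    """Plan the scaling block for the inputs of one conv block.

    inputs: list of (channels, spatial_size) tuples, one per incoming edge.
    Returns (target_channels, target_size, per-input plan).
    """
    target_size = min(size for _, size in inputs)      # smallest dimension
    target_channels = max(ch for ch, _ in inputs)      # highest channel size
    plan = []
    for channels, size in inputs:
        pools = 0
        while size > target_size:
            size //= pool_factor                       # one maxpooling layer
            pools += 1
        if size != target_size:
            raise ValueError("sizes are not related by the pooling factor")
        plan.append({
            "maxpools": pools,
            # A 1x1 convolution is needed only when channels must be upsampled.
            "conv1x1": channels != target_channels,
        })
    return target_channels, target_size, plan


channels_out, size_out, steps = scaling_plan([(32, 56), (64, 28), (128, 28)])
```

In this example, the first input is pooled once and expanded from 32 to 128 channels, the second is only expanded, and the third passes through unchanged.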

We adopted two design choices for the staging mechanism. In the first design, greedy-based staging, we set an upper limit for the channel size within the construction process. As long as the channel size has not reached the upper bound, we conduct staging (i.e., downsample the input and upsample the channels). However, this design raises the issue that intermediate outputs are quickly squeezed through the maxpooling layers, which discards important features and hurts the accuracy to some extent. In the second design, probabilistic-based staging, even when the channel size has not reached the limit, staging is done only with a fixed probability of 0.5 to avoid discarding features too quickly. As shown in Tables 2 and 3, the probabilistic approach achieves better accuracy than the greedy-based approach. In addition, Table 3 shows that probabilistic staging supports higher accuracy with a smaller parameter size because (i) probabilistic staging gracefully discards features, so the architecture learns better; and (ii) the aggressive greedy-based staging creates more size mismatches, so it requires more scaling blocks.
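The difference between the two staging designs can be illustrated with a minimal Python sketch. The function names, the doubling of the channel size at each staging step, and the example limits are assumptions made for illustration; only the upper bound on the channel size and the fixed probability of 0.5 come from the description above.

```python
import random


def should_stage(channels, channel_limit, mode, rng, p_stage=0.5):
    """Decide whether to stage (downsample input, upsample channels)."""
    if channels >= channel_limit:
        return False                      # upper bound reached: never stage
    if mode == "greedy":
        return True                       # stage whenever it is allowed
    if mode == "probabilistic":
        return rng.random() < p_stage     # stage with a fixed probability
    raise ValueError(f"unknown staging mode: {mode}")


def channel_trace(depth, start, limit, mode, seed=0):
    """Channel size after each of `depth` consecutive conv blocks."""
    rng = random.Random(seed)
    channels, trace = start, []
    for _ in range(depth):
        if should_stage(channels, limit, mode, rng):
            channels *= 2                 # assumption: staging doubles channels
        trace.append(channels)
    return trace


greedy = channel_trace(5, start=32, limit=256, mode="greedy")
probabilistic = channel_trace(5, start=32, limit=256, mode="probabilistic")
```

Along a path of five blocks, the greedy trace reaches the limit after three blocks, whereas the probabilistic trace reaches it no earlier, so features are downsampled more gradually.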

#### 3.3 Concurrency & Distribution

Our goal in this paper is to inspire concurrent architecture designs that improve inference latency. As a result, besides the common accuracy consideration, we need to study the concurrency and distribution opportunities of a candidate architecture. To help the community extend our study, instead of showcasing a single architecture, we are interested in finding a customized concurrency score (CS) for a given architecture, $\mathcal{N}$, that is easily calculated. In this way, we can study various architectures, and future work can further improve on this work. CS shows how optimal the concurrent and distributed task assignment for an architecture is. A lower CS represents fewer communications, better load-balanced tasks, and more distribution opportunities with more overlapped computation, so the architecture is more efficient for concurrency.

**Metrics in The Score:** We can formulate our problem of allocating tasks on n units as a multi-constraint problem. The first constraint is that all units



Fig. 5. Overlapped Computation Metric – Illustration of $\eta$.

should perform the same amount of work, or be load balanced. Second, the communication amount, the main bottleneck in distribution, should be at a minimum. Third, we want to minimize runtime by increasing overlapped computations among the units. The first two constraints are addressable by finding a set of hypergraph partitions, in which we divide the vertices into equally weighted sets so that few hyper-edges cross between partitions. The derivable metrics are the amount of variability in loads ($\delta_W$) and the total communication ($\Lambda$). The third constraint is measurable by finding the longest path between the input and output vertices on the DAG and quantifying concurrency ($\eta$). For instance, in pipeline parallelism, the longest path is the entire architecture; as a result, the latency is never reduced (only the throughput is increased). Now, we provide the formal definition of these solutions by first studying the DAG.

Maximizing Overlapped Computations: We measure how overlapped the inter-layer computations of an architecture are from its DAG, as a ratio $\eta$. We measure this by observing the longest of the distinct paths between the input and output vertices in the DAG, $\mathcal{G}$, relative to the number of computation cores, n. Assume $\{d_i\}$ is the set of lengths of the distinct paths between the input and output vertices in $\mathcal{G}$. We define $\eta$ as

$$\eta = \frac{\max\{d_i\}}{|\mathcal{V}|/n},\tag{2}$$

in which $|\mathcal{V}|$ is the total number of vertices. Figure 5 depicts an example of $\eta$. A higher $\eta$ value shows a more limited opportunity to overlap the computation. Figure 5 also shows the width of the overlapped computation at the same depth (*i.e.*, DFS depth from the input source), which is a good representation of why some architectures are more efficient for concurrency.
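Equation (2) can be evaluated directly on the edge list of a DAG, as in the minimal Python sketch below. Two choices here are assumptions: the path length is counted in vertices (including the input and output vertices), and $|\mathcal{V}|$ counts every vertex that appears in the edge list. The function names are hypothetical.

```python
def longest_path_vertices(edges, source, sink):
    """Number of vertices on the longest source-to-sink path of a DAG."""
    succ, indeg = {}, {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        succ.setdefault(v, [])
        indeg[v] = indeg.get(v, 0) + 1
        indeg.setdefault(u, 0)

    # Process vertices in topological order (Kahn's algorithm) and keep the
    # longest path length that reaches each vertex from the source.
    ready = [v for v, d in indeg.items() if d == 0]
    best = {source: 1}
    while ready:
        u = ready.pop()
        for v in succ[u]:
            if u in best:
                best[v] = max(best.get(v, 0), best[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return best[sink]


def eta(edges, source, sink, n_units):
    """Equation (2): eta = max{d_i} / (|V| / n)."""
    vertices = {v for edge in edges for v in edge}
    return longest_path_vertices(edges, source, sink) / (len(vertices) / n_units)


chain = [("in", "a"), ("a", "b"), ("b", "c"), ("c", "out")]
parallel = [("in", "a"), ("in", "b"), ("in", "c"),
            ("a", "out"), ("b", "out"), ("c", "out")]
eta_chain = eta(chain, "in", "out", n_units=2)
eta_parallel = eta(parallel, "in", "out", n_units=2)
```

Both graphs have five vertices, but the single chain yields $\eta = 2.0$ while the three parallel branches yield $\eta = 1.2$, reflecting the larger opportunity to overlap computation.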

Hypergraph Representation: Using graph representations in task assignment for distributed computing is a well-known problem [18]. Basically, in the generated DAG, vertices of the graph represent the units of computations, and edges encode data dependencies. We can indicate the amount of work and/or data, by associating weights (w) and costs  $(\lambda)$  to vertices and edges, respectively. However, a DAG representation does not sufficiently capture the communication overhead, load balancing factor, and the fact that some edges are



Fig. 6. Calculating Concurrency Score – Summarizing steps for deriving the score.

basically sending the same data/features. Therefore, for task assignment, we use an alternative graph representation, derivable from the DAG, hypergraph. A hypergraph [4] is a generalization of a graph, in which an edge can join any number of vertices [46]. The hypergraph representation, common in optimization for integrated circuits [27], enables us to consider the mentioned factors.

Formal Definition of Hypergraph: A hypergraph $\mathcal{H} = (\mathcal{V}, \mathcal{E})$ is defined as a set of vertices $\mathcal{V}$ and a set of hyper-edges $\mathcal{E}$ selected among those vertices. Every hyper-edge $e_j \in \mathcal{E}$ is a subset of vertices, or $e_j \subseteq \mathcal{V}$. The size of a hyper-edge is equal to the number of vertices it contains.

**Hypergraph Partitioning:** We assign weights $(w_i)$ and costs $(\lambda_j)$ to the vertices $(v_i \in \mathcal{V})$ and edges $(e_j \in \mathcal{E})$ of the hypergraph, respectively. $\mathcal{P} = \{V_1, V_2, V_3, ..., V_P\}$ is a P-way partition of $\mathcal{H}$ if (i) $\forall V_i, \emptyset \neq V_i \subset \mathcal{V}$, (ii) parts are pairwise disjoint, and (iii) $\bigcup \mathcal{P} = \mathcal{V}$. A partition is balanced if $W_p \leq \varepsilon W_{\text{avg}}$ for $1 \leq p \leq P$, where $W_p = \sum_{v_i \in V_p} w_i$ is the weight of part $V_p$, $W_{\text{avg}} = \sum_{v_i \in \mathcal{V}} w_i/P$ denotes the average part weight, and $\varepsilon$ represents the imbalance ratio, or $\delta_W$.

In a partition $\mathcal{P}$ of $\mathcal{H}$, a hyper-edge that has at least one vertex in a part is said to connect that part. The number of connections $\gamma_j$ of a hyper-edge $e_j$ denotes the number of parts connected by $e_j$. A hyper-edge is a cut if $\gamma_j > 1$. We define such hyper-edges as external hyper-edges $\mathcal{E}_E$. The total communication for $\mathcal{P}$ is

$$\Lambda = \sum_{e_j \in \mathcal{E}_E} \lambda_j (\gamma_j - 1). \tag{3}$$

Therefore, our two constraints can be defined as a hypergraph partitioning problem, in which we divide a hypergraph into two or more parts such that the total communication is minimized, while a given balance criterion among the part weights is maintained. We can solve this NP-hard [27] problem relatively fast with multiparadigm algorithms, such as hMETIS [21]. Note that solving this problem is a pre-processing step, which does not affect runtime.
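For a given partition, the two hypergraph metrics can be computed as in the minimal Python sketch below. It only evaluates a partition and does not search for one (that is the role of a partitioner such as hMETIS). The function name, the argument layout, and the choice to report the imbalance as the smallest $\varepsilon$ that satisfies the balance condition, $\max_p W_p / W_{\text{avg}}$, are assumptions for illustration.

```python
def partition_metrics(weights, hyperedges, costs, part_of, n_parts):
    """Return (imbalance, total communication) of a P-way partition.

    weights:    dict vertex -> weight w_i
    hyperedges: list of vertex sets e_j
    costs:      list of costs lambda_j, aligned with hyperedges
    part_of:    dict vertex -> part index in 0..n_parts-1
    """
    # Load of each part, W_p, and the average part weight, W_avg.
    loads = [0.0] * n_parts
    for v, w in weights.items():
        loads[part_of[v]] += w
    w_avg = sum(loads) / n_parts
    imbalance = max(loads) / w_avg        # smallest epsilon with W_p <= eps*W_avg

    # Equation (3): sum of lambda_j * (gamma_j - 1) over cut hyper-edges.
    comm = 0.0
    for edge, cost in zip(hyperedges, costs):
        gamma = len({part_of[v] for v in edge})   # number of parts connected
        if gamma > 1:
            comm += cost * (gamma - 1)
    return imbalance, comm


w = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0}
edges_h = [{"a", "b", "c"}, {"c", "d"}]
lam = [2.0, 3.0]
good = partition_metrics(w, edges_h, lam, {"a": 0, "b": 0, "c": 1, "d": 1}, 2)
bad = partition_metrics(w, edges_h, lam, {"a": 0, "b": 1, "c": 1, "d": 0}, 2)
```

Both partitions are perfectly balanced, but the first cuts only one hyper-edge (communication 2.0), whereas the second cuts both (communication 5.0).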

Concurrency Score: Now, we have the tools to calculate the concurrency score, CS. Figure 6 summarizes all the steps to derive our metrics: load variability, $\delta_w$; total amount of communication, $\Lambda$; and overlapped computations, $\eta$. The hypergraph partitioning algorithm accepts the number of units and an upper bound on $\varepsilon$. By changing $\varepsilon$, we create a set of partitioning options, for each of which we compute all the metrics. Note that the DAG input requires a weight and cost value for

every vertex and edge, respectively. Both of these values are easily derivable. The weight of a vertex is directly proportional to its floating-point operations (FLOPs), reported by most frameworks. The cost of an edge is directly proportional to the transferred data size. To get CS, first, we need to normalize the communication metric. We write $\Lambda$ as $\Lambda' = \Lambda/(U_c \times n)$, in which $U_c$ is a unit of data and n is the number of units. We define

$$CS = \sqrt[1/3]{\delta_w^a \Lambda'^b \eta^c}, \tag{4}$$

as a custom concurrency score, in which a, b, and c are constants that show the relative importance of each metric for a user. In this paper, we assume a = c = 1 and b = 1.5, to give a higher priority to communication. We chose $U_c$ as the smallest amount of communication for an edge in a generator. Hence, a higher CS value shows poor distribution and concurrency opportunities.
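Given the three metrics, the score can be computed as in the minimal Python sketch below. One point is an assumption: the radical in Equation (4) is read here as a cube root. Because both the cube root and the cube are increasing functions for positive values, this reading does not change the ranking of architectures. The function and argument names are hypothetical.

```python
def concurrency_score(delta_w, comm, eta, unit_cost, n_units,
                      a=1.0, b=1.5, c=1.0):
    """Concurrency score from load variability, communication, and eta.

    comm is the total communication (Lambda); it is normalized as
    Lambda' = Lambda / (U_c * n) before entering the score.
    """
    comm_norm = comm / (unit_cost * n_units)
    product = (delta_w ** a) * (comm_norm ** b) * (eta ** c)
    # Assumption: the radical in Equation (4) is interpreted as a cube root.
    return product ** (1.0 / 3.0)


low = concurrency_score(delta_w=1.0, comm=8.0, eta=1.0, unit_cost=1.0, n_units=2)
high = concurrency_score(delta_w=1.0, comm=16.0, eta=1.0, unit_cost=1.0, n_units=2)
```

With b = 1.5, doubling the communication raises the score more than doubling the load variability or $\eta$ would, which matches the higher priority given to communication.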

# 4 Experimental Analysis

In this section, we evaluate our generated architectures by comparing our customized generator and transformation process with prior work. The results demonstrate that our generated architectures preserve accuracy while achieving better concurrency scores by removing the implicit bias of single-chain dependency. In addition, by running the final architecture on actual devices, we show that the concurrency score provides a reasonable heuristic for the real performance.

#### 4.1 Experimental Setup

Generators: All generators use probabilistic scaling blocks. FB represents prior work in unbiased NAS with staging blocks [48]. As mentioned in §3.1, although the ER, BA, and WS generators are based on [48], we remove the staging block that causes the limited concurrency. As a result, all the studied network generators and resulting architectures are novel and have never been studied before.

Randomization: To evaluate the accuracy of randomly generated architectures, we collect representative samples with *no optimized search*. We followed the same training procedure for all architectures and reported the average accuracy. For CS, total communication, and computation time evaluations, we collect 1,000 samples with no optimized search and compare across different generators.

Datasets: We conducted experiments on multiple datasets to ensure the extensibility of concurrent architectures. We use two image classification datasets: (i) Cifar-10 [25], which contains 60K 32×32 images in 10 classes; and (ii) Flower-102 [34], which contains 16K 224×224 images in 102 classes. We strongly encourage future extensive studies on larger datasets, but given the heavy compute demand of NAS-based experiments, we chose to use representative datasets studied in most of the prior works [47].

**Training Procedure:** We use a uniform training pipeline with a stochastic gradient descent optimizer for all architectures. We train on Cifar-10 for 100 epochs and on Flower-102 for 300 epochs. We report the top-1 classification



Fig. 7. Total Communication with Distribution – Measured communication in MB for 1000 sampled architectures in each category for 40 vertices on {4,6,8,10} units.



Fig. 8. Inference Time – Inference time normalized to FB (§4.1) for 1000 sampled architectures in each category for 40 vertices on {4,6,8,10} units.

accuracy on the test sets. For the first 100 epochs, we set the learning rate to be 1e-3 and momentum to be 0.9. We changed the learning rate to 5e-4 and momentum to 0.95 for the remaining 200 epochs on Flower-102.

**Implementation:** We implemented all graph representations in Python NetworkX [15] library. Then, we convert a graph to a PyTorch [35] compatible model. We constructed a graph-based forwarding path in PyTorch module class to directly reproduce the graph structure.

#### 4.2 Experiments

We analyze the results from three perspectives: communication, latency, and concurrency score. Because we are interested in finding a general solution, we start with the architecture stability evaluation, which particularly focuses on the architecture parameter size. Then, we show that the generated architectures achieve competitive accuracies, while, in the last part, we illustrate the high concurrency and distribution opportunities of these architectures.

#### **Architecture Stability:**

For the architecture stability experiment, we used a fixed number of 40 building blocks. We created 1,000 samples from each network generator. We recorded the mean and standard deviation of the parameter sizes. We also evaluate the architecture stability under different staging design choices (greedy vs. probabilistic). From Table 4, we see that the proposed generators with greedy scaling blocks create larger but more stable architectures than with probabilistic scaling blocks. Additionally, we see that our proposed DP generator creates the most



Fig. 9. Concurrency Scores – Measured CS for 1000 sampled architectures in each category with {40,80} vertices on {4,6,8,10} units (§4.1).

efficient architecture. We will see that architectures that use DP generators are generally the most optimized.

#### **Accuracy Study:**

Here, we demonstrate that the concurrent architectures achieve competitive accuracy on both the Cifar-10 and Flower-102 datasets. Given the heavy compute demand of NAS-based experiments, we encourage further studies on larger datasets. We used the same architecture samples as before without any optimized search and report both mean and best results. As shown in Tables 5 and 6, our concurrent architectures achieve comparable accuracy on both datasets. The generated DNNs achieve better or similar accuracy on Cifar-10. For Flower-102, because both the network generation and transformation processes have more randomness, the mean accuracy has a small gap compared to the baseline. However, the best accuracy is close to the baseline, so we believe the accuracy gap can be closed by conducting a search optimized for accuracy.

| Staging       | Statistic | ER    | BA    | WS    | DP    |
|---------------|-----------|-------|-------|-------|-------|
| Greedy        | Mean      | 48.63 | 48.33 | 42.03 | 35.03 |
| Greedy        | Std       | 1.11  | 0.91  | 1.28  | 2.25  |
| Probabilistic | Mean      | 46.03 | 45.63 | 36.44 | 26.69 |
| Probabilistic | Std       | 2.70  | 4.41  | 3.52  | 3.05  |

**Table 4. Parameter Size Stability** – The mean and standard deviation of parameter size in sampled generated architectures with different staging.

| Architecture | Mean Acc. | Best Acc. | Mean Acc./Param. | Best Acc./Param. |
|--------------|-----------|-----------|------------------|------------------|
| CifarNet     | 80.70     | 80.70     | 5.38             | 5.38             |
| ER           | 81.33     | 81.81     | 4.94             | 5.03             |
| BA           | 80.29     | 81.66     | 4.81             | 4.92             |
| WS           | 79.89     | 81.45     | 4.75             | 4.84             |
| DP           | 80.87     | 82.47     | 4.81             | 4.90             |

**Table 5. Concurrent Architectures on Cifar-10** – Overall sampled metrics.

| Architecture | Mean Acc. | Best Acc. | Mean Acc./Param. | Best Acc./Param. |
|--------------|-----------|-----------|------------------|------------------|
| ResNet-50    | 87.80     | 87.80     | 3.43             | 3.43             |
| ER           | 84.88     | 86.20     | 2.11             | 2.43             |
| BA           | 82.91     | 84.62     | 2.41             | 2.91             |
| WS           | 81.46     | 86.57     | 3.17             | 3.10             |
| DP           | 84.66     | 86.69     | 3.19             | 3.28             |

**Table 6. Concurrent Architectures on Flower-102** – Overall sampled metrics.

#### **Concurrency Study:**

Finally, to show the improved distribution and concurrency opportunities, we compare our architectures with ResNet-50 and FB (§4.1) by plotting width/depth histograms in Figure 10. As shown, our architectures achieve a higher width/depth, which enables more concurrency, while providing a lower maximum depth, which enables shorter execution time. To quantitatively compare the generators and FB, Figure 9 depicts concurrency scores, summarized over 1000 architectures in each category per set. As seen, our generators (and specifically DP) consistently achieve the best score. Moreover, to gain more insight, Figures 7 and 8 illustrate the total communication with distribution and the inference (i.e., computation) time when each architecture is deployed on $|\mathcal{P}|$ units. We see that although the ER and BA methods deliver a better computation speedup, they suffer a larger slowdown from data communication. For our new generator, DP, we see a 6–7x speedup in inference time. We observe a close relationship between the reported score and the actual latency and communication. In fact, latency and communication measure performance in orthogonal ways, but the CS captures the overall efficiency of the generated architecture well and could be used in future studies.
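The width and depth of an architecture graph, which the concurrency study compares, can be computed with a single topological pass. The sketch below assumes one common convention that this section does not spell out: a node's level is one more than the deepest of its predecessors, depth is the number of levels, and width is the largest number of nodes sharing a level. The function name `width_and_depth` and the dictionary-of-predecessors input are also illustrative assumptions.

```python
from collections import Counter
from graphlib import TopologicalSorter
from typing import Dict, List, Tuple


def width_and_depth(preds: Dict[str, List[str]]) -> Tuple[int, int]:
    """Return (width, depth) of a DAG given as node -> predecessors.

    Assumed definitions: level(node) = 1 + max level of its predecessors
    (1 for sources); depth = number of levels; width = maximum number of
    nodes on any single level.
    """
    level: Dict[str, int] = {}
    for node in TopologicalSorter(preds).static_order():
        parents = preds.get(node, ())
        level[node] = 1 + max((level[p] for p in parents), default=0)
    depth = max(level.values(), default=0)
    width = max(Counter(level.values()).values(), default=0)
    return width, depth


if __name__ == "__main__":
    # A single-chain network: every node waits for the previous one.
    chain = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}
    # A concurrent network: three independent branches between s and t.
    wide = {"s": [], "a": ["s"], "b": ["s"], "c": ["s"], "t": ["a", "b", "c"]}
    print("chain:", width_and_depth(chain))
    print("wide: ", width_and_depth(wide))
```

Under these definitions the single-chain graph has width 1, so no two nodes can execute at the same time, while the branched graph exposes three nodes that can be placed on different units.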

# 5 Conclusion

Fig. 10. Width/Depth Histograms – Illustration of ResNet-50, FB, and concurrent architectures, which enable more concurrency and shorter inference latency.

In this work, we proposed concurrent architectures that break the single chain of dependencies, a common bias in modern architecture designs. We showed that these architectures are concurrent and have more distribution opportunities for reducing the inference time while achieving competitive accuracy. Since we discovered that previous NAS studies were implicitly biased toward creating sequential models, we introduced a new generator that naturally creates concurrent architectures. To quantitatively compare concurrent architectures, we proposed the concurrency score, which encapsulates critical metrics in distribution.

#### References

- 1. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Reviews of modern physics **74**(1), 47 (2002)
- 2. Anwar, S., Hwang, K., Sung, W.: Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC) **13**(3), 32 (2017)
- 3. Baker, B., Gupta, O., Naik, N., Raskar, R.: Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167 (2016)
- 4. Catalyurek, U.V., Aykanat, C.: Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions on parallel and distributed systems **10**(7), 673–693 (1999)
- 5. Chollet, F.: Xception: Deep learning with depthwise separable convolutions. arXiv preprint (2016)
- 6. Courbariaux, M., Bengio, Y., David, J.P.: Training deep neural networks with low precision multiplication. arXiv preprint arXiv:1412.7024 (2014)
- 7. Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830 (2016)
- 8. Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q.V., et al.: Large scale distributed deep networks. In: NIPS'12. pp. 1223–1231. ACM (2012)
- 9. Erdős, P., Rényi, A.: On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci **5**(1), 17–60 (1960)
- 10. Gong, Y., Liu, L., Yang, M., Bourdev, L.: Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115 (2014)
- 11. Hadidi, R., Cao, J., Ryoo, M.S., Kim, H.: Distributed perception by collaborative robots. IEEE Robotics and Automation Letters (RA-L), Invited to IEEE/RSJ International Conference on Intelligent Robots and Systems 2018 (IROS) **3**(4), 3709–3716 (Oct 2018). https://doi.org/10.1109/LRA.2018.2856261
- 12. Hadidi, R., Cao, J., Ryoo, M.S., Kim, H.: Towards collaborative inferencing of deep neural networks on internet of things devices. IEEE Internet of Things Journal (2020)
- 13. Hadidi, R., Cao, J., Woodward, M., Ryoo, M.S., Kim, H.: Musical chair: Efficient real-time recognition using collaborative iot devices. arXiv preprint arXiv:1802.02138 (2018)
- 14. Hadidi, R., Cao, J., Xie, Y., Asgari, B., Krishna, T., Kim, H.: Characterizing the deployment of deep neural networks on commercial edge devices. In: 2019 IEEE International Symposium on Workload Characterization (IISWC). pp. 35–48. IEEE (2019)
- 15. Hagberg, A., Swart, P., S Chult, D.: Exploring network structure, dynamics, and function using networkx. Tech. rep., Los Alamos National Lab.(LANL), Los Alamos, NM (United States) (2008)

- 16. Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In: 4th International Conference on Learning Representations. ACM (2016)
- 17. Hazelwood, K., Bird, S., Brooks, D., Chintala, S., Diril, U., Dzhulgakov, D., Fawzy, M., Jia, B., Jia, Y., Kalro, A., et al.: Applied machine learning at facebook: A datacenter infrastructure perspective. In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). pp. 620–629. IEEE (2018)
- 18. Hendrickson, B., Kolda, T.G.: Graph partitioning models for parallel computing. Parallel computing **26**(12), 1519–1534 (2000)
- 19. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML'15. pp. 448–456. ACM (2015)
- 20. Kang, Y., Hauswald, J., Gao, C., Rovinski, A., Mudge, T., Mars, J., Tang, L.: Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. In: 22nd ACM International Conference on Architectural Support for Programming Languages and Operating Systems. pp. 615–629. ACM (2017)
- 21. Karypis, G., Aggarwal, R., Kumar, V., Shekhar, S.: Multilevel hypergraph partitioning: applications in vlsi domain. IEEE Transactions on Very Large Scale Integration (VLSI) Systems **7**(1), 69–79 (1999)
- 22. Kim, J., Park, Y., Kim, G., Hwang, S.J.: Splitnet: Learning to semantically split deep networks for parameter reduction and model parallelization. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. pp. 1866–1874. JMLR. org (2017)
- 23. Kleinberg, J.: The small-world phenomenon: An algorithmic perspective. Tech. rep., Cornell University (1999)
- 24. Köster, U., Webb, T., Wang, X., Nassar, M., Bansal, A.K., Constable, W., Elibol, O., Gray, S., Hall, S., Hornof, L., et al.: Flexpoint: An adaptive numerical format for efficient training of deep neural networks. In: Advances in Neural Information Processing Systems (NIPS). pp. 1742–1752 (2017)
- 25. Krizhevsky, A., Nair, V., Hinton, G.: Cifar-10 (canadian institute for advanced research)
- 26. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: 26th Annual Conference on Neural Information Processing Systems (NIPS). pp. 1097–1105. ACM (2012)
- 27. Lengauer, T.: Combinatorial algorithms for integrated circuit layout. Springer Science & Business Media (2012)
- 28. Li, F., Zhang, B., Liu, B.: Ternary weight networks. arXiv preprint arXiv:1605.04711 (2016)
- 29. Lin, D., Talathi, S., Annapureddy, S.: Fixed point quantization of deep convolutional networks. In: International Conference on Machine Learning. pp. 2849–2858 (2016)
- 30. Lin, J., Rao, Y., Lu, J., Zhou, J.: Runtime neural pruning. In: Advances in Neural Information Processing Systems (NIPS). pp. 2181–2191 (2017)
- 31. Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L.J., Fei-Fei, L., Yuille, A., Huang, J., Murphy, K.: Progressive neural architecture search. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 19–34 (2018)
- 32. Mao, J., Chen, X., Nixon, K.W., Krieger, C., Chen, Y.: Modnn: Local distributed mobile computing system for deep neural network. In: 2017 Design, Automation and Test in Europe (DATE). pp. 1396–1401. IEEE (2017)
- 33. Newman, M.E., Watts, D.J.: Renormalization group analysis of the small-world network model. Physics Letters A **263**(4-6), 341–346 (1999)

- 34. Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Proc. of ICVGIP (2008)
- 35. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch (2017), https://pytorch.org
- 36. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: Xnor-net: Imagenet classification using binary convolutional neural networks. In: ECCV'16. pp. 525–542. Springer (2016)
- 37. Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 4780–4789 (2019)
- 38. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 779–788 (2016)
- 39. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: 3rd International Conference on Learning Representations. ACM (2015)
- 40. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, inception-resnet and the impact of residual connections on learning. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
- 41. Tan, M., Chen, B., Pang, R., Vasudevan, V., Le, Q.V.: Mnasnet: Platform-Aware Neural Architecture Search for Mobile. arXiv preprint arXiv:1807.11626 (2018)
- 42. Teerapittayanon, S., McDanel, B., Kung, H.: Distributed deep neural networks over the cloud, the edge and end devices. In: 37th IEEE International Conference on Distributed Computing Systems (ICDCS). pp. 328–339. IEEE (2017)
- 43. Vanhoucke, V., Senior, A., Mao, M.Z.: Improving the speed of neural networks on cpus. In: Proceeding Deep Learning and Unsupervised Feature Learning NIPS Workshop. vol. 1, p. 4. ACM (2011)
- 44. Watts, D.J.: Networks, dynamics, and the small-world phenomenon. American Journal of sociology **105**(2), 493–527 (1999)
- 45. Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: Advances in neural information processing systems. pp. 2074–2082 (2016)
- 46. Wikipedia: Hypergraph. https://en.wikipedia.org/wiki/Hypergraph (2019), [Online; accessed 12/11/19]
- 47. Wistuba, M., Rawat, A., Pedapati, T.: A survey on neural architecture search. arXiv preprint arXiv:1905.01392 (2019)
- 48. Xie, S., Kirillov, A., Girshick, R., He, K.: Exploring randomly wired neural networks for image recognition. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1284–1293 (2019)
- 49. Yu, J., Lukefahr, A., Palframan, D., Dasika, G., Das, R., Mahlke, S.: Scalpel: Customizing dnn pruning to the underlying hardware parallelism. In: 44th International Symposium on Computer Architecture (ISCA). pp. 548–560. IEEE (2017)
- 50. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning (2016)
- 51. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8697–8710 (2018)

# 6 Appendix



Fig. 11. Random Neural Network Distribution – Five examples of raw randomly generated neural networks and their distributions on two, four, and eight units.

#### 6.1 Distribution

To distribute the generated networks according to the number of units, we first group nodes on the same sequential path together to minimize the communication overhead. The detailed grouping algorithm can be found in ??. After the nodes in the graph are grouped, we use a heuristic-based greedy algorithm ?? to distribute all nodes to units. The objective of the algorithm is to balance the workload. To keep load balancing simple, we assume the final goal is that each unit performs a similar amount of computation. Ultimately, this process can be improved using various other techniques that are currently out of the scope of this paper. Here, we provide an example of our process, from network generation to workload distribution.
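Since the referenced distribution algorithm is not reproduced in this section, the following sketch substitutes a standard greedy heuristic for balancing workload: sort the groups by computation cost in decreasing order and always give the next group to the currently least-loaded unit (the longest-processing-time rule). It is a stand-in for illustration, not the paper's algorithm; the function name `greedy_distribute`, the group names, and the cost values are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def greedy_distribute(
    group_costs: Dict[str, float], num_units: int
) -> Tuple[Dict[str, int], List[float]]:
    """Assign node groups to units so that per-unit loads stay similar.

    Assumed heuristic: process groups from most to least expensive and
    place each one on the unit with the smallest accumulated load.
    Returns (group -> unit index, list of per-unit loads).
    """
    if num_units < 1:
        raise ValueError("num_units must be at least 1")
    # Min-heap of (accumulated load, unit index).
    heap: List[Tuple[float, int]] = [(0.0, u) for u in range(num_units)]
    heapq.heapify(heap)
    assignment: Dict[str, int] = {}
    ordered = sorted(group_costs.items(), key=lambda kv: kv[1], reverse=True)
    for group, cost in ordered:
        load, unit = heapq.heappop(heap)
        assignment[group] = unit
        heapq.heappush(heap, (load + cost, unit))
    loads = [0.0] * num_units
    for load, unit in heap:
        loads[unit] = load
    return assignment, loads


if __name__ == "__main__":
    # Hypothetical computation costs for five groups of nodes.
    costs = {"g0": 5, "g1": 4, "g2": 3, "g3": 3, "g4": 1}
    assignment, loads = greedy_distribute(costs, num_units=2)
    print(assignment)
    print(loads)
```

The unit with the largest entry in `loads` is the bottleneck, so the maximum load is a rough proxy for the computation part of the inference latency.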

**Network Generation** Figure 11 shows an example of a raw randomly generated neural network. This network is later fed into a grouping and distribution algorithm that decides which unit runs which nodes.

**Distribution to 2, 4, and 8 Units** Figure 11 shows the network distribution on 2, 4, and 8 units. The coloring marks the unit to which each node is assigned. Because all units need to run the computations of the first node, we leave it as a common node (this could be just a scatter operation). In addition, for the last node, an extra unit is needed to merge all results together, so we mark that unit as black (this could be just a gather operation).

Fig. 12. Load Balance Quality – Load balance quality on two, four, six, and eight units, measured as the normalized Shannon entropy value.

Fig. 13. Performance Scaling – The random neural network latency on two, four, and eight distribution units.

**Load Balancing** From the graphs, we observe that the current grouping and distribution algorithm balances the load well when the number of units is small. The quality of load balancing affects the final inference latency, because a bottleneck node, which arises when loads are unbalanced, can delay the final results. We also conduct a load balance quality study, shown in Figure 12. We use the normalized Shannon entropy value to indicate the load balancing quality (a higher value means a more balanced load, and 1 means the load is perfectly balanced across distribution units). In Figure 12, we show the median, 25th–75th percentile, and 1st–99th percentile load balancing qualities. We observe that as the number of distribution units increases, the overall load balancing quality degrades and the variation in quality increases. We aim to develop distribution algorithms with higher quality; however, our aim in this paper is to show that parallel inference computation of a single request is a viable option and should be studied more.
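The normalized Shannon entropy used as the load balance quality can be computed from the per-unit loads. The sketch below assumes the standard definition, which the text does not write out: each unit's share of the total load is treated as a probability, and the entropy is divided by its maximum value, the logarithm of the number of units. The function name `load_balance_quality` is illustrative.

```python
import math
from typing import Sequence


def load_balance_quality(loads: Sequence[float]) -> float:
    """Normalized Shannon entropy of per-unit loads, in [0, 1].

    Assumed definition: p_i = load_i / sum(loads), and
    quality = -sum(p_i * ln p_i) / ln(n), where n is the number of units.
    A value of 1 means every unit carries the same load.
    """
    n = len(loads)
    total = sum(loads)
    if n < 2:
        raise ValueError("at least two units are required")
    if total <= 0 or any(load < 0 for load in loads):
        raise ValueError("loads must be non-negative with a positive sum")
    # Units with zero load contribute nothing (the limit of p * ln p is 0).
    entropy = -sum(
        (load / total) * math.log(load / total) for load in loads if load > 0
    )
    return entropy / math.log(n)


if __name__ == "__main__":
    print(load_balance_quality([4, 4, 4, 4]))  # perfectly balanced
    print(load_balance_quality([10, 2, 2, 2]))  # one overloaded unit
```

A single overloaded unit lowers the value, which matches the bottleneck effect on latency described above.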

**Performance Scaling** As the final step, we also conduct a study on performance scaling. We use a total of 10 AWS t2.micro EC2 instances for the performance evaluation. Each instance is equipped with only 1 vCPU and 1 GB of memory. The specifications are chosen to emulate edge units with limited compute and memory that have a higher computational cost (recall that the constants in Equation 4 give higher priority to communication). As shown in Figure 13, the inference latency improves when the system has more distribution units. However, the latency stops decreasing when the number of distribution units reaches 8, because the workload is not well balanced across units, as shown in our load balancing study. In this example, the bottleneck unit causes longer latency for the entire system.