
Regular Articles

Vol. 24, No. 4, pp. 53–63, Apr. 2026. https://doi.org/10.53829/ntr202604ra1

Resource Allocation with Heterogeneous Resources and Parallelism in Disaggregated Computing

Narunori Ebara, Saki Hatta, Hikaru Uchidate,
Shoko Ohteru, Shuhei Yoshida, and Hiroyuki Uzawa

Abstract

Disaggregated computing improves resource utilization by pooling central processing units, memory, and accelerators and flexibly assigning heterogeneous resources to each service component. To maximize these benefits, resource allocation and routing must be decided efficiently before execution. This article introduces a practical-time resource allocation method that models heterogeneous resource characteristics and parallel processing effects. Simulations in heterogeneous disaggregated systems show that this method meets service-performance requirements while reducing required resources by 28–51% on average compared with conventional methods.

Keywords: disaggregated computing, resource allocation, hardware acceleration


1. Introduction

The pace of performance improvement in general-purpose central processing units (CPUs) has slowed in the post-Moore’s-law era, while the demand for computation has been increasing rapidly due to the spread of artificial intelligence (AI). To sustain performance growth with this gap, modern systems increasingly rely on specialized accelerators—most notably graphics processing units (GPUs) and field-programmable gate arrays (FPGAs)—that can execute particular workloads more efficiently than CPUs. At the same time, AI models continue to scale in size and complexity [1], and many practical deployments require multiple accelerators to satisfy throughput and latency targets. This creates a strong need for system architectures that can combine heterogeneous computational resources and use them efficiently.

To support heterogeneous and accelerator-centric workloads, datacenter operators have been increasingly interested in disaggregated computing*1, where CPUs, memory, and accelerators are pooled and flexibly composed through high-performance interconnects [2–5]. NTT’s Innovative Optical Wireless Network (IOWN)*2 initiative introduces Data-Centric Infrastructure (DCI)*3 as a future direction for such architectures [2, 3]. In this article, we specifically focus on resource allocation in disaggregated computing systems, which serves as a key technical foundation that can contribute to the actualization of DCI. Figure 1 shows the disaggregated-computing system model considered in this article; heterogeneous resources are grouped into resource pools and interconnected via internal and external transmission paths [2–4]. Resource pools contain resources of the same type (e.g., CPUs, GPUs, FPGAs), and the resources in each pool are connected through these transmission paths. Resource pools that are physically close to each other are connected via internal transmission paths, whereas those that are physically far from each other are connected via external transmission paths. We call a cluster of resource pools connected via internal transmission paths a block. This hierarchical organization reflects realistic datacenter deployments and is important because link characteristics (bandwidth, latency, and contention) differ between internal and external paths, directly affecting end-to-end service performance.


Fig. 1. Disaggregated computing system.

In the real world, services are provided across multiple domains (e.g., database queries, compression, encryption, video coding, signal processing, and conventional machine learning) [6]. In this context, the service is composed of an ordered chain of processing requests, each of which we call a virtual function request (VFR)*4. In disaggregated computing, the flexible utilization of multiple types of computational resources, such as CPUs, GPUs, and FPGAs, enables the assignment of each VFR to the most appropriate computational resource. This approach ideally allows for effective service execution across heterogeneous computational resources. Note that we define a virtual function (VF)*5 as a virtualized dedicated functional unit within a computational resource, which can execute the VFR. A key architectural issue in executing such chains is data movement. Conventional designs often rely on the CPU to manage data transfer between resources, which can create a communication bottleneck because traffic concentrates on the CPU and its associated interconnect. In contrast, recent approaches enable computational resources to transfer data autonomously and directly by bypassing the CPU in the data-transfer path [6–9]. Therefore, in disaggregated computing, it is desirable to efficiently execute services that conduct a processing chain by enabling direct communication between computational resources [10, 11].

In the context of executing processing chains in disaggregated computing, enabling users to freely deploy services introduces significant challenges in resource allocation. For example, users may unintentionally select computational resources far exceeding service requirements or may select resources that are poorly located from a topology perspective. This can lead to inefficiencies and spatial imbalances, where some resources become over-utilized while others remain under-utilized. Such imbalances degrade service performance and increase cost, especially when expensive accelerators are involved. To address these issues, it is essential to carefully allocate resources and determine routing to minimize the number of computational resources while meeting service-performance requirements.

Some methods [12, 13], many originating from cloud networking and virtual-network-function placement, account for inter-resource communication and optimize objectives, such as utilization [12] and power [13], under delay constraints. However, these methods are primarily designed for homogeneous resources (e.g., CPUs or servers). When applied to heterogeneous resources, such as GPUs, FPGAs, and other accelerators, they may allocate unsuitable resources because they do not explicitly model the performance characteristics of different resource types. This limitation can lead to suboptimal performance, for example, running AI inference on a general-purpose CPU instead of a specialized GPU. Thus, effective resource allocation in disaggregated computing must explicitly account for heterogeneous resource characteristics.

Current resource allocation methods [12, 13] handle serial processing and do not support parallel processing across multiple resources. Given the increasing prevalence of high-load services, such as AI in disaggregated computing [9], it is crucial to consider parallel processing because these services cannot achieve sufficient performance with a single computational resource [14, 15]. To support parallel processing, it is imperative that services be pre-defined as parallelizable data flows; however, the appropriate degree of parallelism depends on the service requirements and on current resource and path usage. If the parallelism is chosen poorly, this approach may waste resources or degrade performance per resource. Therefore, a resource allocation method accounting for the impact of parallel processing is essential to effectively use resources in disaggregated computing.

We thus previously proposed a resource allocation method for disaggregated computing that efficiently executes chained services among heterogeneous computational resources [16]. This method first integrates the characteristics of heterogeneous resources and parallel processing to maximize efficiency while deriving solutions within a practical time frame. Through simulations, we demonstrated that this method can allocate resources efficiently while satisfying service-performance requirements within a practical amount of time.

In Section 2, we introduce the details of our resource allocation method. Section 3 describes the evaluation setup, Section 4 reports the results, and Section 5 concludes with a summary.

*1 Disaggregated computing: A computing architecture in which CPUs, memory, accelerators, and other hardware resources are separated into pools and flexibly composed through high-performance interconnects.
*2 IOWN: A new communication infrastructure that can provide high-speed broadband communications and enormous computational resources by using innovative technologies including optical technologies.
*3 DCI: An architectural concept within IOWN that emphasizes efficient data processing by flexibly composing distributed and heterogeneous computing resources. This is a type of disaggregated computing.
*4 VFR: A processing request of a service composed of an ordered service chain. It is executed using a VF.
*5 VF: A virtualized functional unit implemented on a computational resource (e.g., CPU, GPU, or FPGA) that executes a corresponding VFR.

2. Resource allocation method

Our resource allocation method is a polynomial-time heuristic approach that integrates the impact of the characteristics of heterogeneous computational resources and that of parallel processing. The first impact is modeled by scoring VF parameters that adapt depending on the computational resources used, while the second impact is modeled by determining the appropriate parallelism through a comparison of the service's throughput requirement with the allocatable processing velocity of VFs under the calculated delay requirement.

2.1 Model

The disaggregated computing system is represented as an undirected graph G(N, E), where N ∋ n and E ∋ (n, n') denote the sets of nodes and links, respectively. There are two types of nodes: computational resources and switches. The notation NC ∋ c denotes the set of computational resources and NP ∋ p denotes the set of switches (N = NC ∪ NP). A computational resource refers to an XPU (CPU, GPU, FPGA, or other accelerator) on which VFs are implemented, and a switch refers to one of two types of devices that connect resources within a network: internal (Peripheral Component Interconnect Express (PCIe)) switches and external (Ethernet) switches. The notation F ∋ ƒ denotes the set of VFs that execute VFRs on computational resources. Several parameters of a VF vary depending on the computational resource on which it is implemented.

The notation S ∋ s denotes a set of services, each composed of an ordered chain of VFRs. The performance requirements of a service are defined by its throughput requirement and delay requirement. The notation Rs ∋ r denotes the set of VFRs of service s.
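The model above can be sketched as simple Python data structures. This is a hypothetical illustration under our own naming (`Resource`, `Switch`, `Service` are not from the original implementation):

```python
from dataclasses import dataclass, field

# Node types in the undirected graph G(N, E): computational resources
# (XPUs on which VFs run) and switches (internal PCIe / external Ethernet).
@dataclass
class Resource:
    name: str
    kind: str                       # e.g., "CPU", "GPU", "FPGA"
    max_vfs: int                    # maximum number of VFs deployable
    vfs: list = field(default_factory=list)

@dataclass
class Switch:
    name: str
    kind: str                       # "internal" (PCIe) or "external" (Ethernet)

@dataclass
class Service:
    name: str
    vfr_chain: list                 # ordered chain of VFR names
    throughput_req: float           # service throughput requirement
    delay_req: float                # service delay requirement (seconds)

# Links E carry bandwidth and latency; here an adjacency dict keyed by node pair.
links = {("gpu0", "pcie0"): {"bw": 64e9, "delay": 1e-6}}
svc = Service("video-ai", ["Decode", "Resize", "AI inf."], 100.0, 0.05)
```

The service instance mirrors the chained-VFR structure used throughout Section 2.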

2.2 Heuristic approach

The resource allocation problem is NP (nondeterministic polynomial time)-hard, since it can be transformed into the capacitated facility location problem, which is known to be NP-hard [17]; finding an optimal solution within a reasonable timeframe is therefore computationally challenging. To address this issue, we developed a polynomial-time heuristic approach. Our heuristic approach is greedy in that it determines the allocation and routing for each VFR simultaneously. It explores potential solutions by executing provisional allocation and routing in advance and then determines the allocation and routing of a near-optimal solution by scoring to account for the impact of the characteristics of heterogeneous computational resources. By proactively exploring solutions that satisfy constraints, this approach significantly reduces the solution space, enabling the time-efficient allocation of VFRs. It also facilitates the appropriate determination of parallelism by comparing the throughput requirement of the service and the allocatable processing velocity of VFs with the delay requirement calculated during the provisional placement and connection phases.

Algorithm 1 shows the processing of this heuristic approach. First, it sorts the VFRs of the service: the VFR with the largest processing-velocity difference is placed at the top, the VFRs with smaller indices than this initial VFR are sorted in descending index order, and the VFRs with larger indices are then sorted in ascending index order. A target VFR is selected from the top of the resulting list. Procedure 1 calculates Score, Velocity, and Path for the target VFR while the remaining throughput requirement is not 0, i.e., not yet satisfied. Score is a list of scores used to identify suitable placements, Velocity is a list of allocatable processing velocities, and Path is a list of the shortest paths (c, c'). Procedure 1 checks whether the following two conditions are satisfied for each computational resource and whether the computational resource was not previously allocated for the target VFR. The first condition is that the number of VFs deployed is smaller than the maximum number of VFs of the computational resource. The second condition is that the VFR and the deployed VF are placed only on computational resources of available types. If these checks fail, the target VFR cannot be allocated to the computational resource.
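The VFR-ordering step can be sketched as follows. This is an illustrative reading of the sorting rule, not the authors' code; the helper name `sort_vfrs` and the `velocity_diff` mapping are our assumptions:

```python
def sort_vfrs(vfrs, velocity_diff):
    """Order VFRs as described in Algorithm 1: the VFR with the largest
    processing-velocity difference first, then smaller-index VFRs in
    descending index order, then larger-index VFRs in ascending index order."""
    pivot = max(range(len(vfrs)), key=lambda i: velocity_diff[vfrs[i]])
    before = list(range(pivot - 1, -1, -1))    # smaller indices, descending
    after = list(range(pivot + 1, len(vfrs)))  # larger indices, ascending
    return [vfrs[i] for i in [pivot] + before + after]
```

For a chain ["a", "b", "c", "d"] where "b" has the largest velocity difference, the resulting processing order is ["b", "a", "c", "d"].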


Algorithm 1. Processing of heuristic approach for resource allocation.


Procedure 1. Calculation of Score, Velocity and Path.

When the above conditions are met and the target VFR is not the first VFR of the service, the shortest path (c, c') is determined using the Dijkstra method. When the previous VFR is divided, since there are several start/end points, the shortest path is calculated from all start/end points. If there is no shortest path within the allowable bandwidth, the target VFR cannot be allocated to the computational resource. If there is a shortest path, we determine the maximum allocatable processing velocity (Vmax) of the target VFR that guarantees the delay requirement of the service. The execution time of a service can be described as the sum of the maximum transmission delay of a path and the maximum processing delay of a VF. We use the M/D/1 queueing model*6 to capture the change in the processing delay with the utilization of the VF. Equation (1) gives the processing delay of the VF ƒ that executes VFR r in computational resource c under the M/D/1 queueing model, i.e., the placement of a new VFR affects all placed VFRs:

d_{ƒ,c} = 1/V_{ƒ,c} + v_{ƒ,c} / (2 V_{ƒ,c} (V_{ƒ,c} − v_{ƒ,c}))    (1)

Here, v_{ƒ,c} and V_{ƒ,c} are respectively the current and maximum processing velocities of VF ƒ in computational resource c, and d_{ƒ,c} is the processing delay of VF ƒ in computational resource c.
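The M/D/1 delay model can be computed directly. The sketch below assumes the standard M/D/1 sojourn time (service time plus mean waiting time) with service rate equal to the VF's maximum processing velocity; the function name is ours:

```python
def mdone_delay(v_cur, v_max):
    """Mean processing delay of a VF under the M/D/1 model.

    v_cur: current processing velocity (offered load) of the VF.
    v_max: maximum processing velocity (deterministic service rate).
    Utilization rho = v_cur / v_max; sojourn time is the deterministic
    service time 1/v_max plus the mean wait rho / (2 * v_max * (1 - rho)).
    """
    if not 0 <= v_cur < v_max:
        raise ValueError("require 0 <= v_cur < v_max")
    rho = v_cur / v_max
    return 1.0 / v_max + rho / (2.0 * v_max * (1.0 - rho))
```

An idle VF with v_max = 10 has delay 0.1 (pure service time), and the delay grows nonlinearly as the load approaches v_max, which is why placing a new VFR affects all VFRs already placed on the same VF.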

We derive Vmax from Equations (2) and (3), which facilitates the appropriate determination of parallelism:

d̂_{ƒ,c} = D_{s,r} − t_{(r,r'),(c,c')}    (2)

Vmax = g(d̂_{ƒ,c}) − v_{ƒ,c}    (3)

Here, d̂_{ƒ,c} is the allowable processing delay of VF ƒ in computational resource c, g(x) is the processing velocity at processing delay x, and t_{(r,r'),(c,c')} is the transmission delay of the virtual link (r, r') over the physical link (c, c'). When there are multiple shortest paths, t_{(r,r'),(c,c')} is the maximum transmission delay among these paths. D_{s,r} is the delay requirement of VFR r in service s, which is obtained by distributing the service's delay requirement among the VFRs in proportion to their potential average processing and transmission delays, so that VFRs with inherently heavier processes are allocated more allowable delay. Since the processing delay of the VF is expressed with the M/D/1 queueing model in Equation (1), g(x) can be expressed as

g(x) = 2 V_{ƒ,c} (x V_{ƒ,c} − 1) / (2 x V_{ƒ,c} − 1)
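The derivation of Vmax can be sketched numerically. This assumes the reading above: the allowable processing delay is the VFR's delay requirement minus the path's transmission delay, and g inverts the M/D/1 delay of Equation (1); both function names are ours:

```python
def g(x, v_max):
    """Processing velocity at processing delay x, obtained by inverting the
    M/D/1 delay d(v) = 1/V + v / (2V(V - v)) for v given d = x."""
    if x <= 1.0 / v_max:
        return 0.0                  # even an idle VF cannot go below 1/V
    return 2.0 * v_max * (x * v_max - 1.0) / (2.0 * x * v_max - 1.0)

def allocatable_velocity(delay_req, trans_delay, v_cur, v_max):
    """Vmax of the target VFR on a candidate resource: the velocity the VF
    can sustain at the allowable processing delay (delay requirement minus
    transmission delay), minus the load already placed on the VF."""
    allowable = delay_req - trans_delay
    return max(0.0, g(allowable, v_max) - v_cur)
```

For example, with v_max = 10, g(0.15, 10) = 5.0, consistent with the forward model where a load of 5 yields a delay of 0.15; a VF already loaded with v_cur = 2 can then accept up to 3 additional units of velocity at that allowable delay.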

Scores are calculated for the performance, future utilization, and capacity of the computational resource. Note that the performance and future-utilization scores are the first to account for the characteristics of heterogeneous computational resources. The performance score is the maximum processing velocity of the VFs that can be implemented on the computational resource. The future-utilization score is the utilization rate of the VF ƒ in computational resource c calculated from the M/D/1 queueing model:

u_{ƒ,c} = g(d̂_{ƒ,c}) / V_{ƒ,c}

where g(x) is the processing velocity at processing delay x and d̂_{ƒ,c} is the allowable processing delay of VF ƒ in computational resource c. This score enables the allocation to maximize the future utilization rate of the computational resource. Since the future utilization rate also reflects the transmission delay included in the allowable processing delay, this score also has the effect of suppressing this transmission delay. The capacity score is 1 when the computational resource is in use and 0 when it is not. This score also adds 1 when the VFs needed to execute the target VFR are in use and 0 when they are not. Note that the administrator can add other scores.

When the target VFR is the first VFR of the service, Vmax and the scores are calculated in the same manner as above, with the path taken from the start point of the service.

When Vmax is equal to or greater than the remaining throughput requirement of the target VFR, the remaining throughput requirement is stored in Velocity for the computational resource. Note that the remaining throughput requirement equals the throughput requirement of the service when the target VFR is not divided. When Vmax is less than the remaining throughput requirement, which means that the target VFR is divided and allocated across multiple computational resources in parallel, Vmax is stored in Velocity. Similarly, the scores are stored in Score, and the path is stored in Path. This is executed for all computational resources, and the scores in Score are then normalized, weighted with the weights determined by the administrator, and summed to update Score. This concludes Procedure 1.
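The final normalize-weight-sum step of Procedure 1 can be sketched as follows. This is an illustrative min-max normalization under our own naming; the original may normalize differently:

```python
def combine_scores(score_lists, weights):
    """Normalize each score list to [0, 1] (min-max), scale it by the
    administrator-chosen weight, and sum per candidate resource.

    score_lists: dict mapping score name -> list of raw scores, one entry
                 per candidate computational resource.
    weights:     dict mapping score name -> administrator-chosen weight.
    Returns the combined score per candidate resource.
    """
    total = [0.0] * len(next(iter(score_lists.values())))
    for name, scores in score_lists.items():
        lo, hi = min(scores), max(scores)
        span = hi - lo or 1.0            # avoid division by zero when flat
        for i, sc in enumerate(scores):
            total[i] += weights[name] * (sc - lo) / span
    return total
```

With two candidates scoring [1, 3] on performance and [0.5, 0.5] on utilization (equal weights), the flat utilization list contributes nothing and the second candidate wins on performance alone.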

Algorithm 1 selects the computational resource with the best score in Score. When a score is available, which means there is a computational resource where the VFR can be placed, the target VFR is allocated to that computational resource. If there is no VF available to execute the target VFR on the computational resource, the VF is also allocated to the computational resource. The path of the computational resource in Path is connected, and the processing velocity of the computational resource and the remaining throughput requirement are updated using Velocity. This is the process of updating the mapping results (MR). This process is repeated until the remaining throughput requirement reaches 0. If no score is available, which means there is no computational resource where the VFR can be placed, the previous VFR is rearranged. Finally, the MR are returned after all VFRs in the service are allocated.

*6 M/D/1 queueing model: A queueing model with Markovian (Poisson) arrivals, deterministic service time, and a single computational resource.

3. Experimental setup

We compared our method with the conventional method based on Holu [13] across various disaggregated computing systems. We first evaluated a heterogeneous system with random parameters to analyze the impact of heterogeneous computational resource characteristics and parallel processing. We then assessed the practicality of our method in a realistic heterogeneous system resembling real-world scenarios. Finally, we measured the resource allocation time to confirm whether the method produces solutions within a practical timeframe.

3.1 Simulation setup

We coded both our method and the conventional one in Python 3.10 and conducted simulations on an Intel-based personal computer with Intel Core i9-14900K 3.20 GHz (128-GB random access memory).

3.1.1 System

We assume the disaggregated computing system shown in Fig. 2: a 3 × 3 mesh topology with 144 computational resources, which is both commonly used and scalable [18]. The system consists of resource pools (8 resources each) connected via internal PCIe links and system-wide Ethernet, with top-tier switches for external network access. Six types of computational resources with distinct characteristics are symmetrically arranged. In the random system, we randomly set the parameters of computational resources and VFs within the ranges shown in Table 1 (left). In the realistic system, based on experimental information [19] and typical real-world device information [20–25], we set the parameters shown in Table 1 (right). In the realistic system, we assume B1 consists of two GPU types [20, 21], and B2 and B3 each consist of two FPGA types [22–25], as shown in Fig. 2.
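The pool structure can be sketched as an adjacency dict. This is a simplified illustration of the hierarchy (pools of 8 resources on PCIe switches, all uplinked to one external switch); the real configuration uses a 3 × 3 mesh of external switches rather than the single `eth0` assumed here:

```python
def build_system(num_pools=18, pool_size=8):
    """Build a simplified topology: each resource pool attaches `pool_size`
    resources to an internal (PCIe) switch, and every pool switch connects
    to a single external (Ethernet) switch. Returns an adjacency dict.
    With the defaults, 18 pools x 8 resources = 144 resources, as in Fig. 2."""
    adj = {}

    def link(a, b):
        # Undirected link: record both directions.
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for p in range(num_pools):
        sw = f"pcie{p}"
        link(sw, "eth0")                 # uplink to the external switch
        for i in range(pool_size):
            link(f"res{p}_{i}", sw)      # internal PCIe link
    return adj

adj = build_system()
```

Such an adjacency structure is sufficient for the Dijkstra-based path search used in Procedure 1 once link delays and bandwidths are attached.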


Fig. 2. Simulation configuration for disaggregated computing.


Table 1. Parameters in simulation experiments.

3.1.2 Service

In the random system, we assume that various services are randomly requested by combining two to six VFRs out of the ten types. Note that this simulation does not address resource reallocation after the initial allocation; we leave the exploration of dynamic reallocation that handles penalties or constraints during reallocation to future work. In the realistic system, based on video AI inference (AI inf.) [19], we assume that three types of services are randomly requested: 1. Decode α→Resize α→ AI inf. α, 2. Resize α→ AI inf. α, and 3. Decode α→Resize β→ AI inf. β. We set the requirements of each service at random within the range in Table 1. All services start and end at the top-tier Ethernet switch, simulating real-time data processing arriving at the datacenter via Ethernet.

3.2 Conventional method

We compared our method against the conventional method adapted from Holu [13], which was chosen for its relevance in considering inter-resource communication and maximizing resource utilization under delay constraints. This method, originally designed for homogeneous resources, decomposes the problem into placement and routing. It iteratively ranks VFs on the basis of resource centrality and utilization, allocates them to preferred resources, routes connections, and refines the allocation if delay requirements are unmet. To support parallel processing, we add a minimal extension that splits a VFR into independent parallel sub-tasks only when a single VF cannot satisfy the throughput requirement. This extension does not alter the original decision logic of the conventional method.
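The minimal parallelism extension described above can be sketched as a simple split rule. The even-split policy and function name are our assumptions; the extension only triggers when a single VF is insufficient, matching the description:

```python
import math

def parallel_split(throughput_req, single_vf_velocity):
    """Split a VFR into independent parallel sub-tasks only when a single VF
    cannot satisfy the throughput requirement; otherwise keep it whole.
    Returns the per-sub-task throughput requirements."""
    if single_vf_velocity >= throughput_req:
        return [throughput_req]          # one VF suffices: no split
    n = math.ceil(throughput_req / single_vf_velocity)
    return [throughput_req / n] * n      # evenly divided sub-tasks
```

For a requirement of 10 units against a VF that sustains 4, the VFR is split into three sub-tasks of 10/3 units each, leaving the conventional method's placement and routing logic untouched.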

4. Results

For each system, we evaluated five distinct cases by independently sampling the parameters in Table 1 once per case. Since these cases correspond to different configurations rather than repeated trials under identical conditions, we do not average the raw results across cases. Instead, we report representative cases: the best case, which yields the largest improvement over the conventional method, and the worst case, which yields the smallest improvement. The results of our simulation with the random and realistic systems are shown in Figs. 3 and 4, respectively. Note that the results are reported only up to the maximum number of services that satisfy all service-performance requirements. We gradually increased the number of requested services and stopped when at least one service-performance requirement was violated.


Fig. 3. Number of used resources ((a), (b)) and average utilization of used resources ((c), (d)) for best and worst cases, respectively, in random system.


Fig. 4. Number of used resources ((a), (b)) and average utilization of used resources ((c), (d)) for best and worst cases, respectively, in realistic system.

4.1 Random system

Figures 3(a) and (b) show the results for the number of used resources, and Figs. 3(c) and (d) illustrate resource utilization for the best and worst cases, respectively. To evaluate the impact of the characteristics of heterogeneous computational resources without the impact of parallel processing, we also evaluated our method without parallel processing. Compared with the conventional method, our method decreased the number of resources by an average of 20 to 46% and increased the average resource utilization by an average of 16 to 20%. The difference arises from whether the characteristics of heterogeneous computational resources are accounted for. Without accounting for these characteristics, the conventional method struggles to allocate suitable resources, leading to lower performance and the need for additional resources. In contrast, our method effectively matches resources to VFRs, reducing the number of resources. It also improves resource utilization by accounting for the provisional utilization of resources on the basis of the characteristics of heterogeneous computational resources.

Compared with the conventional method, our method with parallel processing decreased the number of resources by an average of 28 to 51% and increased the average resource utilization by an average of 29 to 35%. These results indicate that our method can further increase resource utilization and decrease the number of resources with parallel processing. The above findings clarify the importance of accounting for the characteristics of heterogeneous resources and parallel processing to maximize resource efficiency and utilization.

4.2 Realistic system

Figures 4(a), (b) show the results for the number of used resources and Figures 4(c), (d) illustrate resource utilization for best and worst cases, respectively. We evaluated the effectiveness of our method in practical cases. Compared with the conventional method, our method decreased the number of resources by an average of 41 to 49% and increased the average resource utilization by an average of 16 to 22%. These results indicate that our method will be beneficial in real-world environments.

4.3 Runtime

Figure 5 shows the distribution of the average allocation times for the 20 services in the random system, grouped by the number of services placed in the system. For each group, the box plot represents the variability of allocation time across different parameter settings, while the triangle marker indicates the average allocation time for each case. The results indicate that our method can allocate resources in all cases within ten seconds per service on average, which is considered acceptable for resource allocation before service execution. Although some variation in allocation time can be observed across different service groups, no clear increasing trend is observed as the number of placed services increases. This indicates that system usage has little effect on resource allocation time, even when the overall utilization of the system is high. These findings suggest that our method executes stable resource allocation regardless of system usage.


Fig. 5. Runtime in random system.

5. Conclusion

We introduced our resource allocation method for disaggregated computing that efficiently uses heterogeneous computational resources. The method explicitly accounts for both the performance characteristics of heterogeneous resources and the effects of parallel processing, enabling efficient execution of services through direct interconnection among heterogeneous computational resources. Simulation results on heterogeneous disaggregated systems indicate that our method reduces the number of required resources by an average of 28–51% in random systems and achieves effective resource utilization in realistic system configurations.

These results highlight the importance of jointly considering heterogeneous resource characteristics and parallel processing in resource allocation. Runtime evaluation also confirms that our method can derive allocation solutions within a practical amount of time. Overall, the results indicate that our method effectively maximizes resource utilization while satisfying service-performance requirements in disaggregated computing systems.

The insights obtained from this study are also applicable to future architectures such as IOWN DCI, where large-scale heterogeneous resource pooling and accelerator-centric communication are expected to play a key role. Our method may lead to significant cost reductions and lower energy consumption, offering a pathway toward more sustainable and economically viable datacenter operations.

References

[1] J. Sevilla and E. Roldan, “Training Compute of Frontier AI Models Grows by 4-5x per Year,” Epoch AI, May 28, 2024.
https://epoch.ai/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year
[2] A. Okada, S. Kihara, and Y. Ozaki, “Disaggregated Computing, the Basis of IOWN,” NTT Technical Review, Vol. 19, No. 7, pp. 52–57, July 2021.
https://doi.org/10.53829/ntr202107fa7
[3] IOWN Global Forum, “Data-Centric Infrastructure Cluster Reference Implementation Models,” Mar. 2025.
https://iowngf.org/wp-content/uploads/2025/03/IOWN-GF-RD-DCI_Cluster_RIM-1.0.pdf
[4] K. Katrinis, D. Syrivelis, D. Pnevmatikatos, G. Zervas, D. Theodoropoulos, I. Koutsopoulos, K. Hasharoni, D. Raho, C. Pinto, F. Espina, S. Lopez-Buedo, Q. Chen, M. Nemirovsky, D. Roca, H. Klos, and T. Berends, “Rack-scale Disaggregated Cloud Data Centers: The dReDBox Project Vision,” Proc. of the 2016 Conference on Design, Automation & Test in Europe (DATE), Dresden, Germany, pp. 690–695, Mar. 2016.
[5] P. X. Gao, A. Narayan, S. Karandikar, J. Carreira, S. Han, R. Agarwal, S. Ratnasamy, and S. Shenker, “Network Requirements for Resource Disaggregation,” Proc. of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16), Savannah, GA, USA, pp. 249–264, Nov. 2016.
[6] S.-T. Wang, H. Xu, A. Mamandipoor, R. Mahapatra, B. H. Ahn, S. Ghodrati, K. Kailas, M. Alian, and H. Esmaeilzadeh, “Data Motion Acceleration: Chaining Cross-domain Multi Accelerators,” Proc. of 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Edinburgh, UK, pp. 1043–1062, Mar. 2024.
https://doi.org/10.1109/HPCA57654.2024.00083
[7] C. Hwang, K. Park, R. Shu, X. Qu, P. Cheng, and Y. Xiong, “ARK: GPU-driven Code Execution for Distributed Deep Learning,” Proc. of the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’23), Boston, MA, USA, pp. 87–101, Apr. 2023.
[8] Y. Ukon, T. Kawahara, Y. Arikawa, N. Miura, T. Ishizaki, W. Kanemori, R. Tamura, K. Mori, and T. Sakamoto, “Scalable Low-latency Hardware Function Chaining with Chain Control Circuit,” Proc. of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC24), Atlanta, GA, USA, Nov. 2024.
[9] K. Tanaka, Y. Arikawa, T. Ito, K. Morita, N. Nemoto, F. Miura, K. Terada, J. Teramoto, and T. Sakamoto, “Communication-efficient Distributed Deep Learning with GPU-FPGA Heterogeneous Computing,” Proc. of 2020 IEEE Symposium on High-Performance Interconnects (HOTI), Piscataway, NJ, USA, pp. 43–46, Aug. 2020.
https://doi.org/10.1109/HOTI51249.2020.00021
[10] A. Krishnakumar, U. Ogras, R. Marculescu, M. Kishinevsky, and T. Mudge, “Domain-specific Architectures: Research Problems and Promising Approaches,” ACM Trans. Embed. Comput. Syst., Vol. 22, No. 2, Art. no. 28, Mar. 2023.
https://doi.org/10.1145/3563946
[11] R. Takano and K. Suzaki, “Disaggregated Accelerator Management System for Cloud Data Centers,” IEICE Trans. Inf. Syst., Vol. 104, No. 3, pp. 465–468, Mar. 2021.
https://doi.org/10.1587/transinf.2020EDL8040
[12] Y. Yue, B. Cheng, X. Liu, M. Wang, B. Li, and J. Chen, “Resource Optimization and Delay Guarantee Virtual Network Function Placement for Mapping SFC Requests in Cloud Networks,” IEEE Trans. Netw. Service Manag., Vol. 18, No. 2, pp. 1508–1523, June 2021.
https://doi.org/10.1109/TNSM.2021.3058656
[13] A. Varasteh, B. Madiwalar, A. V. Bemten, W. Kellerer, and C. Mas-Machuca, “Holu: Power-aware and Delay-constrained VNF Placement and Chaining,” IEEE Trans. Netw. Service Manag., Vol. 18, No. 2, pp. 1524–1539, June 2021.
https://doi.org/10.1109/TNSM.2021.3055693
[14] Z. N. Rashid, S. R. M. Zeebaree, R. R. Zebari, S. H. Ahmed, H. M. Shukur, and A. Alkhayyat, “Distributed and Parallel Computing System Using Single-client Multi-hash Multi-server Multi-thread,” Proc. of the 1st Babylon International Conference on Information Technology and Science (BICITS 2021), Babil, Iraq, pp. 222–227, Apr. 2021.
https://doi.org/10.1109/BICITS51482.2021.9509872
[15] Y. Huang, “Parallel Computing and Its Applications,” Proc. of 2022 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), Dalian, China, pp. 715–718, June 2022.
https://doi.org/10.1109/ICAICA54878.2022.9844487
[16] N. Ebara, S. Hatta, Y. Ukon, H. Uzawa, S. Ohteru, S. Yoshida, and K. Nakamura, “Resource Allocation Considering the Impact of Characteristics and Parallel Processing of Heterogeneous Computational Resources in Disaggregated Computing,” Proc. of the 49th IEEE Annual Computers, Software, and Applications Conference (COMPSAC 2025), Toronto, ON, Canada, pp. 477–482, July 2025.
https://doi.org/10.1109/COMPSAC65507.2025.00069
[17] R. M. Nauss, “An Improved Algorithm for the Capacitated Facility Location Problem,” J. Oper. Res. Soc., Vol. 29, No. 12, pp. 1195–1201, Dec. 1978.
https://doi.org/10.2307/3009584
[18] A. Ikoma, Y. Ohsita, and M. Murata, “Resource Allocation Considering Impact of Network on Performance in a Disaggregated Data Center,” IEEE Access, Vol. 12, pp. 67600–67618, May 2024.
https://doi.org/10.1109/ACCESS.2024.3399930
[19] M. Kanda, T. Watanabe, R. Tamura, and Y. Ukon, “OpenKasugai,” GitHub repository, 2025.
https://github.com/openkasugai
[20] NVIDIA Corporation, “NVIDIA A100 Tensor Core GPU.”
https://www.nvidia.com/en-us/data-center/a100/
[21] NVIDIA Corporation, “NVIDIA H100 GPU.”
https://www.nvidia.com/en-us/data-center/h100/
[22] Advanced Micro Devices, Inc., “AMD Alveo™ U250 Data Center Accelerator Card (Active).”
https://www.amd.com/en/products/accelerators/alveo/u250/a-u250-a64g-pq-g.html
[23] Advanced Micro Devices, Inc., “AMD Alveo™ U200 Data Center Accelerator Card (Active).”
https://www.amd.com/en/products/accelerators/alveo/u200/a-u200-a64g-pq-g.html
[24] Advanced Micro Devices, Inc., “AMD Alveo™ U55C Data Center Accelerator Card.”
https://www.amd.com/en/products/accelerators/alveo/u55c.html
[25] Advanced Micro Devices, Inc., “AMD Alveo™ V80 Compute Accelerator.”
https://www.amd.com/en/products/accelerators/alveo/v80.html
Narunori Ebara
Researcher, Device Innovation Center, NTT, Inc.
He received a B.E. and M.E. in applied and engineering physics from the University of Osaka in 2020 and 2022. In 2022, he joined NTT Device Innovation Center and has been engaged in research and development of disaggregated computing and network-traffic-monitoring systems. He is a member of the Institute of Electronics, Information and Communication Engineers (IEICE).
Saki Hatta
Senior Manager, Device Innovation Center, NTT, Inc.
She received a B.S. and M.S. in material engineering from Tokyo Institute of Technology (now Institute of Science Tokyo) in 2009 and 2011. In 2011, she joined NTT Microsystem Integration Laboratories, where she was engaged in research of design techniques for a network system on a chip (SoC). She is currently with the NTT Device Innovation Center and engaged in the research and development of network-traffic-monitoring systems. She was the recipient of the IEEE CEDA Young Researcher Award in 2018. She is a member of IEICE and serves on the Technical Program Committee of the Asian Solid-State Circuits Conference (A-SSCC).
Hikaru Uchidate
Senior Research Engineer, Device Innovation Center, NTT, Inc.
He received a B.E. and M.E. in electrical and electronic engineering from Saitama University in 2006 and 2008. In 2008, he joined Canon Inc., where he was engaged in developing an image-processing algorithm and FPGA-based hardware. In 2025, he joined the NTT Device Innovation Center, where he has been engaged in the research and development of network-traffic-monitoring technologies.
Shoko Ohteru
Senior Research Engineer, Device Innovation Center, NTT, Inc.
She received a B.S. in physics from Ochanomizu University, Tokyo, an M.S. in physics from the University of Tokyo, and a Ph.D. in engineering from Nihon University, Tokyo, in 1992, 1994, and 2011. She joined NTT Telecommunication Networks Laboratories in 1994. She is a member of IEICE.
Shuhei Yoshida
Research Scientist, Device Innovation Center, NTT, Inc.
He received a B.E. and M.E. in computer science and systems engineering from Kobe University, Hyogo, in 2014 and 2016. In 2016, he joined NTT Device Innovation Center and has been engaged in research and development of a hardware-design methodology, FPGA-based network-traffic-monitoring systems, and a high-definition AI-inference engine. He is a member of IEICE.
Hiroyuki Uzawa
Director, Device Innovation Center, NTT, Inc.
He received a B.E. and M.E. in electrical engineering from Tokyo University of Science in 2006 and 2008. In 2008, he joined NTT Microsystem Integration Laboratories, where he was engaged in research and development of design techniques for a network SoC. Since 2019, he has been engaged in research and development of a high-definition AI-inference engine at NTT Device Innovation Center. He received the 2012 IEICE Young Engineer Award and the 2021 IEICE ELEX Best Paper Award. He is a member of IEICE.