Modeling and Optimization for Customized Computing: Performance, Energy and Cost Perspective
Peipei Zhou
Ph.D. Final Defense, June 10th, 2019
Committee: Jason Cong (Chair), Glenn Reinman, Jae Hoon Sul, Tony Nowatzki

Publications: Conference
[ICCAD 18] Y. Chi, J. Cong, P. Wei, and P. Zhou, SODA: Stencil with Optimized Dataflow Architecture, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2018. Best Paper Candidate

[FCCM 18] J. Cong, P. Wei, C.H. Yu, and P. Zhou, Latte: Locality Aware Transformation for High-Level Synthesis, IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2018.
[FCCM 18] Z. Ruan, T. He, B. Li, P. Zhou, and J. Cong, ST-Accel: A High-Level Programming Platform for Streaming Applications on FPGA, IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2018.
[ISPASS 18] P. Zhou, Z. Ruan, Z. Fang, M. Shand, D. Roazen, and J. Cong, Doppio: I/O-Aware Performance Analysis, Modeling and Optimization for In-Memory Computing Framework, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2018. Best Paper Candidate (4 out of 21 accepted)

[DAC 17] J. Cong, P. Wei, C.H. Yu, and P. Zhou, Bandwidth Optimization Through On-Chip Memory Restructuring for HLS, Design Automation Conference (DAC), 2017.
[ICCAD 16] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks, IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2016.
[FCCM 16] P. Zhou, H. Park, Z. Fang, J. Cong, and A. DeHon, Energy Efficiency of Full Pipelining: A Case Study for Matrix Multiplication, IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016.
[FCCM 14] J. Cong, H. Huang, C. Ma, B. Xiao, and P. Zhou, A Fully Pipelined and Dynamically Composable Architecture of CGRA, IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2014.

Publications: Journal and Poster
[TCAD 18] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, and J. Cong, Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2018. TCAD Donald O. Pederson Best Paper Award 2019.
[DAC 18] Y. Chi, P. Zhou, and J. Cong, An Optimal Microarchitecture for Stencil Computation with Data Reuse and Fine-Grained Parallelism (Work-in-Progress Accept), Design Automation Conference (DAC), 2018.
[FPGA 18] Y. Chi, P. Zhou, and J. Cong, An Optimal Microarchitecture for Stencil Computation with Data Reuse and Fine-Grained Parallelism (Abstract Only), International Symposium on Field-Programmable Gate Arrays (FPGA), 2018.

[FPGA 16] Y. Chen, J. Cong, Z. Fang, and P. Zhou, ARAPrototyper: Enabling Rapid Prototyping and Evaluation for Accelerator-Rich Architecture (Abstract Only), ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2016.
[HITSEQ 15] Y. Chen, J. Cong, J. Lei, S. Li, M. Peto, P. Spellman, P. Wei, and P. Zhou, CS-BWAMEM: A Fast and Scalable Read Aligner at the Cloud Scale for Whole Genome Sequencing, High Throughput Sequencing, Algorithms and Applications (HiTSeq) (Poster), 2015. Best Poster Award

Accelerators
CPU core scaling is coming to an end, and CPUs are power hungry.
Accelerators: FPGAs (Field Programmable Gate Arrays) offer 10-100x better energy efficiency than CPUs.

Parallelization, Customization

Accelerators at Different Levels: Single Chip Level
System-on-Chip (SoC) FPGAs: IoT devices, self-driving cars
Battery life & performance matter!

Accelerators at Different Levels: Node Level
Performance &

Dollars Matter!
CPU <-> FPGA accelerators connected through PCIe; the host offloads kernels onto the FPGAs.

Accelerators at Different Levels: Cluster Level
Performance & Dollars Matter!
FPGA accelerators in various cloud services: IBM Cloud; Microsoft (FPGA and GPU in 2014); Amazon AWS, Baidu, Huawei, Alibaba starting 2016; by 2019, also Nimbix and Tencent.

2014: Bing web search & image search engine

Accelerators at Different Levels: Cluster Level
Performance & Dollars Matter!
Fully pipelining CPUs and FPGAs gives more cost benefit.

Outline
Introduction

Chip Level Performance and Energy Modeling
  Energy Efficiency of Full Pipelining (Chapter 2)
  Latte: Frequency and performance optimization (Chapter 3)
Node Level Performance and Cost Modeling
  Computation Resources (Chapter 4)
  Doppio, storage resources (Chapter 5)
Cluster Level Performance and Cost Modeling
  Public Cloud Cost Optimization:
    Composable Compute Instance: Mocha (Chapter 6)
    Compute + Storage Co-optimization (Chapter 7)
  Private Cloud Throughput Optimization (Chapter 8)
Conclusion

Applications: GATK Best Practice by the Broad Institute
Genome pipeline stages:
Alignment: BWA-MEM (Burrows-Wheeler Aligner)
Base-Quality Score Recalibration (BQSR): GATK (Genome Analysis Toolkit)
Calling Variants: GATK HaplotypeCaller (HTC) for germline variants, Mutect2 for tumor sequences
Applies to single-sample whole genome sequencing (WGS) or whole exome sequencing (WES) analysis

Applications: Genome Pipeline
Genome pipeline stages:
Alignment: BWA-MEM, Smith-Waterman (FPGA accelerator)
Base-Quality Score Recalibration: GATK

Calling Variants: HaplotypeCaller + Mutect2, PairHMM (FPGA accelerator)

Outline
Introduction
Chip Level Performance and Energy Modeling
  Energy Efficiency of Full Pipelining (Chapter 2)
  Latte: Frequency and performance optimization (Chapter 3)
Node Level Performance and Cost Modeling
  Computation Resources (Chapter 4)
  Doppio, storage resources (Chapter 5)
Cluster Level Performance and Cost Modeling

  Public Cloud Cost Optimization:
    Composable Compute Instance: Mocha (Chapter 6)
    Compute + Storage Co-optimization (Chapter 7)
  Private Cloud Throughput Optimization (Chapter 8)
Conclusion

Motivation: How Does Using FPGA Accelerators Impact an Application's Out-of-Pocket Cost in Public Cloud Services?
From a CPU solution to CPU + FPGA for HTC on the AWS cloud:
Latency improves 1.60x
Cost increases 2.58x

Motivation: How Does Using FPGA Accelerators Impact an Application's Out-of-Pocket Cost in Public Cloud Services?
From a CPU solution to CPU + FPGA for Mutect2 on Huawei Cloud:
Latency improves 1.49x

Cost increases 1.16x

Why? FPGAs are not cheap in the public cloud:
AWS: 1 FPGA = 25 vCPUs in price
Huawei: 1 FPGA = 23 vCPUs in price
Amdahl's law: the theoretical speedup is always limited by the part of the task that cannot benefit from the improvement. For HTC, the PairHMM kernel takes 39% of the runtime.

Challenges
Applications differ in r (kernel ratio) and S (kernel speedup):

BWA (Smith-Waterman):  r = 0.4, S ~16x
Samtools (deflate):    r = 0.2, S ~12x
GATK3.6 (PairHMM):     r = 0.4, S ~40x
BWA, Samtools, and GATK3.6 are all single-node, multi-threaded programs; BWA and Samtools are written in C, GATK3.6 in Java.

Performance Model on CPU-FPGA Platform

Well matched FPGA vs. CPUs: e.g., r = 0.5, S = 8, P = 8

Performance Model on CPU-FPGA Platform
Fast FPGA: e.g., r = 0.39, (1-r)/r = 1.56, S = 40, matched P = 64.
With only P = 8, we leave 1 - 8/64 = 87% of the FPGA resources idling. Waste!

Performance Model on CPU-FPGA Platform
Slow FPGA: e.g., r = 0.83, (1-r)/r = 0.2, S = 40, matched P = 8.
With P = 32, we leave 32 - 8 = 24 CPU cores idling. Waste!

Proposed Optimization for Fast FPGA: Sharing FPGA among Instances

Matching the throughput of CPUs and FPGAs over the network.

Slow FPGA: FPGA is the Bottleneck
The FPGA offloads CPU operations; P = executor cores, starting at P = 1:
1. As P increases, runtime decreases.
2. The FPGA becomes the bottleneck: the number of effective CPU cores stays the same (8), so runtime remains the same for P >= 8.
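The model on these slides can be sketched in a few lines of Python. This is an illustrative reconstruction, not the dissertation's exact formulation: it assumes the kernel (fraction r of single-core runtime, FPGA speedup S) is fully pipelined against the CPU-side remainder running on P cores, so the slower side sets the throughput.

```python
def amdahl_speedup(r, S):
    """Amdahl's law: a fraction r of the work is accelerated by S."""
    return 1.0 / ((1.0 - r) + r / S)

def matched_cores(r, S):
    """CPU cores P that balance a pipelined CPU+FPGA execution:
    CPU-side time (1 - r) / P equals FPGA-side time r / S."""
    return S * (1.0 - r) / r

def pipelined_speedup(r, S, P):
    """Throughput gain over one CPU core when the two sides overlap;
    the slower side is the bottleneck."""
    return 1.0 / max((1.0 - r) / P, r / S)

# HTC: PairHMM kernel, r = 0.39, S = 40 (numbers from the slides)
print(round(amdahl_speedup(0.39, 40), 2))  # ~1.61x, close to the observed 1.60x
print(round(matched_cores(0.39, 40)))      # ~63 cores, rounded up to P = 64
# Slow-FPGA case: r = 0.83, S = 40
print(round(matched_cores(0.83, 40)))      # ~8 cores
```

With r = 0.5, S = 8, P = 8 (the "well matched" example above), both sides take 1/16 of the serial time, so neither side idles.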

Proposed Optimization for Slow FPGA: Co-execution of FPGA + CPU
Partial offload: co-execution of FPGA + CPU.
M1 tasks: offload acceleratable kernels to FPGAs.
M2 tasks: instead of offloading, schedule them to CPUs.

Outline
Introduction
Chip Level Performance and Energy Modeling
  Energy Efficiency of Full Pipelining (Chapter 2)
  Latte: Frequency and performance optimization (Chapter 3)

Node Level Performance and Cost Modeling
  Computation Resources (Chapter 4)
  Doppio, storage resources (Chapter 5)
Cluster Level Performance and Cost Modeling
  Public Cloud Cost Optimization:
    Composable Compute Instance: Mocha (Chapter 6)
    Compute + Storage Co-optimization (Chapter 7)
  Private Cloud Throughput Optimization (Chapter 8)
Conclusion

Mocha: Multinode Optimization of Cost in Heterogeneous Cloud with Accelerators
1. Profiling kernel ratio r and speedup S -> matching core number P
2. Setting up the cluster: HTC on AWS, P = 64

Instances: f1.2x: 8 cores, m4.10x: 40 cores, m4.4x: 16 cores
3. Mocha runtime: CPU clients <-> NAM <-> accelerator

Optimization Results: HTC on AWS EC2
Latency improvement: 1.60x -> 1.58x, almost the same performance
Cost: 2.58x -> 0.93x, a 2.77x cost-efficiency improvement
f1.16x (64 CPUs + 8 FPGAs) -> f1.2x + m4.16x (64 CPUs + 1 FPGA): makes more use of 1 FPGA

Optimization Results: Mutect2 on Huawei Cloud
Latency improvement: 1.49x -> 2.27x, a 1.5x performance improvement
Cost: 1.16x -> 0.76x, a 1.5x cost-efficiency improvement
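The Mutect2 gain comes from the partial-offload (co-execution) split described earlier: M1 tasks to the FPGA, M2 to the CPUs. A toy partitioner for that idea, assuming identical tasks with known per-task times on each side (the times in the example are made-up, not measured):

```python
def partial_offload_split(M, t_fpga, t_cpu):
    """Split M identical tasks so both sides finish together:
    choose M1 with M1 * t_fpga ~= (M - M1) * t_cpu."""
    M1 = round(M * t_cpu / (t_fpga + t_cpu))
    return M1, M - M1

# Hypothetical per-task times: FPGA path 1s, CPU path 3s
m1, m2 = partial_offload_split(100, 1.0, 3.0)
print(m1, m2)  # 75 tasks to the FPGA, 25 to the CPUs (75s on each side)
```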

Effective CPUs: 8 -> 32; makes more use of the CPUs.

Outline
Introduction
Chip Level Performance and Energy Modeling
  Energy Efficiency of Full Pipelining (Chapter 2)
  Latte: Frequency and performance optimization (Chapter 3)
Node Level Performance and Cost Modeling
  Computation Resources (Chapter 4)
  Doppio, storage resources (Chapter 5)
Cluster Level Performance and Cost Modeling
  Public Cloud Cost Optimization:

    Composable Compute Instance: Mocha (Chapter 6)
    Compute + Storage Co-optimization (Chapter 7)
  Private Cloud Throughput Optimization (Chapter 8)
Conclusion

Problem Formulation: Public Cloud, Cost Optimization
Read samples: one genome sequence, or a batch of genome sequences?
Deadline: 6 hours? 12 hours? 24 hours? No deadline constraint?
Objective: minimum cost, in a public cloud like Amazon EC2
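The objective the later MILP minimizes is just instance-hours times price, summed over all allocations. A minimal sketch (the instance names and prices below are placeholders for illustration, not actual EC2 list prices):

```python
def total_cost(allocations, hourly_price):
    """allocations: list of (instance_type, hours);
    hourly_price: dict mapping instance_type -> $/hr."""
    return sum(hourly_price[itype] * hours for itype, hours in allocations)

# Placeholder prices for illustration only
price = {"m4.4x": 0.8, "f1.2x": 1.65}
print(total_cost([("m4.4x", 10), ("f1.2x", 2)], price))  # 11.3
```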

UCLA CDSC
$billions in potential savings!

Prior Work
ILP formulations are common:
Cost of running scientific workloads (an astronomy application) on public clouds, with different numbers of scheduled processors [DSL08]
Cost-optimal scheduling in the public cloud, formulated as a linear program with different memory, CPU, and data-communication overheads [VVB10]
Limitations:
Storage not considered
Composable instances not considered
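Prior work ignored storage, but storage cost folds into such a model as a per-hour rate. The later slides convert EBS per-GB-month prices assuming a 720-hour month; the same arithmetic in Python:

```python
def ebs_hourly_cost(price_per_gb_month, size_gb, hours_per_month=24 * 30):
    """Per-hour cost of an EBS volume of size_gb at the given list price,
    using the 720-hour month from the slides."""
    return price_per_gb_month * size_gb / hours_per_month

print(round(ebs_hourly_cost(0.125, 250), 4))  # io1: 0.0434 $/hr
print(round(ebs_hourly_cost(0.1, 250), 4))    # gp2: 0.0347 $/hr
```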

[DSL08] E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, The Cost of Doing Science on the Cloud: The Montage Example, in SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008.
[VVB10] R. Van den Bossche, K. Vanmechelen, and J. Broeckhove, Cost-Optimal Scheduling in Hybrid IaaS Clouds for Deadline Constrained Workloads, in 2010 IEEE 3rd International Conference on Cloud Computing, 2010.

Public Cloud: Storage + Compute
Multiple Computation Instances
Different instances have different CPU types.

In each CPU-type series, different instances have different numbers of CPU cores, memory sizes, and prices.

Multiple Storage Choices -> I/O-Aware Modeling
Storage choices:
SSD io1: $0.125 per GB-month, 0.125 * 250 / 24 / 30 = $0.0434/hr
SSD gp2: $0.10 per GB-month, 0.1 * 250 / 24 / 30 = $0.0347/hr
Doppio [ISPASS 18]: an I/O-aware analytic model to quantify the I/O impact and customize storage for optimal cost.

Multiple Storage Choices: Size- or IOPS-Driven?

For the genome pipeline, size matters:
vcf: 20GB; fastq: 3GB; input fastq: 104GB; output BAM: 110GB; local folder: 250GB
(https://aws.amazon.com/ebs/pricing/)
Choose the SSD gp2 price and update the table:
SSD io1: $0.125 per GB-month, 0.125 * 250 / 24 / 30 = $0.0434/hr
SSD gp2: $0.10 per GB-month, 0.1 * 250 / 24 / 30 = $0.0347/hr

Different Runtimes of Stages on Instances
Amazon EC2 instances and runtime (seconds) for different stages.

Scheduling by Solving a MILP (Mixed Integer Linear

Programming) Problem
Inputs; objective: minimize total cost; constraints (details later).
Output: an allocation variable indicating that genome s, stage t is allocated to the m-th instance of type i.

MILP Constraints
Scheduling constraints:
Hardware resource constraints: if two tasks are scheduled onto

the same instance, then the two tasks do not overlap.
Deadline constraints: finish time < deadline.
Task dependency constraints: BWA -> BQSR -> HTC.
Solver: IBM CPLEX Studio

Scheduling Results for One Read
Cost under different time constraints:
When the deadline is as small as 19800 seconds (5.5 hrs): $70 to finish a WGS.
When the deadline is as large as 87800 seconds (24.4 hrs): $10 to finish a WGS.

Scheduling Results When Mocha is Considered

Composable Instances
New instance type for HTC: f1.2x + m4.16x; the cost to meet a deadline as small as 19800 seconds is reduced from $70 to $56.

Scheduling Results for Batches of Reads
The cost savings from 1 -> 2 sequences come from reduced node-setup overhead, and the benefit is small.
[Chart: cost vs. deadline; numSeq = 1 reaches $10.7 at ddl > 87800 secs, numSeq = 2 reaches $10.6 at ddl > 139000 secs; stage bars S0/S1 bwa, bqsr, htc]


Batches of Genome Sequences (Reads)
Number of genome sequences = 2, 3, 4, 5:
numSeq = 1: minCost = $10.734 when ddl > 87800
numSeq = 2: minCost = $10.616 when ddl > 139000 (-1.096%)
numSeq = 3: minCost = $10.557 when ddl > 187331 (-0.554%)
numSeq = 4: minCost = $10.528 when ddl > 237748 (-0.279%)
numSeq = 5: minCost = $10.513 when ddl > 288165 (-0.140%)

Insight

Duplicating the configuration of numSeq = 2 is good enough for any given number of sequences.
[Schedule diagram: the numSeq = 2 pattern (S0/S1 bwa, bqsr, htc) repeated]

Outline
Introduction
Chip Level Performance and Energy Modeling
  Energy Efficiency of Full Pipelining (Chapter 2)
  Latte: Frequency and performance optimization (Chapter 3)
Node Level Performance and Cost Modeling
  Computation Resources (Chapter 4)

  Doppio, storage resources (Chapter 5)
Cluster Level Performance and Cost Modeling
  Public Cloud Cost Optimization:
    Composable Compute Instance: Mocha (Chapter 6)
    Compute + Storage Co-optimization (Chapter 7)
  Private Cloud Throughput Optimization (Chapter 8)
Conclusion

Problem Formulation: Private Cloud, Throughput Optimization
Given hardware: a 56-core CPU, 2TB disk

Given a number of sequences (reads).
Objective: minimum runtime.

Private Cloud: Runtime Matters
The objective is to minimize runtime.
Assumptions: the CPU cores on a server are limited (56 cores); a stage's runtime is a serial constant c0 plus a parallel part t0/P, i.e., c0 + t0/P.
Machine storage is limited, so the number of parallel genomes is limited: on a 2TB server, at most 2000GB / 250GB = 8 whole genomes can run in parallel.
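The two assumptions above translate directly into code. A small sketch; c0 and t0 are per-stage constants that would be fitted from profiling, so the values in the example are hypothetical:

```python
def stage_runtime(c0, t0, p):
    """Serial constant c0 plus a parallel part t0 divided over p cores."""
    return c0 + t0 / p

def max_parallel_genomes(disk_gb, per_genome_gb=250):
    """Storage-limited parallelism: each genome needs ~250GB of local scratch."""
    return disk_gb // per_genome_gb

print(max_parallel_genomes(2000))        # 8 genomes fit on a 2TB server
print(stage_runtime(100.0, 5600.0, 56))  # 200.0 (hypothetical c0, t0)
```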

Scheduling by Solving a MILP Problem
Inputs; objective: minimize total runtime; constraints (details later).
Output: p_{i,j}, the number of cores allocated to genome i.

MILP Constraints
Scheduling constraints:

Hardware resource constraints: if two tasks are scheduled onto the same CPU cores / storage space, then the two tasks do not overlap.
Dependency constraints: if two tasks have a dependency, then the start time of the latter task >= the end time of the previous one.
Solver: IBM CPLEX Studio

Scheduling Results
Solver: IBM CPLEX; cores: 56; storage spaces: 8; number of genomes: 1, 2, 3, ...
The first 4 problem sizes solve in under 5 minutes; beyond 4, solving takes more than 24 hours.

# Genomes                1      2      3      4      5
Optimal runtime (secs)  26071  43074  60269  77080  ?? (huge computation time)
When #genome = 5: still unfinished after 27 hours.

Configuration of Optimal Results: Evenly Splitting
#genome = 1: both stages use 56 cores.
#genome = 2: both stages in the two genomes use 28 cores each.

#genome = 4: both stages in the four genomes use 14 cores each.
#genome = 3: as even as possible for each stage: split the 56 cores as 19, 19, 18 for stage 1 and 18, 18, 20 for stage 2 (explained later).
[Schedule, (cores, runtime): genomes 1-2: stage 1 (19, 35607), stage 2 (18, 24662); genome 3: stage 1 (18, 37304), stage 2 (20, 22596)]

ILP

For 4 < #genome <= 8, solving takes extremely long even with CPLEX in multi-threaded mode. We put forward 3 heuristics, check their error rates against the optimal results on 3 problem sizes, and pick the best heuristic to obtain a good runtime when 4 < #genome <= 8.
3 problem sizes (the first two are manually constructed):
Case A: 14 cores, 2 storage spaces
Case B: 28 cores, 4 storage spaces
Case C: 56 cores, 8 storage spaces

Heuristic 1
When there are more genomes, construct the schedule from smaller sequential patterns of fewer genomes.
Calculating #genomes = 1, 2 is fast for all 3 problem sizes.
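Heuristic 1 is easy to state in code. A sketch using the case A base results from the tables on the following slides (optimal runtimes for 1 and 2 genomes); larger counts are composed back to back:

```python
def heuristic1(n, t1, t2):
    """Compose an n-genome makespan from optimal base results for
    1 genome (t1) and 2 genomes (t2), scheduled sequentially."""
    pairs, rem = divmod(n, 2)
    return pairs * t2 + rem * t1

# Case A (14 cores, 2 storage spaces): t1 = 77080, t2 = 145091
print(heuristic1(3, 77080, 145091))  # 222171, vs. optimal 221392 (0.552% off)
print(heuristic1(5, 77080, 145091))  # 367262, vs. optimal 365340 (0.526% off)
```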

We know the optimal results for #genome = 1 and #genome = 2.
Heuristic: use these base results to construct any number of genomes, for example:
#genome = 3: #genomes = 2 + #genomes = 1
#genome = 4: #genomes = 2 + #genomes = 2
#genome = 5: #genomes = 2 + #genomes = 2 + #genomes = 1

Optimal Results vs. Heuristic 1

For case A (14 cores, 2 storage spaces):
We know #genome = 1 and #genome = 2; we use the two base results to sequentially construct #genome = 3 (sum of the runtimes of #genome = 1 and #genome = 2).

#genomes      1       2       3       4       5       6       7       8
optimal     77080  145091  221392  290182  365340  435273  510431  580364
heuristic 1 77080  145091  222171  290182  367262  435273  512353  580364
error       0.000%  0.000%  0.552%  0.000%  0.526%  0.000%  0.377%  0.000%

Optimal Results vs. Heuristic 1

For case B (28 cores, 4 storage spaces):
We know #genome = 1 and #genome = 2; we use the two base results to sequentially construct #genome = 3 (sum of the runtimes of #genome = 1 and #genome = 2).

#genomes      1       2       3       4       5       6       7       8
optimal     43074   77080  113586  145091   huge    huge    huge    huge
heuristic 1 43074   77080  120154  154160  197234  231240  274314  308320
error       0.000%  0.000%  5.782%  6.251%   na      na      na      na
('huge' = huge computation time)

Optimal Results vs. Heuristic 1

For case C (56 cores, 8 storage spaces):

#genomes      1       2       3       4       5       6       7       8
optimal     26071   43074   60269   77080   huge    huge    huge    huge
heuristic 1 26071   52142   69145   86148  112219  129222  155293  172296
error       0.000%  0.000%  14.727% 11.764%  na      na      na      na
('huge' = huge computation time)

Heuristic 2
Distribute cores among genomes as evenly as possible.
When #genomes <= #storage spaces, we can always distribute cores among the genomes evenly.
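The even distribution is a one-liner; a sketch:

```python
def even_split(cores, n_genomes):
    """Distribute cores among n_genomes as evenly as possible
    (the first `cores % n_genomes` genomes get one extra core)."""
    q, r = divmod(cores, n_genomes)
    return [q + 1] * r + [q] * (n_genomes - r)

print(even_split(56, 3))  # [19, 19, 18], as in the Case C example
print(even_split(28, 3))  # [10, 9, 9], the Case B split (9, 9, 10)
print(even_split(56, 7))  # [8, 8, 8, 8, 8, 8, 8]
```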

For example, in Case C (56 cores, 8 storage spaces), #genome = 3: distribute the cores as 19, 19, 18 (56 in total).
[Schedule, (cores, runtime): genomes 1-2: stage 1 (19, 35607), stage 2 (19, 23575); genome 3: stage 1 (18, 37304), stage 2 (18, 24662); heuristic 2 makespan: 61966]

Optimal Results vs. Heuristic 2
For case B (28 cores, 4 storage spaces):
When #genomes <= #storage spaces, we can always distribute cores among the genomes evenly.
Case B, #genome = 3: distribute the cores as 9, 9, 10.

#genomes      1       2       3       4       5       6       7       8
optimal     43074   77080  113586  145091   huge    huge    huge    huge
heuristic 1 43074   77080  120154  154160  197234  231240  274314  308320
error       0.000%  0.000%  5.782%  6.251%   na      na      na      na
heuristic 2 43074   77080  114864  145091  188165  222171  259955  290182
error       0.000%  0.000%  1.125%  0.000%   na      na      na      na
('huge' = huge computation time)

Optimal Results vs. Heuristic 2
For case C (56 cores, 8 storage spaces):
Case C, #genome = 7: distribute the cores as 8, 8, 8, 8, 8, 8, 8.
Heuristic 2 further reduces the difference from the optimal results and achieves a smaller makespan than heuristic 1.
For mod(56, #seq) = 0, the error = 0%; there is a small difference when mod(56, #seq) != 0.

#genomes      1       2       3       4       5       6       7       8
optimal     26071   43074   60269   77080   huge    huge    huge    huge
heuristic 1 26071   43074   69145   86148  103151  129222  146225  172296
error       0.000%  0.000%  14.727% 11.764%  na      na      na      na
heuristic 2 26071   43074   61966   77080   95628  114864  128089  145091
error       0.000%  0.000%  2.816%  0.000%   na      na      na      na
('huge' = huge computation time)

Runtime Experiments
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz; R = 56 CPU cores, Y = 8 storage spaces; NA12878-Garvan.
Average error < 4.2%
Stage 0: bwa; Stage 1: bqsr + htc

Heuristic 2 vs. Optimal Results
#cores = 56, #genome = 3: heuristic 2 -> optimal

configurations:
[Schedules, (cores, runtime): heuristic 2: stage 1 (19, 35607) x2, (18, 37304); stage 2 (19, 23575) x2, (18, 24662); makespan 61966. Optimal: stage 1 (19, 35607) x2, (18, 37304); stage 2 (18, 24662) x2, (20, 22596); makespan 60269]

Can This Apply to Other #seq?

Balance-aware Heuristic 2
Check #seq = 5: heuristic 2 gives 95628; if we redistribute cores in stage 2, the runtime increases to 103917.
[Schedules: heuristic 2 splits 56 cores as 12, 11, 11, 11, 11 in both stages; an 11-core genome finishes in 57813 + 37815 = 95628. With the stage-2 redistribution, the 12-core genome gets 8 stage-2 cores and finishes in 53418 + 50499 = 103917.]

Summary

For case C (56 cores, 8 storage spaces): the balance-aware *heuristic 2 achieves a smaller makespan than heuristic 2 for some points. For example, for #seq = 6 we can use the optimal runtime of #seq = 3 in Case B (28 cores, 4 storage spaces): 114864 -> 113586.

#genomes       1       2       3       4       5       6       7       8
optimal      26071   43074   60269   77080   huge    huge    huge    huge
heuristic 1  26071   43074   69145   86148  103151  129222  146225  172296
error        0.000%  0.000%  14.727% 11.764%  na      na      na      na
heuristic 2  26071   43074   61966   77080   95628  114864  128089  145091
error        0.000%  0.000%  2.816%  0.000%   na      na      na      na
*heuristic 2 26071   43074   61966   77080   95628  113586  128089  145091
error        0.000%  0.000%  0.000%  0.000%   na      na      na      na
('huge' = huge computation time)

For #genome > #storage space =

8
Heuristic: construct any number of genomes using base cases with #genome <= #storage space.
For case C (56 cores, 8 storage spaces), #genome > 8. Two methods:
Method 1: treat every 8 genomes as a full batch, and let the last batch pick its configuration from the base cases.
Method 2: an ILP that introduces variables base_case_1, base_case_2, ..., base_case_8 (continued on the next slide).

Method 2: ILP
Input parameters:

Variables; objective function; constraints.

Results of Method 1 vs. Method 2, for N = 9..100
[Charts: runtime difference (Method 1 - Method 2) and total runtime as #genome increases from 9 to 100]


Method 2 (using an ILP to compose base cases) is faster than Method 1 (picking the last batch from the base cases) for some #genome values:
Difference of 191: #genome = 11, Method 1: 8 + 3, Method 2: 7 + 4
Difference of 2499: #genome = 14, Method 1: 8 + 6, Method 2: 7 + 7
Difference of 1543: #genome = 21, Method 1: 8 + 8 + 5, Method 2: 7 + 7 + 7

Conclusions

Public cloud cost optimization: composable instances optimize the cost compared to prior work.
Private cloud throughput optimization: a balance-aware splitting heuristic.

References and Acknowledgement
Related publications
Before the thesis proposal:
[FCCM 16] P. Zhou, H. Park, Z. Fang, J. Cong, and A. DeHon, Energy Efficiency of Full Pipelining: A Case Study for Matrix Multiplication, IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2016.
[ISPASS 18] P. Zhou, Z. Ruan, Z. Fang, M. Shand, D. Roazen, and J. Cong, Doppio: I/O-Aware Performance Analysis, Modeling and Optimization for In-Memory Computing Framework, IEEE International Symposium on Performance Analysis of Systems and Software

(ISPASS), 2018. Best Paper Candidate (4 out of 21 accepted)
After the thesis proposal:
[FCCM 18] J. Cong, P. Wei, C.H. Yu, and P. Zhou, Latte: Locality Aware Transformation for High-Level Synthesis, IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2018.

Acknowledgement
Energy Pipeline: Zhenman Fang, Hyunseok Park, et al.
Doppio: Zhenyuan Ruan, Zhenman Fang, et al.
Latte: Cody Hao Yu, Peng Wei, et al.
Mocha: Jiayi Sheng, Di Wu, Cody Hao Yu, Muhuan Huang, Jie Wang, et al.

Thank you! Q&A

Backup Slides

ILP Input Parameters
[Table: ILP input parameters and variable counts (2N, 2N, 2N, 2N*R, N*Y, ...); N: #whole genomes; R: #cores in a server; Y: set of storage spaces]

ILP Variable Count, Objective, and Constraints
[Formulas]

ILP Constraints
If two tasks are scheduled on the same core, they cannot overlap.
If two genomes are scheduled to use the same storage space, they cannot overlap.

Outline
Introduction
Chip Level Performance and Energy Modeling
  Energy Efficiency of Full Pipelining (Chapter 2)
  Latte: Locality Aware Transformation for High-Level SynThEsis (Chapter 3)
Node Level Performance and Cost Modeling:

  Computation Resources (Chapter 4)
  Doppio: I/O-aware Performance analysis, moDeling and OPtimization for in-memory computing framework. Storage Resources (Chapter 5)
Cluster Level Performance and Cost Modeling
  Mocha: Multinode Optimization of Cost in Heterogeneous Cloud with Accelerators (Chapter 6)
Applications of Modeling and Optimization: Genome Pipeline
  Public Cloud Cost Optimization (Chapter 7)
  Private Cloud Performance Optimization (Chapter 8)
Conclusion