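Analyzing Kernel Performance with Nsight Compute

Once you have compiled an executable with nvcc, you can profile it with Nsight Compute.

  • Images built with nvidia-container already have nsight-compute-cli installed

  • The graphical interface can also be installed

Usage examples

Basic usage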

$ ncu ./my_cuda_speedup_solutions 512 6

Here my_cuda_speedup_solutions is the name of the executable, and 512 6 are its arguments.
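When the program launches many kernels, it can help to narrow the run down to a single kernel and save the report to a file. The following is a minimal sketch, not the setup used in this post: it assumes the kernel of interest is v6_kernel (the name that shows up in the ncu output of this post), and v6_report is a made-up report name. The script prints the command first and only runs it if ncu and the binary are actually present.

```shell
#!/bin/sh
# Sketch only: v6_report is a hypothetical report name; v6_kernel is the
# kernel name that appears in this post's ncu output.
APP=./my_cuda_speedup_solutions
NCU_ARGS="--kernel-name v6_kernel --launch-count 1 --set full -o v6_report"

# Print the full command line so it can be checked before running.
ncu_cmd() {
    echo "ncu $NCU_ARGS $APP 512 6"
}
ncu_cmd

# Run the profiler only when ncu and the binary are both present.
if command -v ncu >/dev/null 2>&1 && [ -x "$APP" ]; then
    # NCU_ARGS is left unquoted on purpose so it splits into separate options.
    ncu $NCU_ARGS "$APP" 512 6
fi
```

The saved report can then be opened in the graphical interface, if it is installed.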

The plain ncu run on ./my_cuda_speedup_solutions 512 6 produces the following output:

v6_kernel(args, float *, float *, float *), 2025-Jan-06 05:26:00, Context 1, Stream 7
Section: GPU Speed Of Light Throughput
--------------------------------  ---------------  -----------------------
DRAM Frequency                    cycle/nsecond    3.99
SM Frequency                      cycle/nsecond    1.41
Elapsed Cycles                    cycle            1661720
Memory [%]                        %                73.71
DRAM Throughput                   %                17.43
Duration                          msecond          1.18
L1/TEX Cache Throughput           %                77.41
L2 Cache Throughput               %                13.22
SM Active Cycles                  cycle            1666873.38
Compute (SM) [%]                  %                73.71
--------------------------------  ---------------  -----------------------
INF Compute and Memory are well-balanced: To reduce runtime, both computation and memory traffic must be reduced.
Check both the Compute Workload Analysis and Memory Workload Analysis sections.

Section: Launch Statistics
--------------------------------  ---------------  -----------------------
Block Size                                         1024
Function Cache Configuration                       cudaFuncCachePreferNone
Grid Size                                          256
Registers Per Thread              register/thread  38
Shared Memory Configuration Size  Kbyte            32.77
Driver Shared Memory Per Block    byte/block       0
Dynamic Shared Memory Per Block   Kbyte/block      16.64
Static Shared Memory Per Block    byte/block       0
Threads                           thread           262144
Waves Per SM                                       16
--------------------------------  ---------------  -----------------------

Section: Occupancy
--------------------------------  ---------------  -----------------------
Block Limit SM                    block            16
Block Limit Registers             block            1
Block Limit Shared Mem            block            1
Block Limit Warps                 block            1
Theoretical Active Warps per SM   warp             32
Theoretical Occupancy             %                100
Achieved Occupancy                %                99.93
Achieved Active Warps Per SM      warp             31.98
--------------------------------  ---------------  -----------------------
INF This kernel's theoretical occupancy is not impacted by any block limit.
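Several of the numbers in this report can be cross-checked against each other, which is a quick way to get a feel for what each row means. The sketch below redoes that arithmetic with values copied from the output above. The 32 threads per warp is the standard CUDA warp size, and the 32 warps per SM maximum is inferred from the report itself (32 theoretical active warps at 100% theoretical occupancy), not stated by it.

```shell
#!/bin/sh
export LC_ALL=C   # keep "." as the decimal separator in awk output

# Values copied from the ncu output above.
grid=256; block=1024
elapsed_cycles=1661720; sm_freq=1.41    # SM Frequency in cycle/ns
active_warps=31.98; max_warps=32        # max_warps inferred from 100% theoretical occupancy

threads=$((grid * block))               # should match the Threads row
warps_per_block=$((block / 32))         # 32 threads per warp
# Elapsed cycles / (cycles per ns) gives ns; divide by 1e6 for ms.
duration_ms=$(awk -v c="$elapsed_cycles" -v f="$sm_freq" 'BEGIN { printf "%.2f", c / f / 1e6 }')
# Achieved occupancy is achieved active warps over the per-SM maximum.
achieved_pct=$(awk -v a="$active_warps" -v m="$max_warps" 'BEGIN { printf "%.1f", 100 * a / m }')

echo "threads=$threads warps_per_block=$warps_per_block"
echo "duration_ms=$duration_ms achieved_pct=$achieved_pct"
```

One block already holds 32 warps, which is consistent with the register, shared memory and warp block limits all being 1: a single resident block fills the SM.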

Viewing more metrics

nv-nsight-cu-cli --list-metrics | awk '{print $1}' | tail -n +2 | paste -sd "," - | xargs -I {} echo nv-nsight-cu-cli --metrics={} ./my_cuda_speedup_solutions 512 6

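The command above collects the list of metrics that can be queried and prints a complete profiling command with all of them passed to --metrics: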

nv-nsight-cu-cli --metrics=-sm__warps_active.avg.per_cycle_active,sm__warps_active.avg.pct_of_peak_sustained_active,-sm__warps_active.avg.pct_of_peak_sustained_active,sm__throughput.avg.pct_of_peak_sustained_elapsed,-sm__throughput.avg.pct_of_peak_sustained_elapsed,sm__maximum_warps_per_active_cycle_pct,-sm__maximum_warps_per_active_cycle_pct,sm__maximum_warps_avg_per_active_cycle,-sm__maximum_warps_avg_per_active_cycle,sm__cycles_active.avg,-sm__cycles_active.avg,lts__throughput.avg.pct_of_peak_sustained_elapsed,-lts__throughput.avg.pct_of_peak_sustained_elapsed,launch__waves_per_multiprocessor,-launch__waves_per_multiprocessor,launch__thread_count,-launch__thread_count,launch__shared_mem_per_block_static,-launch__shared_mem_per_block_static,launch__shared_mem_per_block_dynamic,-launch__shared_mem_per_block_dynamic,launch__shared_mem_per_block_driver,-launch__shared_mem_per_block_driver,launch__shared_mem_per_block,-launch__shared_mem_per_block,launch__shared_mem_config_size,-launch__shared_mem_config_size,launch__registers_per_thread,-launch__registers_per_thread,launch__occupancy_per_shared_mem_size,-launch__occupancy_per_shared_mem_size,launch__occupancy_per_register_count,-launch__occupancy_per_register_count,launch__occupancy_per_cluster_size,-arch:90:90:launch__occupancy_per_cluster_size,launch__occupancy_per_block_size,-launch__occupancy_per_block_size,launch__occupancy_limit_warps,-launch__occupancy_limit_warps,launch__occupancy_limit_shared_mem,-launch__occupancy_limit_shared_mem,launch__occupancy_limit_registers,-launch__occupancy_limit_registers,launch__occupancy_limit_blocks,-launch__occupancy_limit_blocks,launch__occupancy_cluster_pct,-arch:90:90:launch__occupancy_cluster_pct,launch__occupancy_cluster_gpu_pct,-arch:90:90:launch__occupancy_cluster_gpu_pct,launch__grid_size,-launch__grid_size,launch__func_cache_config,-launch__func_cache_config,launch__cluster_size,-arch:90:90:launch__cluster_size,launch__cluster_scheduling_policy,-arch:90:90:launch__cluster_scheduling_policy,launch__cluster_max_potential_size,-arch:90:90:launch__cluster_max_potential_size,launch__cluster_max_active,-arch:90:90:launch__cluster_max_active,launch__block_size,-launch__block_size,l1tex__throughput.avg.pct_of_peak_sustained_active,-l1tex__throughput.avg.pct_of_peak_sustained_active,gpu__time_duration.sum,-gpu__time_duration.sum,gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed,-arch:89:90:gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed,-arch:75:86:gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed,-arch:40:70:gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed,gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed,-gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed,gpc__cycles_elapsed.max,-gpc__cycles_elapsed.max,gpc__cycles_elapsed.avg.per_second,-gpc__cycles_elapsed.avg.per_second,dram__cycles_elapsed.avg.per_second,-arch:89:90:dram__cycles_elapsed.avg.per_second,-arch:75:86:dram__cycles_elapsed.avg.per_second,-arch:40:70:dram__cycles_elapsed.avg.per_second,breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed,-breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed,breakdown:gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed,-breakdown:gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed ./my_cuda_speedup_solutions 512 6

Run the generated command directly to collect these metrics.
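The pipeline that builds this long command is easier to follow on a small input. The sketch below runs the same awk, tail, paste and xargs stages on a toy listing. The header line and the metric names metric_a, metric_b and metric_c are placeholders made up for illustration, not real ncu output.

```shell
#!/bin/sh
# Toy stand-in for the metric listing: a header line, then one metric per line.
# These names are placeholders, not real ncu metrics.
sample_list() {
    printf '%s\n' 'Name Description' 'metric_a first' 'metric_b second' 'metric_c third'
}

# Same stages as the pipeline in this section, reading the listing from stdin.
build_cmd() {
    awk '{print $1}' |      # keep only the first column (the metric name)
    tail -n +2 |            # drop the header line
    paste -sd "," - |       # join all names into one comma-separated line
    xargs -I {} echo nv-nsight-cu-cli --metrics={} ./my_cuda_speedup_solutions 512 6
}

sample_list | build_cmd
```

Because the last stage uses echo, the pipeline only prints the profiling command. Nothing is profiled until that printed command is run.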

bank conflict

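I am not sure why, but the command above cannot query the bank conflict data, which I personally consider very important (this was with CUDA 11.8). A likely explanation is that --list-metrics only lists the metrics collected by the currently enabled sections, while the full set of metrics available on the device is reported by --query-metrics.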

To query bank conflicts, run the following command:

ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum ./my_cuda_speedup_solutions 4096 6

The result is:

v6_kernel(args, float *, float *, float *), 2025-Jan-06 05:40:13, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum 0
---------------------------------------------------------------------- --------------- ------------------------------
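Note that the metric above ends in op_st, so the zero only covers shared-memory stores. A possible follow-up, sketched below, is to collect the load-side counter in the same run and to search the full metric list for other bank conflict counters. This assumes the op_ld counterpart is available on your GPU and ncu version, so it is worth confirming with --query-metrics first. The script prints the command and only invokes ncu when it and the binary are present.

```shell
#!/bin/sh
APP=./my_cuda_speedup_solutions
# Load-side and store-side shared-memory bank conflict counters.
# Assumption: the op_ld metric exists on your GPU; check with --query-metrics.
LD=l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum
ST=l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum

# Print the profiling command with both metrics, comma-separated.
bank_conflict_cmd() {
    echo "ncu --metrics $LD,$ST $APP 4096 6"
}
bank_conflict_cmd

if command -v ncu >/dev/null 2>&1 && [ -x "$APP" ]; then
    # Show every bank conflict metric this ncu build exposes for the GPU.
    ncu --query-metrics | grep bank_conflicts
    # Collect both counters in a single profiling run.
    ncu --metrics "$LD,$ST" "$APP" 4096 6
fi
```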


http://zzsy.me/2025/01/06/使用nsight-compute分析核函数性能/
Author
yuanyuan
Published
January 6, 2025
Updated
January 6, 2025
License