CS149 Parallel Computing: Notes and Assignments

news 2026/6/14 22:49:42

Lecture Notes

Video link

Course website

Lecture7 GPU architecture and CUDA Programming

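Further reference: CUDA Programming Guide

Basic concepts and programming model

Basic concepts

CUDA is a "non-graphics-specific" programming interface to GPU hardware. It was introduced in 2007 alongside the NVIDIA Tesla architecture.

A CUDA program consists of a hierarchy of concurrent threads. Thread IDs can be up to 3-dimensional (the example in the lecture's figure is 2-dimensional).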

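A CUDA kernel is launched with the <<< >>> syntax, which specifies the number of blocks in the grid and the number of threads per block. In the lecture's example, matrixAdd<<<numblocks, threadPerBlock>>>(A, B, C) launches a 3 × 2 grid of blocks, and each block contains 4 × 3 threads. Every thread runs the code in matrixAdd, and while running each thread can use the built-in variables blockIdx, blockDim and threadIdx to find its own position.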
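To make the indexing concrete, here is a small host-side C++ sketch (plain C++, not CUDA) that emulates the coordinates each thread would compute from blockIdx, blockDim and threadIdx. The names Dim2 and globalCoords are made up for this illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for CUDA's dim3/uint3, restricted to two dimensions.
struct Dim2 { int x, y; };

// Emulates, on the CPU, the global (x, y) coordinate every CUDA thread would
// compute as blockIdx * blockDim + threadIdx.
std::vector<Dim2> globalCoords(Dim2 gridDim, Dim2 blockDim) {
    assert(gridDim.x > 0 && gridDim.y > 0 && blockDim.x > 0 && blockDim.y > 0);
    std::vector<Dim2> coords;
    for (int by = 0; by < gridDim.y; ++by)            // blockIdx.y
        for (int bx = 0; bx < gridDim.x; ++bx)        // blockIdx.x
            for (int ty = 0; ty < blockDim.y; ++ty)   // threadIdx.y
                for (int tx = 0; tx < blockDim.x; ++tx)  // threadIdx.x
                    coords.push_back(Dim2{bx * blockDim.x + tx,
                                          by * blockDim.y + ty});
    return coords;
}
```

Calling globalCoords({3, 2}, {4, 3}) yields 72 coordinates that cover a 12 × 6 region, one per thread of the example launch.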

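CUDA memory model

Host memory and device (GPU) memory are separate address spaces. To pass data from the host to the GPU, first allocate memory on the GPU, then copy the data into it.
GPU memory comes in three kinds: per-thread private memory, per-block memory (shared by all threads in the block, marked with the __shared__ keyword), and global memory (allocated with cudaMalloc). These correspond to three levels of locality; shared memory is more performance-friendly than global memory.

Synchronization:

__syncthreads(): a barrier. Execution continues past it only after every thread in the block has reached the line that calls __syncthreads().
Atomic operations, for example atomicAdd.
Host/device synchronization: there is an implicit barrier across all threads when a kernel returns.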

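NVIDIA V100 hardware architecture

The NVIDIA V100 GPU has 80 SMs (Streaming Multiprocessors) in total, which share one L2 cache.

Each SM consists of four sub-cores. Each sub-core holds up to 16 × 32 execution contexts (registers R0, R1, ..., all scalar), and every 32 contexts form a "warp". The lecture's figure shows this structure for one SM.

Each sub-core has a warp selector. At run time the sub-core picks a runnable warp, fetches the next instruction (one instruction, shared by all threads of the warp) and executes it. Some threads may sit idle, depending on how far the threads of that warp have diverged. Each sub-core also has several execution units.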
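As a quick sanity check on these figures, the sketch below multiplies them out. It only uses the numbers quoted above; the function name residentThreads is invented for the example.

```cpp
#include <cassert>

// Upper bound on resident thread contexts, given the count at each level of
// the hierarchy (SMs, sub-cores per SM, warps per sub-core, threads per warp).
int residentThreads(int sms, int subCoresPerSM, int warpsPerSubCore,
                    int threadsPerWarp) {
    assert(sms > 0 && subCoresPerSM > 0 && warpsPerSubCore > 0 &&
           threadsPerWarp > 0);
    return sms * subCoresPerSM * warpsPerSubCore * threadsPerWarp;
}
```

With the V100 numbers, residentThreads(1, 4, 16, 32) gives 2048 contexts per SM and residentThreads(80, 4, 16, 32) gives 163840 for the whole chip.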

Although CUDA describes its programming model as "SIMT" (single instruction, multiple threads), when all 32 threads of a warp execute the same instruction the execution is effectively SIMD. As with ISPC, divergent control flow costs performance: "If the 32 CUDA threads do not share the same instruction, performance can suffer due to divergent execution".
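A very simplified way to picture the cost of divergence is to count how many times a warp has to issue an if/else. The model below is my own assumption for illustration, not a description of the real scheduler: each side of the branch is issued once if at least one lane takes it, with the other lanes masked off.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;

// Simplified model (an assumption, not a hardware specification): a warp that
// reaches an if/else issues the "then" path once if any lane takes it, and the
// "else" path once if any lane takes it.
int passesForIfElse(const std::array<bool, kWarpSize>& takesThen) {
    bool anyThen = false, anyElse = false;
    for (bool t : takesThen) {
        if (t) anyThen = true; else anyElse = true;
    }
    return (anyThen ? 1 : 0) + (anyElse ? 1 : 0);
}

// Fraction of lane slots doing useful work: every lane is active in exactly
// one pass, so 32 useful slots are spread over passes * 32 issued slots.
double laneUtilization(const std::array<bool, kWarpSize>& takesThen) {
    const int passes = passesForIfElse(takesThen);
    assert(passes >= 1);
    return static_cast<double>(kWarpSize) / (passes * kWarpSize);
}
```

Under this model a warp whose lanes all agree needs one pass, while a single disagreeing lane already doubles the issued work and halves utilization.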

Assignments

Repo link

Written against the 2023 assignments.

Assignment 1: Analyzing Parallel Program Performance on a Quad-Core CPU
Assignment 2: Scheduling Task Graphs on a Multi-Core CPU
Assignment 3: A Simple Renderer in CUDA

Assignment 3

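A project about accelerating computation with CUDA. I was not familiar with CUDA, so it was hard going.

Environment:

OS: WSL
GPU: NVIDIA GeForce RTX 3050 Laptop GPU
nvcc version: 12.3 (note that there is a dedicated WSL build: CUDA Toolkit 12.3 Downloads | NVIDIA Developer)
g++ version: 12

Also, the ref program needs a specific CUDA runtime version, and a version that is too new does not fit my local environment, so I checked out a commit from 2023 and copied the whole thing over.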

SAXPY

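A warm-up: re-implement SAXPY from asst1 in CUDA to get familiar with cudaMalloc, cudaMemcpy, cudaFree, cudaDeviceSynchronize, and how to launch a CUDA kernel.

Results compared with the ISPC version from asst1 are below. For a fair comparison, the data size in asst1 has to be changed to match this experiment:

ISPC implementation: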

[saxpy serial]:         [76.086] ms     [19.585] GB/s   [2.629] GFLOPS
[saxpy ispc]:           [75.910] ms     [19.630] GB/s   [2.635] GFLOPS
[saxpy task ispc]:      [57.970] ms     [25.705] GB/s   [3.450] GFLOPS

CUDA implementation:

Found 1 CUDA devices
Device 0: NVIDIA GeForce RTX 3050 Laptop GPU
   SMs:        16
   Global mem: 4096 MB
   CUDA Cap:   8.6
---------------------------------------------------------
Running 3 timing tests:
Effective BW by CUDA saxpy: 233.917 ms          [4.778 GB/s]
Effective BW of kernel calc: 10.398 ms          [107.480 GB/s]
Effective BW by CUDA saxpy: 173.751 ms          [6.432 GB/s]
Effective BW of kernel calc: 6.959 ms           [160.595 GB/s]
Effective BW by CUDA saxpy: 192.386 ms          [5.809 GB/s]
Effective BW of kernel calc: 7.489 ms           [149.233 GB/s]

Comparing compute time alone, CUDA is far faster:

[saxpy task ispc]:      [57.970] ms     [25.705] GB/s   [3.450] GFLOPS
Effective BW of kernel calc: 7.489 ms           [149.233 GB/s]

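However, the host-device memory transfer is very expensive: in every run the end-to-end time is more than 20 times the kernel time. The kernel itself reaches 149.233 GB/s, fairly close to the theoretical bandwidth (193 GB/s), which suggests the kernel is limited by memory bandwidth.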
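The bandwidth figures in the log can be reproduced with a few lines of arithmetic. This sketch assumes N = 100,000,000 elements and three float arrays touched per element (x, y and the result), counting 1 GB as 2^30 bytes. None of that is stated in the post, but those assumptions match the printed numbers. The function name is made up.

```cpp
#include <cassert>

// Effective bandwidth in GB/s, counting 1 GB as 2^30 bytes.
// Assumption: SAXPY touches three float arrays of n elements each
// (reads x and y, writes the result).
double saxpyBandwidthGBps(long long n, double milliseconds) {
    assert(n > 0 && milliseconds > 0.0);
    const double bytes = 3.0 * static_cast<double>(n) * sizeof(float);
    const double gigabytes = bytes / (1024.0 * 1024.0 * 1024.0);
    return gigabytes / (milliseconds / 1000.0);
}
```

For example, saxpyBandwidthGBps(100000000, 7.489) comes out at about 149.2 GB/s, and the same call with the end-to-end time of 192.386 ms gives about 5.8 GB/s, in line with the log above.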

Scan

Implement exclusive_scan. See Lecture 8 of the video course for the approach; the PA README also provides matching pseudocode:

void exclusive_scan_iterative(int* start, int* end, int* output) {
    int N = end - start;
    memmove(output, start, N*sizeof(int));

    // upsweep phase
    for (int two_d = 1; two_d <= N/2; two_d*=2) {
        int two_dplus1 = 2*two_d;
        parallel_for (int i = 0; i < N; i += two_dplus1) {
            output[i+two_dplus1-1] += output[i+two_d-1];
        }
    }

    output[N-1] = 0;

    // downsweep phase
    for (int two_d = N/2; two_d >= 1; two_d /= 2) {
        int two_dplus1 = 2*two_d;
        parallel_for (int i = 0; i < N; i += two_dplus1) {
            int t = output[i+two_d-1];
            output[i+two_d-1] = output[i+two_dplus1-1];
            output[i+two_dplus1-1] += t;
        }
    }
}
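Since parallel_for is not real C++, here is a serial, runnable rendering of the same pseudocode that can be used to check results on the CPU. It is only a sketch: it assumes the input length is a power of two (the CUDA version would pad the array), and the name exclusiveScan is mine.

```cpp
#include <cassert>
#include <vector>

// Serial C++ rendering of the upsweep/downsweep scan. Each inner loop is what
// the pseudocode marks as parallel_for: its iterations touch disjoint elements.
// Assumption: the input length is a power of two.
std::vector<int> exclusiveScan(const std::vector<int>& input) {
    const int n = static_cast<int>(input.size());
    assert(n > 0 && (n & (n - 1)) == 0);
    std::vector<int> out(input);

    // Upsweep: build partial sums in place, doubling the stride each round.
    for (int twoD = 1; twoD <= n / 2; twoD *= 2) {
        const int twoDPlus1 = 2 * twoD;
        for (int i = 0; i < n; i += twoDPlus1)
            out[i + twoDPlus1 - 1] += out[i + twoD - 1];
    }

    out[n - 1] = 0;  // the identity element seeds the downsweep

    // Downsweep: push prefix sums back down, halving the stride each round.
    for (int twoD = n / 2; twoD >= 1; twoD /= 2) {
        const int twoDPlus1 = 2 * twoD;
        for (int i = 0; i < n; i += twoDPlus1) {
            const int t = out[i + twoD - 1];
            out[i + twoD - 1] = out[i + twoDPlus1 - 1];
            out[i + twoDPlus1 - 1] += t;
        }
    }
    return out;
}
```

For the input {1, 2, 3, 4} this returns {0, 1, 3, 6}.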

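It is not hard; just follow the pseudocode. The one thing to watch is int overflow in the index computation. I got bitten by this, so I changed the index in the corresponding kernels from int to long long.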

__global__ void
upsweep_kernel(int N, int* inputArray, int stride)
{
    // int idx = (blockIdx.x * blockDim.x + threadIdx.x + 1) * stride - 1;
    // Widen before multiplying: blockIdx/blockDim/threadIdx are 32-bit unsigned,
    // so without the cast the product wraps before it is stored in a long long.
    long long idx = (long long)(blockIdx.x * blockDim.x + threadIdx.x + 1) * stride - 1; // avoid integer overflow
    inputArray[idx] += inputArray[idx - (stride >> 1)];
}

__global__ void
downsweep_kernel(int N, int* inputArray, int stride)
{
    // int idx = (blockIdx.x * blockDim.x + threadIdx.x + 1) * stride - 1;
    long long idx = (long long)(blockIdx.x * blockDim.x + threadIdx.x + 1) * stride - 1; // avoid integer overflow
    int tmp = inputArray[idx - (stride >> 1)];
    inputArray[idx - (stride >> 1)] = inputArray[idx];
    inputArray[idx] += tmp;
}
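One subtlety with this kind of fix: declaring idx as long long only helps if the multiplication itself is carried out in 64 bits. In C++ (and CUDA C++), an unsigned int times an int is evaluated in 32 bits and wraps modulo 2^32 before the result is stored. The host-side sketch below shows the difference, with a plain globalThread value standing in for blockIdx.x * blockDim.x + threadIdx.x. The function names are invented.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the index expression with every operand 32 bits wide. The product
// wraps modulo 2^32 before the conversion to a wider type happens.
std::int64_t index32(std::uint32_t globalThread, std::uint32_t stride) {
    const std::uint32_t wrapped = (globalThread + 1u) * stride - 1u;
    return static_cast<std::int64_t>(wrapped);
}

// Widening one operand first makes the whole expression 64-bit.
std::int64_t index64(std::uint32_t globalThread, std::uint32_t stride) {
    assert(stride > 0);
    return (static_cast<std::int64_t>(globalThread) + 1) * stride - 1;
}
```

With globalThread = 2^20 and stride = 2^12 the true index is 2^32 + 4095; index32 returns 4095 while index64 returns 4294971391. Whether a launch ever reaches such values depends on how many threads are started per stride level.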

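Next comes find_repeats, which is not hard either. The call chain is fill_repeat_flags => exclusive_scan => scatter_flags, so beyond the scan only the other two functions need to be written. Nothing special to watch out for.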
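The three stages can be sketched on the CPU as follows. The stage names in the comments follow the call chain above, but the bodies are my own serial illustration rather than the assignment's kernels, and findRepeats is a made-up name.

```cpp
#include <cassert>
#include <vector>

// CPU sketch of the three-stage pipeline: flag, scan, scatter.
std::vector<int> findRepeats(const std::vector<int>& a) {
    const int n = static_cast<int>(a.size());

    // Stage 1 (fill_repeat_flags): flags[i] = 1 when a[i] == a[i + 1].
    std::vector<int> flags(n, 0);
    for (int i = 0; i + 1 < n; ++i) flags[i] = (a[i] == a[i + 1]) ? 1 : 0;

    // Stage 2 (exclusive_scan): slot[i] = number of flagged positions before i.
    std::vector<int> slot(n, 0);
    for (int i = 1; i < n; ++i) slot[i] = slot[i - 1] + flags[i - 1];

    // Stage 3 (scatter_flags): each flagged i writes itself to output[slot[i]].
    const int count = (n == 0) ? 0 : slot[n - 1] + flags[n - 1];
    assert(count <= n);
    std::vector<int> output(count);
    for (int i = 0; i < n; ++i)
        if (flags[i]) output[slot[i]] = i;
    return output;
}
```

For the input {1, 2, 2, 1, 1, 1, 3, 5, 3, 3} the result is {1, 3, 4, 8}: the indices i where a[i] equals a[i + 1].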

The final scores are below. I am not sure why the gap is so large for the smallest input.

-------------------------
Scan Score Table:
-------------------------
-------------------------------------------------------------------------
| Element Count   | Ref Time        | Student Time    | Score           |
-------------------------------------------------------------------------
| 1000000         | 1.589           | 2.26            | 0.8788716814159293 |
| 10000000        | 14.072          | 12.157          | 1.25            |
| 20000000        | 27.321          | 22.288          | 1.25            |
| 40000000        | 53.277          | 42.531          | 1.25            |
-------------------------------------------------------------------------
|                                   | Total score:    | 4.628871681415929/5.0 |
-------------------------------------------------------------------------
-------------------------
Find_repeats Score Table:
-------------------------
-------------------------------------------------------------------------
| Element Count   | Ref Time        | Student Time    | Score           |
-------------------------------------------------------------------------
| 1000000         | 2.738           | 3.746           | 0.9136412172984516 |
| 10000000        | 20.158          | 20.511          | 1.25            |
| 20000000        | 39.049          | 38.149          | 1.25            |
| 40000000        | 76.606          | 70.088          | 1.25            |
-------------------------------------------------------------------------
|                                   | Total score:    | 4.663641217298451/5.0 |
-------------------------------------------------------------------------

render

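Implement a simple renderer. refRenderer.cpp already provides a correct serial implementation, and the task is to write a parallel version in CUDA. The PA ships a parallel version that runs but is incorrect: it parallelizes at the granularity of circles, which breaks the ordering dependency between the colors of overlapping circles and scrambles the final image.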

__global__ void kernelRenderCircles() {
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    if (index >= cuConstRendererParams.numCircles)  // index is the circle index
        return;

    // .....

    // for all pixels in the bounding box
    for (int pixelY=screenMinY; pixelY<screenMaxY; pixelY++) {
        float4* imgPtr = (float4*)(&cuConstRendererParams.imageData[4 * (pixelY * imageWidth + screenMinX)]);
        for (int pixelX=screenMinX; pixelX<screenMaxX; pixelX++) {
            float2 pixelCenterNorm = make_float2(invWidth * (static_cast<float>(pixelX) + 0.5f),
                                                 invHeight * (static_cast<float>(pixelY) + 0.5f));
            shadePixel(index, pixelCenterNorm, p, imgPtr);
            imgPtr++;
        }
    }
}
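Why the order matters can be seen with a single color channel and two half-transparent circles. The sketch assumes the usual alpha "over" rule (result = alpha * color + (1 - alpha) * existing); Layer and composite are names invented for the example.

```cpp
#include <cassert>
#include <vector>

// One color channel of one semi-transparent circle.
struct Layer { float color; float alpha; };

// Blends layers onto a background in the given order, using the "over" rule:
// result = alpha * color + (1 - alpha) * existing.
float composite(float background, const std::vector<Layer>& layersInOrder) {
    float pixel = background;
    for (const Layer& l : layersInOrder) {
        assert(l.alpha >= 0.f && l.alpha <= 1.f);
        pixel = l.alpha * l.color + (1.f - l.alpha) * pixel;
    }
    return pixel;
}
```

Blending a bright layer (color 1, alpha 0.5) and then a dark one (color 0, alpha 0.5) over a black background gives 0.25, while the reverse order gives 0.5. If the circles covering one pixel are applied in whatever order their threads happen to run, the result is not deterministic.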

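Implementation 1

Following the hint in the PA README, "There are two potential axes of parallelism in this assignment. One axis is parallelism across pixels another is parallelism across circles", we can parallelize over the image's pixels and, inside the kernel, walk through the circles in dependency order, calling shadePixel for each one. This implementation is correct, but its performance is poor: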

__global__
void kernelRenderPixels()
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int imageWidth = cuConstRendererParams.imageWidth;
    int imageHeight = cuConstRendererParams.imageHeight;

    // index is a 0-based pixel index, so width * height is already out of range
    if (index >= imageWidth * imageHeight) {
        return;
    }

    float invWidth = 1.f / imageWidth;
    float invHeight = 1.f / imageHeight;
    int pixelY = index / imageWidth;
    int pixelX = index % imageWidth;
    float2 pixelCenterNorm = make_float2(invWidth * (static_cast<float>(pixelX) + 0.5f),
                                         invHeight * (static_cast<float>(pixelY) + 0.5f));
    float4* imgPtr = (float4*)(&cuConstRendererParams.imageData[4 * index]);

    // visit the circles in order so that blending respects the dependency
    for (int circleIndex = 0; circleIndex < cuConstRendererParams.numCircles; ++circleIndex) {
        float3 circlePosition = *(float3*)(&cuConstRendererParams.position[3 * circleIndex]);
        shadePixel(circleIndex, pixelCenterNorm, circlePosition, imgPtr);
    }
}

void
CudaRenderer::render() {
    // 256 threads per block is a healthy number
    dim3 blockDim(256, 1);
    dim3 gridDim((image->width * image->height + blockDim.x - 1) / blockDim.x);

    kernelRenderPixels<<<gridDim, blockDim>>>();
    cudaDeviceSynchronize();
}

Test results for this implementation:

--------------------------------------------------------------------------
| Scene Name      | Ref Time (T_ref) | Your Time (T)   | Score           |
--------------------------------------------------------------------------
| rgb             | 0.7622           | 0.6416          | 9               |
| rand10k         | 5.0429           | 77.2235         | 2               |
| rand100k        | 45.496           | 813.3614        | 2               |
| pattern         | 1.0921           | 8.9045          | 3               |
| snowsingle      | 29.2303          | 773.4755        | 2               |
| biglittle       | 27.4429          | 96.223          | 4               |
| rand1M          | 328.6162         | 8376.8588       | 2               |
| micro2M         | 606.1327         | 16774.7623      | 2               |
--------------------------------------------------------------------------
|                                    | Total score:    | 26/72           |
--------------------------------------------------------------------------

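Implementation 2

To improve performance, split the image into 16×16 tiles, with one thread block responsible for each tile. The work per block is:

Test in parallel whether the tile intersects each circle (256 circles are tested at a time).
If the tile intersects a circle, call shadePixel to render it; otherwise skip the circle.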

#define BLOCKNUMX 16
#define BLOCKNUMY 16
#define BLOCKSIZE 256

__global__
void kernelRenderPixels()
{
    __shared__ int isBoxInCircle[BLOCKSIZE];

    int pixelX = blockIdx.x * blockDim.x + threadIdx.x;
    int pixelY = blockIdx.y * blockDim.y + threadIdx.y;
    int imageWidth = cuConstRendererParams.imageWidth;
    int imageHeight = cuConstRendererParams.imageHeight;
    float invWidth = 1.f / imageWidth;
    float invHeight = 1.f / imageHeight;

    // Threads outside the image must not return early: they still test circles
    // for the block and have to reach every __syncthreads() below.
    bool inImage = pixelX < imageWidth && pixelY < imageHeight;

    // bounds of this block's tile, in pixels and in normalized coordinates
    int boxL = blockIdx.x * blockDim.x;
    int boxR = (min(blockIdx.x * blockDim.x + blockDim.x, imageWidth));
    int boxB = blockIdx.y * blockDim.y;
    int boxT = (min(blockIdx.y * blockDim.y + blockDim.y, imageHeight));
    float boxLInv = boxL * invWidth;
    float boxRInv = boxR * invWidth;
    float boxTInv = boxT * invHeight;
    float boxBInv = boxB * invHeight;

    int linearThreadIndex = threadIdx.y * blockDim.x + threadIdx.x;
    float2 pixelCenterNorm = make_float2(invWidth * (static_cast<float>(pixelX) + 0.5f),
                                         invHeight * (static_cast<float>(pixelY) + 0.5f));

    for (int batchStartIndexForCircles = 0;
         batchStartIndexForCircles < cuConstRendererParams.numCircles;
         batchStartIndexForCircles += BLOCKSIZE)
    {
        // each thread tests one circle of the batch against the tile
        int indexForCircles = batchStartIndexForCircles + linearThreadIndex;
        if (indexForCircles >= cuConstRendererParams.numCircles) {
            isBoxInCircle[linearThreadIndex] = 0;
        } else {
            float circleX = cuConstRendererParams.position[3 * indexForCircles];
            float circley = cuConstRendererParams.position[3 * indexForCircles + 1];
            float circleRadius = cuConstRendererParams.radius[indexForCircles];
            isBoxInCircle[linearThreadIndex] =
                circleInBoxConservative(circleX, circley, circleRadius, boxLInv, boxRInv, boxTInv, boxBInv)
                    ? circleInBox(circleX, circley, circleRadius, boxLInv, boxRInv, boxTInv, boxBInv)
                    : 0;
        }
        __syncthreads();

        // each thread shades its own pixel with the circles that hit the tile
        for (int i = batchStartIndexForCircles; i < batchStartIndexForCircles + BLOCKSIZE; ++i) {
            if (inImage && isBoxInCircle[i % BLOCKSIZE]) {
                float4* imgPtr = (float4*)(&cuConstRendererParams.imageData[4 * (pixelY * imageWidth + pixelX)]);
                float3 circlePosition = *(float3*)(&cuConstRendererParams.position[3 * i]);
                shadePixel(i, pixelCenterNorm, circlePosition, imgPtr);
            }
        }
        __syncthreads();
    }
}

void
CudaRenderer::render()
{
    dim3 blockDim(BLOCKNUMX, BLOCKNUMY);
    dim3 gridDim((image->width + blockDim.x - 1) / blockDim.x, (image->height + blockDim.y - 1) / blockDim.y);
    kernelRenderPixels<<<gridDim, blockDim>>>();
}
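For reference, the two kinds of circle/tile test can be sketched like this. It is my own host-side approximation of what a conservative and an exact test typically look like, with made-up names (Box, overlapsConservative, overlapsExact); it is not copied from the starter code's circleInBoxConservative and circleInBox.

```cpp
#include <algorithm>
#include <cassert>

struct Box { float left, right, bottom, top; };

// Cheap conservative test: is the circle centre inside the box grown by the
// radius on every side? It can report true for a circle that only comes near
// a corner, but it never misses a real overlap.
bool overlapsConservative(float cx, float cy, float r, const Box& b) {
    assert(r >= 0.f);
    return cx >= b.left - r && cx <= b.right + r &&
           cy >= b.bottom - r && cy <= b.top + r;
}

// Exact test: clamp the centre to the box to find the closest box point, then
// compare its squared distance to the centre with the squared radius.
bool overlapsExact(float cx, float cy, float r, const Box& b) {
    const float px = std::min(std::max(cx, b.left), b.right);
    const float py = std::min(std::max(cy, b.bottom), b.top);
    const float dx = px - cx, dy = py - cy;
    return dx * dx + dy * dy <= r * r;
}
```

A circle of radius 0.6 centred at (1.5, 1.5) passes the conservative test against the unit box but fails the exact one, because the nearest corner is about 0.71 away.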

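Note that the final __syncthreads() is required. Without it there is a race: a thread that finishes the inner loop early could overwrite isBoxInCircle for the next batch while other threads are still reading the current one. Performance again: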

--------------------------------------------------------------------------
| Scene Name      | Ref Time (T_ref) | Your Time (T)   | Score           |
--------------------------------------------------------------------------
| rgb             | 0.8281           | 1.5998          | 6               |
| rand10k         | 5.1915           | 37.4328         | 3               |
| rand100k        | 45.6692          | 345.5081        | 3               |
| pattern         | 1.0687           | 4.5626          | 4               |
| snowsingle      | 28.8797          | 295.1067        | 2               |
| biglittle       | 26.8013          | 60.9962         | 6               |
| rand1M          | 339.1135         | 3399.8742       | 2               |
| micro2M         | 609.6217         | 6839.2456       | 2               |
--------------------------------------------------------------------------
|                                    | Total score:    | 28/72           |
--------------------------------------------------------------------------

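Amazing! Hardly any performance improvement at all...

Implementation 3

Out of ideas, I looked at other people's implementations. Besides tiling the image, they also use sharedMemExclusiveScan inside the first loop: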

__global__
void kernelRenderPixels()
{
    __shared__ uint isBoxInCircle[BLOCKSIZE];
    __shared__ uint prefixSumOutput[BLOCKSIZE];
    __shared__ uint prefixSumScratch[2 * BLOCKSIZE];
    __shared__ int inBoxCircleIndexes[BLOCKSIZE];

    // ...

    int linearThreadIndex = threadIdx.y * blockDim.x + threadIdx.x;
    float2 pixelCenterNorm = make_float2(cuConstRendererParams.invWidth * (static_cast<float>(pixelX) + 0.5f),
                                         cuConstRendererParams.invHeight * (static_cast<float>(pixelY) + 0.5f));
    isBoxInCircle[linearThreadIndex] = 0;
    inBoxCircleIndexes[linearThreadIndex] = -1;

    for (int batchStartIndexForCircles = 0;
         batchStartIndexForCircles < cuConstRendererParams.numCircles;
         batchStartIndexForCircles += BLOCKSIZE)
    {
        int indexForCircles = batchStartIndexForCircles + linearThreadIndex;
        if (indexForCircles < cuConstRendererParams.numCircles) {
            float circleX = cuConstRendererParams.position[3 * indexForCircles];
            float circley = cuConstRendererParams.position[3 * indexForCircles + 1];
            float circleRadius = cuConstRendererParams.radius[indexForCircles];
            isBoxInCircle[linearThreadIndex] = circleInBox(circleX, circley, circleRadius, boxLInv, boxRInv, boxTInv, boxBInv);
        } else {
            // clear the flag left over from the previous batch
            isBoxInCircle[linearThreadIndex] = 0;
        }
        __syncthreads();

        // use sharedMemExclusiveScan to improve performance
        // but why is the performance **far** better than the version without the exclusive scan?
        sharedMemExclusiveScan(linearThreadIndex, isBoxInCircle, prefixSumOutput, prefixSumScratch, BLOCKSIZE);
        if (isBoxInCircle[linearThreadIndex]) {
            inBoxCircleIndexes[prefixSumOutput[linearThreadIndex]] = indexForCircles;
        }
        __syncthreads();

        int numOfIntescetedCircles = prefixSumOutput[BLOCKSIZE - 1] + isBoxInCircle[BLOCKSIZE - 1];
        for (int i = 0; i < numOfIntescetedCircles; ++i) // loop only over the circles that actually intersect
        {
            float4* imgPtr = (float4*)(&cuConstRendererParams.imageData[4 * (pixelY * imageWidth + pixelX)]);
            float3 circlePosition = *(float3*)(&cuConstRendererParams.position[3 * inBoxCircleIndexes[i]]);
            shadePixel(inBoxCircleIndexes[i], pixelCenterNorm, circlePosition, imgPtr);
        }
        // keep fast threads from overwriting the shared arrays for the next batch
        __syncthreads();
    }
}
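At first I thought this optimization would be negligible: the inner for loop runs at most 256 times, which I considered roughly O(1) next to the 100k circles in the test set, and the cost of calling sharedMemExclusiveScan might even exceed what the shorter loop saves. But the test results were, well, amazing. It is a huge improvement: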

--------------------------------------------------------------------------
| Scene Name      | Ref Time (T_ref) | Your Time (T)   | Score           |
--------------------------------------------------------------------------
| rgb             | 0.6982           | 0.702           | 9               |
| rand10k         | 6.4562           | 7.4349          | 9               |
| rand100k        | 45.3627          | 55.6973         | 8               |
| pattern         | 1.0167           | 0.9699          | 9               |
| snowsingle      | 34.3391          | 37.9076         | 9               |
| biglittle       | 25.9874          | 53.0712         | 6               |
| rand1M          | 322.1914         | 325.6942        | 9               |
| micro2M         | 586.4424         | 597.0768        | 9               |
--------------------------------------------------------------------------
|                                    | Total score:    | 68/72           |
--------------------------------------------------------------------------

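But why? My only guess is that the inner for loop in implementation 3 no longer contains an if branch, which greatly reduces execution divergence. One caveat to this guess: in implementation 2 the flag isBoxInCircle[i % BLOCKSIZE] depends only on the loop counter, so all threads of a block take the same branch there. The gain may instead come mostly from skipping the 256 flag checks per batch, which add up to roughly numCircles checks per thread over all batches.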

// Implementation 2
for (int i = batchStartIndexForCircles; i < batchStartIndexForCircles + BLOCKSIZE; ++i)
{
    if (isBoxInCircle[i % BLOCKSIZE]) {
        float4* imgPtr = (float4*)(&cuConstRendererParams.imageData[4 * (pixelY * imageWidth + pixelX)]);
        float3 circlePosition = *(float3*)(&cuConstRendererParams.position[3 * i]);
        shadePixel(i, pixelCenterNorm, circlePosition, imgPtr);
    }
}
// Implementation 3
int numOfIntescetedCircles = prefixSumOutput[BLOCKSIZE - 1] + isBoxInCircle[BLOCKSIZE - 1];
for (int i = 0; i < numOfIntescetedCircles; ++i) // loop only over the circles that actually intersect
{
    float4* imgPtr = (float4*)(&cuConstRendererParams.imageData[4 * (pixelY * imageWidth + pixelX)]);
    float3 circlePosition = *(float3*)(&cuConstRendererParams.position[3 * inBoxCircleIndexes[i]]);
    shadePixel(inBoxCircleIndexes[i], pixelCenterNorm, circlePosition, imgPtr);
}
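To put rough numbers on the two inner loops, here is a small counting model in host-side C++. It is only a sketch: it counts loop iterations for one pixel thread and ignores the cost of the scan itself and of shadePixel. The names LoopCounts and countInnerIterations are invented.

```cpp
#include <cassert>
#include <vector>

struct LoopCounts {
    long long flagLoop;       // implementation 2 style: every slot is visited
    long long compactedLoop;  // implementation 3 style: only overlapping circles
};

// Counts the inner-loop iterations one pixel thread would execute over all
// batches. hits[c] is 1 if circle c overlaps the thread's tile, else 0.
LoopCounts countInnerIterations(const std::vector<int>& hits, int batchSize) {
    assert(batchSize > 0);
    LoopCounts counts{0, 0};
    const int n = static_cast<int>(hits.size());
    for (int start = 0; start < n; start += batchSize) {
        // Visit every slot of the batch and test its flag.
        counts.flagLoop += batchSize;
        // Visit only the compacted list of overlapping circles.
        for (int c = start; c < n && c < start + batchSize; ++c)
            counts.compactedLoop += hits[c];
    }
    return counts;
}
```

With 1024 circles of which 3 overlap the tile and a batch size of 256, the flag loop runs 1024 times per thread while the compacted loop runs 3 times. For a 100k-circle scene where most circles miss any given tile, the gap grows accordingly.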

I wanted to verify this guess with ncu or another profiling tool, but after a whole evening I still could not get the versions of ncu, the driver, and the CUDA runtime to line up, so I am leaving it here for now.

In addition, calling circleInBoxConservative before circleInBox gave no measurable gain, and removing the if from the first loop did not help either:

    for (int batchStartIndexForCircles = 0;
         batchStartIndexForCircles < cuConstRendererParams.numCircles;
         batchStartIndexForCircles += BLOCKSIZE)
    {
        int indexForCircles = batchStartIndexForCircles + linearThreadIndex;
        // no bounds check here: the last batch reads past the end of the arrays
        // unless numCircles is a multiple of BLOCKSIZE
        float circleX = cuConstRendererParams.position[3 * indexForCircles];
        float circley = cuConstRendererParams.position[3 * indexForCircles + 1];
        float circleRadius = cuConstRendererParams.radius[indexForCircles];
        isBoxInCircle[linearThreadIndex] = circleInBox(circleX, circley, circleRadius, boxLInv, boxRInv, boxTInv, boxBInv);
        __syncthreads();
        // ...
