
CS149 Parallel Computing: Notes and Assignments

Lecture Notes

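Video link

Course website

Lecture 7: GPU Architecture and CUDA Programming

Further reference: CUDA Programming Guide

Basic concepts and programming model

Basic concepts

CUDA is a "non-graphics-specific" programming interface to GPU hardware. It was introduced in 2007 with the NVIDIA Tesla architecture.

A CUDA program consists of a hierarchy of concurrent threads. Thread IDs can have up to three dimensions (the example in the figure below is two-dimensional).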

image-20260610000202769

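A CUDA kernel is launched as shown in the figure below, with the number of blocks in the grid and the number of threads per block given inside <<< >>>. In the example, matrixAdd<<<numblocks, threadPerBlock>>>(A, B, C) launches a grid of 3 × 2 blocks, and each block contains 4 × 3 threads. Every thread runs the code in matrixAdd, and during execution each thread can use the built-in variables blockIdx, blockDim, and threadIdx to find its own position.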
image-20260610001948803
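To make the indexing concrete, here is a minimal CPU-side sketch in C++ of how each thread of that launch could derive its global coordinate. The `Dim2` struct and both function names are hypothetical stand-ins for CUDA's built-in variables, and the 12 × 6 matrix size is an assumption chosen so that a 3 × 2 grid of 4 × 3 blocks covers it exactly.

```cpp
#include <vector>

// Hypothetical stand-in for CUDA's built-in blockIdx / blockDim / threadIdx (2D only).
struct Dim2 { int x, y; };

// Global (x, y) coordinate of one thread, computed the way a 2D CUDA kernel
// usually does it: blockIdx * blockDim + threadIdx on each axis.
Dim2 globalCoord(Dim2 blockIdx, Dim2 blockDim, Dim2 threadIdx) {
    return { blockIdx.x * blockDim.x + threadIdx.x,
             blockIdx.y * blockDim.y + threadIdx.y };
}

// Walk over every thread of a numBlocks x threadsPerBlock launch and count
// how many threads land on each element of a width x height matrix.
std::vector<int> coverage(Dim2 numBlocks, Dim2 threadsPerBlock, int width, int height) {
    std::vector<int> hits(width * height, 0);
    for (int by = 0; by < numBlocks.y; ++by)
        for (int bx = 0; bx < numBlocks.x; ++bx)
            for (int ty = 0; ty < threadsPerBlock.y; ++ty)
                for (int tx = 0; tx < threadsPerBlock.x; ++tx) {
                    Dim2 g = globalCoord({bx, by}, threadsPerBlock, {tx, ty});
                    if (g.x < width && g.y < height)   // same guard a kernel would use
                        hits[g.y * width + g.x] += 1;
                }
    return hits;
}
```

Calling `coverage({3, 2}, {4, 3}, 12, 6)` visits 72 threads, and each matrix element is touched by exactly one of them.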

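CUDA memory model

  1. Host (CPU) memory and device (GPU) memory are separate address spaces. To pass data from the host to the GPU, first allocate memory on the GPU, then copy the data into it.

  2. GPU memory itself comes in three kinds: per-thread private memory, per-block memory (shared by all threads in the block and marked with the __shared__ keyword), and global memory (allocated with cudaMalloc). The three kinds correspond to three levels of locality; shared memory is the more performance-friendly place for data that a block reuses.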

    image-20260610003146150

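Synchronization

  1. __syncthreads(): a barrier. Execution continues only after every thread in the block has reached the __syncthreads() call.
  2. Atomic operations, such as atomicAdd.
  3. Host/device synchronization: there is an implicit barrier across all threads when a kernel returns.

NVIDIA V100 hardware architecture

The NVIDIA V100 GPU has 80 SMs (Streaming Multiprocessors), which share a single L2 cache.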

image-20260610230718015

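Each SM consists of four sub-cores. Each sub-core holds up to 16 × 32 execution contexts (registers R0, R1, ..., all scalar), and every 32 contexts form a warp. Each SM has the structure shown in the figure below: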

image-20260610232202925
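The numbers above multiply out as follows. This is only a small arithmetic sketch in C++; the constant and function names are made up for illustration, and the values are the ones quoted in these notes.

```cpp
// Figures quoted in the notes for the V100.
constexpr int kSMs             = 80;  // streaming multiprocessors
constexpr int kSubCoresPerSM   = 4;
constexpr int kWarpsPerSubCore = 16;  // "up to 16 x 32 contexts" per sub-core
constexpr int kThreadsPerWarp  = 32;

// Maximum resident thread contexts on one SM.
constexpr int contextsPerSM() {
    return kSubCoresPerSM * kWarpsPerSubCore * kThreadsPerWarp;
}

// Maximum resident thread contexts on the whole chip.
constexpr int contextsPerGPU() {
    return kSMs * contextsPerSM();
}
```

That gives 2048 thread contexts per SM and 163,840 across the 80 SMs.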

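Each sub-core has a warp selector. At run time the sub-core picks one runnable warp, fetches the next instruction (the same one for every thread in the warp), and executes it. Some threads may be masked off for that instruction, depending on how far the threads in the warp have diverged. Each sub-core has several execution units, as shown in the figure below.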

image-20260610230327682

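CUDA describes its programming model as SIMT (single instruction, multiple threads), but when all 32 threads of a warp execute the same instruction, the hardware is effectively running in SIMD fashion. As with ISPC, divergent control flow hurts performance: "If the 32 CUDA threads do not share the same instruction, performance can suffer due to divergent execution".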
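A rough way to picture the cost of divergence is the toy model below, written in C++. It is a deliberately simplified assumption, not how the hardware is specified: a warp pays for a branch path whenever at least one of its 32 lanes takes that path. The function name and the cost parameters are hypothetical.

```cpp
#include <array>

constexpr int kWarpSize = 32;

// Issue slots a warp spends on `if (takeBranch[lane]) A; else B;` in a
// simplified model: the warp executes path A if any lane needs it, and
// path B if any lane needs it, with the other lanes masked off meanwhile.
int warpIssueSlots(const std::array<bool, kWarpSize>& takeBranch,
                   int costIf, int costElse) {
    bool anyIf = false;
    bool anyElse = false;
    for (bool taken : takeBranch) {
        if (taken) anyIf = true;
        else       anyElse = true;
    }
    return (anyIf ? costIf : 0) + (anyElse ? costElse : 0);
}
```

In this model a warp whose lanes all take the `if` side pays only `costIf`, while a single lane going the other way makes the whole warp pay `costIf + costElse`.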

Assignments

Repo link

I followed the 2023 version of the assignments.

  • Assignment 1: Analyzing Parallel Program Performance on a Quad-Core CPU
  • Assignment 2: Scheduling Task Graphs on a Multi-Core CPU
  • Assignment 3: A Simple Renderer in CUDA

Assignment 3

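A project that uses CUDA to accelerate computation. I was not familiar with CUDA, so it was hard going.

Environment:

  • OS: WSL

  • GPU: NVIDIA GeForce RTX 3050 Laptop GPU

  • nvcc version: 12.3. Note that there is a dedicated WSL build: CUDA Toolkit 12.3 Downloads | NVIDIA Developer

  • g++ version: 12

In addition, the reference program needs a specific CUDA runtime version, and too new a version does not work with my local setup, so I checked out a commit from 2023 and copied it over wholesale.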

SAXPY

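A warm-up: reimplement the SAXPY from asst1 in CUDA, to get familiar with cudaMalloc, cudaMemcpy, cudaFree, cudaDeviceSynchronize, and launching a CUDA kernel.

The results are compared with the ISPC version from asst1 below. For the comparison, the data size in asst1 has to be changed to match this experiment:

ISPC implementation: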

[saxpy serial]:         [76.086] ms     [19.585] GB/s   [2.629] GFLOPS
[saxpy ispc]:           [75.910] ms     [19.630] GB/s   [2.635] GFLOPS
[saxpy task ispc]:      [57.970] ms     [25.705] GB/s   [3.450] GFLOPS

CUDA implementation:

Found 1 CUDA devices
Device 0: NVIDIA GeForce RTX 3050 Laptop GPU
   SMs:        16
   Global mem: 4096 MB
   CUDA Cap:   8.6
---------------------------------------------------------
Running 3 timing tests:
Effective BW by CUDA saxpy: 233.917 ms          [4.778 GB/s]
Effective BW of kernel calc: 10.398 ms          [107.480 GB/s]
Effective BW by CUDA saxpy: 173.751 ms          [6.432 GB/s]
Effective BW of kernel calc: 6.959 ms           [160.595 GB/s]
Effective BW by CUDA saxpy: 192.386 ms          [5.809 GB/s]
Effective BW of kernel calc: 7.489 ms           [149.233 GB/s]

Comparing compute time alone, the CUDA kernel is far faster:

[saxpy task ispc]:      [57.970] ms     [25.705] GB/s   [3.450] GFLOPS
Effective BW of kernel calc: 7.489 ms           [149.233 GB/s]

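However, CUDA's memory transfers are very expensive: the end-to-end time (about 170 to 230 ms) is dominated by copying data between host and device, not by the kernel (7 to 10 ms). The kernel itself reaches 149.233 GB/s, fairly close to the theoretical bandwidth (193 GB/s), which shows that the kernel is bound by memory bandwidth.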
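The bandwidth figures in the log can be reproduced with a little arithmetic. The C++ sketch below rests on three assumptions that are not stated in these notes: the array length is 100 million floats, three arrays are counted as traffic (read x, read y, write the result), and "GB" means 2^30 bytes. The function name is hypothetical.

```cpp
#include <cstddef>

// Effective bandwidth in GiB/s for SAXPY over n floats, assuming three
// arrays of n floats each are moved (two read, one written).
double saxpyBandwidthGiBps(std::size_t n, double seconds) {
    const double bytes = 3.0 * static_cast<double>(n) * sizeof(float);
    return bytes / (1024.0 * 1024.0 * 1024.0) / seconds;
}
```

Under these assumptions, `saxpyBandwidthGiBps(100000000, 7.489e-3)` comes out at roughly 149.2, and the same call with the end-to-end time of 192.386 ms gives roughly 5.81, both in line with the log above.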

Scan

Implement exclusive_scan. For the algorithm, see Lecture 8 of the video course; the PA readme also gives the corresponding pseudocode:

void exclusive_scan_iterative(int* start, int* end, int* output) {
    int N = end - start;
    memmove(output, start, N*sizeof(int));

    // upsweep phase
    for (int two_d = 1; two_d <= N/2; two_d*=2) {
        int two_dplus1 = 2*two_d;
        parallel_for (int i = 0; i < N; i += two_dplus1) {
            output[i+two_dplus1-1] += output[i+two_d-1];
        }
    }

    output[N-1] = 0;

    // downsweep phase
    for (int two_d = N/2; two_d >= 1; two_d /= 2) {
        int two_dplus1 = 2*two_d;
        parallel_for (int i = 0; i < N; i += two_dplus1) {
            int t = output[i+two_d-1];
            output[i+two_d-1] = output[i+two_dplus1-1];
            output[i+two_dplus1-1] += t;
        }
    }
}
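The pseudocode can be checked on the CPU by replacing each parallel_for with an ordinary loop. The following is a minimal sequential C++ sketch under that assumption; the function name is hypothetical, and like the pseudocode it only handles lengths that are a power of two.

```cpp
#include <cstddef>
#include <vector>

// In-place exclusive scan using the upsweep/downsweep scheme.
// data.size() must be a power of two (or zero).
void exclusiveScanInPlace(std::vector<int>& data) {
    const std::size_t n = data.size();
    if (n == 0) return;

    // Upsweep: build partial sums in a binary-tree pattern.
    for (std::size_t twoD = 1; twoD <= n / 2; twoD *= 2) {
        const std::size_t twoDPlus1 = 2 * twoD;
        for (std::size_t i = 0; i < n; i += twoDPlus1)
            data[i + twoDPlus1 - 1] += data[i + twoD - 1];
    }

    data[n - 1] = 0;  // identity element goes to the root

    // Downsweep: push prefix sums back down the tree.
    for (std::size_t twoD = n / 2; twoD >= 1; twoD /= 2) {
        const std::size_t twoDPlus1 = 2 * twoD;
        for (std::size_t i = 0; i < n; i += twoDPlus1) {
            const int t = data[i + twoD - 1];
            data[i + twoD - 1] = data[i + twoDPlus1 - 1];
            data[i + twoDPlus1 - 1] += t;
        }
    }
}
```

For the input 1, 2, 3, 4, 5, 6, 7, 8 this leaves 0, 1, 3, 6, 10, 15, 21, 28 in the vector.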

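It is not hard; just follow the pseudocode. The one thing to watch out for is int overflow: for large strides the index computation can exceed the 32-bit range. I hit this, so I changed the index in the affected kernels from int to long long: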

__global__ void
upsweep_kernel(int N, int* inputArray, int stride)
{
    // int idx = (blockIdx.x * blockDim.x + threadIdx.x + 1) * stride - 1;
    // Widen before multiplying: blockIdx.x, blockDim.x and threadIdx.x are 32-bit,
    // so without the cast the product would wrap before it is stored in idx.
    long long idx = (static_cast<long long>(blockIdx.x) * blockDim.x + threadIdx.x + 1) * stride - 1; // avoid integer overflow
    inputArray[idx] += inputArray[idx - (stride >> 1)];
}

__global__ void
downsweep_kernel(int N, int* inputArray, int stride)
{
    // int idx = ( blockIdx.x * blockDim.x + threadIdx.x + 1) * stride - 1;
    // Same widening as in upsweep_kernel.
    long long idx = (static_cast<long long>(blockIdx.x) * blockDim.x + threadIdx.x + 1) * stride - 1; // avoid integer overflow
    int tmp = inputArray[idx - (stride >> 1)];
    inputArray[idx - (stride >> 1)] = inputArray[idx];
    inputArray[idx] += tmp;
}
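Why the width of the operands matters, and not only the width of the variable that receives the result, can be seen in a small C++ experiment. The function names and the sample values (thread 255, stride 2^26) are hypothetical and chosen only to force a wrap-around.

```cpp
#include <cstdint>

// Index computed with 32-bit operands: the product wraps modulo 2^32
// before it is ever widened to 64 bits.
std::int64_t indexNarrow(std::uint32_t threadId, std::uint32_t stride) {
    const std::uint32_t product = (threadId + 1u) * stride;  // 32-bit multiply
    return static_cast<std::int64_t>(product) - 1;
}

// Index computed after widening one operand: the multiply itself is 64-bit.
std::int64_t indexWide(std::uint32_t threadId, std::uint32_t stride) {
    return (static_cast<std::int64_t>(threadId) + 1) * stride - 1;
}
```

For small inputs the two agree, for example both return 31 for thread 3 and stride 8. For thread 255 and stride 2^26 the true product is 2^34, which the narrow version wraps to 0, so it returns -1 while the wide version returns 17179869183.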

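Next comes find_repeats, which is not hard either. The call chain is fill_repeat_flags => exclusive_scan => scatter_flags. With exclusive_scan already in place, only the other two functions need to be implemented, and there is nothing special to watch out for.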
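The three-step chain can be sketched sequentially in C++ as below. This is a CPU illustration only: the function names and signatures are assumptions, not the kernels from the repo, and the scan here is a plain running sum standing in for the parallel one.

```cpp
#include <cstddef>
#include <vector>

// Step 1: flags[i] = 1 when input[i] equals its right neighbour, else 0.
std::vector<int> fillRepeatFlags(const std::vector<int>& input) {
    std::vector<int> flags(input.size(), 0);
    for (std::size_t i = 0; i + 1 < input.size(); ++i)
        flags[i] = (input[i] == input[i + 1]) ? 1 : 0;
    return flags;
}

// Step 2: exclusive prefix sum of the flags; offsets[i] is the output slot
// for element i if its flag is set.
std::vector<int> exclusiveScan(const std::vector<int>& flags) {
    std::vector<int> offsets(flags.size(), 0);
    int running = 0;
    for (std::size_t i = 0; i < flags.size(); ++i) {
        offsets[i] = running;
        running += flags[i];
    }
    return offsets;
}

// Step 3: scatter the index of every flagged element to its slot.
std::vector<int> findRepeats(const std::vector<int>& input) {
    if (input.empty()) return {};
    const std::vector<int> flags = fillRepeatFlags(input);
    const std::vector<int> offsets = exclusiveScan(flags);
    std::vector<int> result(offsets.back() + flags.back());
    for (std::size_t i = 0; i < input.size(); ++i)
        if (flags[i]) result[offsets[i]] = static_cast<int>(i);
    return result;
}
```

For the input 1, 2, 2, 1, 1, 1, 3, 5, 3, 3 the result is the index list 1, 3, 4, 8. Each iteration of steps 1 and 3 is independent of the others, which is what makes them easy to turn into kernels.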

The final scores are below. I am not sure why the gap to the reference is so large at the smallest input size.

-------------------------
Scan Score Table:
-------------------------
-------------------------------------------------------------------------
| Element Count   | Ref Time        | Student Time    | Score           |
-------------------------------------------------------------------------
| 1000000         | 1.589           | 2.26            | 0.8788716814159293 |
| 10000000        | 14.072          | 12.157          | 1.25            |
| 20000000        | 27.321          | 22.288          | 1.25            |
| 40000000        | 53.277          | 42.531          | 1.25            |
-------------------------------------------------------------------------
|                                   | Total score:    | 4.628871681415929/5.0 |
-------------------------------------------------------------------------

-------------------------
Find_repeats Score Table:
-------------------------
-------------------------------------------------------------------------
| Element Count   | Ref Time        | Student Time    | Score           |
-------------------------------------------------------------------------
| 1000000         | 2.738           | 3.746           | 0.9136412172984516 |
| 10000000        | 20.158          | 20.511          | 1.25            |
| 20000000        | 39.049          | 38.149          | 1.25            |
| 40000000        | 76.606          | 70.088          | 1.25            |
-------------------------------------------------------------------------
|                                   | Total score:    | 4.663641217298451/5.0 |
-------------------------------------------------------------------------

render

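Implement a simple renderer. refRenderer.cpp provides a correct serial implementation, and the task is to write a parallel version in CUDA. The PA ships with a parallel version that runs but is incorrect: it parallelizes over circles, which breaks the ordering dependency between the colors that overlapping circles contribute to a pixel and scrambles the final image.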

__global__ void kernelRenderCircles() {
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    if (index >= cuConstRendererParams.numCircles)  // index is the circle index
        return;

    // .....

    // for all pixels in the bounding box
    for (int pixelY=screenMinY; pixelY<screenMaxY; pixelY++) {
        float4* imgPtr = (float4*)(&cuConstRendererParams.imageData[4 * (pixelY * imageWidth + screenMinX)]);
        for (int pixelX=screenMinX; pixelX<screenMaxX; pixelX++) {
            float2 pixelCenterNorm = make_float2(invWidth * (static_cast<float>(pixelX) + 0.5f),
                                                 invHeight * (static_cast<float>(pixelY) + 0.5f));
            shadePixel(index, pixelCenterNorm, p, imgPtr);
            imgPtr++;
        }
    }
}
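Why the order of circles matters can be shown with a one-channel model of alpha blending in C++. This is a simplified sketch, assuming each circle is blended over the existing pixel value as alpha * color + (1 - alpha) * existing; the `Layer` struct and function names are hypothetical.

```cpp
#include <vector>

// One semi-transparent contribution to a pixel (single color channel).
struct Layer { float color; float alpha; };

// Blend one layer over the value already stored in the pixel.
float blendOver(float existing, const Layer& layer) {
    return layer.alpha * layer.color + (1.0f - layer.alpha) * existing;
}

// Apply the layers to the pixel in the order given.
float compositeInOrder(float background, const std::vector<Layer>& layers) {
    float pixel = background;
    for (const Layer& layer : layers)
        pixel = blendOver(pixel, layer);
    return pixel;
}
```

With a white background (1.0), a half-transparent black layer followed by a half-transparent white layer gives 0.75, while the reverse order gives 0.5. When one thread per circle writes to the same pixel with no ordering guarantee, either result can appear.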
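
Implementation 1

Following the hint in the PA readme, "There are two potential axes of parallelism in this assignment. One axis is parallelism across pixels another is parallelism across circles", we can parallelize over the image's pixels instead: each thread handles one pixel and, inside the kernel, walks the circles in dependency order, calling shadePixel for each. This implementation is correct, but its performance is poor: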

__global__
void kernelRenderPixels()
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int imageWidth = cuConstRendererParams.imageWidth;
    int imageHeight = cuConstRendererParams.imageHeight;

    // index is zero-based, so the first out-of-range value is width * height
    if (index >= imageWidth * imageHeight) {
        return;
    }

    float invWidth = 1.f / imageWidth;
    float invHeight = 1.f / imageHeight;
    int pixelY = index / imageWidth;
    int pixelX = index % imageWidth;
    float2 pixelCenterNorm = make_float2(invWidth * (static_cast<float>(pixelX) + 0.5f),
                                         invHeight * (static_cast<float>(pixelY) + 0.5f));
    float4* imgPtr = (float4*)(&cuConstRendererParams.imageData[4 * index]);

    // visit the circles in order so that blending respects their dependency
    for (int circleIndex = 0; circleIndex < cuConstRendererParams.numCircles; ++circleIndex)
    {
        float3 circlePosition = *(float3*)(&cuConstRendererParams.position[3 * circleIndex]);
        shadePixel(circleIndex, pixelCenterNorm, circlePosition, imgPtr);
    }
}

void
CudaRenderer::render() {
    // 256 threads per block is a healthy number
    dim3 blockDim(256, 1);
    dim3 gridDim((image->width * image->height + blockDim.x - 1) / blockDim.x);

    kernelRenderPixels<<<gridDim, blockDim>>>();
    cudaDeviceSynchronize();
}

The test results for this implementation:

--------------------------------------------------------------------------
| Scene Name      | Ref Time (T_ref) | Your Time (T)   | Score           |
--------------------------------------------------------------------------
| rgb             | 0.7622           | 0.6416          | 9               |
| rand10k         | 5.0429           | 77.2235         | 2               |
| rand100k        | 45.496           | 813.3614        | 2               |
| pattern         | 1.0921           | 8.9045          | 3               |
| snowsingle      | 29.2303          | 773.4755        | 2               |
| biglittle       | 27.4429          | 96.223          | 4               |
| rand1M          | 328.6162         | 8376.8588       | 2               |
| micro2M         | 606.1327         | 16774.7623      | 2               |
--------------------------------------------------------------------------
|                                    | Total score:    | 26/72           |
--------------------------------------------------------------------------

Implementation 2

To improve performance, we split the image into 16×16 tiles, and each thread block handles one tile. The work consists of:

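  1. Test in parallel whether the tile intersects each circle (256 circles are tested at a time, one per thread)
  2. If the tile intersects a circle, call shadePixel to render it; if not, skip the circle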
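The tile-versus-circle test in step 1 can be sketched in C++ as a cheap conservative check followed by an exact one. Both functions below are hypothetical illustrations written for this note, with their own parameter order; they are not the circleInBoxConservative and circleInBox helpers that ship with the assignment.

```cpp
#include <algorithm>

// Conservative test: treats the circle as its bounding square, so it may
// report an overlap near a tile corner where the circle does not reach.
bool circleMayTouchBox(float cx, float cy, float r,
                       float boxL, float boxR, float boxB, float boxT) {
    return cx >= boxL - r && cx <= boxR + r &&
           cy >= boxB - r && cy <= boxT + r;
}

// Exact test: clamp the circle centre to the box to find the closest point
// of the box, then compare the squared distance with the squared radius.
bool circleTouchesBox(float cx, float cy, float r,
                      float boxL, float boxR, float boxB, float boxT) {
    const float nearestX = std::max(boxL, std::min(cx, boxR));
    const float nearestY = std::max(boxB, std::min(cy, boxT));
    const float dx = cx - nearestX;
    const float dy = cy - nearestY;
    return dx * dx + dy * dy <= r * r;
}
```

For a unit tile spanning 0 to 1 on both axes, a circle of radius 1 centred at (1.8, 1.8) passes the conservative test but fails the exact one, which is the corner case the second check is meant to filter out.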

#define BLOCKNUMX 16
#define BLOCKNUMY 16
#define BLOCKSIZE 256

__global__
void kernelRenderPixels()
{
    __shared__ int isBoxInCircle[BLOCKSIZE];

    int pixelX = blockIdx.x * blockDim.x + threadIdx.x;
    int pixelY = blockIdx.y * blockDim.y + threadIdx.y;
    int imageWidth = cuConstRendererParams.imageWidth;
    int imageHeight = cuConstRendererParams.imageHeight;
    float invWidth = 1.f / imageWidth;
    float invHeight = 1.f / imageHeight;

    // Threads whose pixel lies outside the image must not return early: they
    // still test circles for the block and must reach every __syncthreads().
    bool pixelInImage = (pixelX < imageWidth && pixelY < imageHeight);

    int boxL = blockIdx.x * blockDim.x;
    int boxR = (min(blockIdx.x * blockDim.x + blockDim.x, imageWidth));
    int boxB = blockIdx.y * blockDim.y;
    int boxT = (min(blockIdx.y * blockDim.y + blockDim.y, imageHeight));
    float boxLInv = boxL * invWidth;
    float boxRInv = boxR * invWidth;
    float boxTInv = boxT * invHeight;
    float boxBInv = boxB * invHeight;
    int linearThreadIndex =  threadIdx.y * blockDim.x + threadIdx.x;
    float2 pixelCenterNorm = make_float2(invWidth * (static_cast<float>(pixelX) + 0.5f),
                                         invHeight * (static_cast<float>(pixelY) + 0.5f));

    for (int batchStartIndexForCircles = 0; batchStartIndexForCircles < cuConstRendererParams.numCircles; batchStartIndexForCircles += BLOCKSIZE)
    {
        // each thread tests one circle of the current batch against the tile
        int indexForCircles = batchStartIndexForCircles + linearThreadIndex;
        if (indexForCircles >= cuConstRendererParams.numCircles)
        {
            isBoxInCircle[linearThreadIndex] = 0;
        }
        else {
            float circleX = cuConstRendererParams.position[3 * indexForCircles];
            float circley = cuConstRendererParams.position[3 * indexForCircles + 1];
            float circleRadius = cuConstRendererParams.radius[indexForCircles];
            isBoxInCircle[linearThreadIndex] = circleInBoxConservative(circleX, circley, circleRadius, boxLInv, boxRInv, boxTInv, boxBInv) ?
                circleInBox(circleX, circley, circleRadius, boxLInv, boxRInv, boxTInv, boxBInv) : 0;
        }
        __syncthreads();

        if (pixelInImage)
        {
            for (int i = batchStartIndexForCircles; i < batchStartIndexForCircles + BLOCKSIZE; ++i)
            {
                if (isBoxInCircle[i % BLOCKSIZE])
                {
                    float4* imgPtr = (float4*)(&cuConstRendererParams.imageData[4 * (pixelY * imageWidth + pixelX)]);
                    float3 circlePosition = *(float3*)(&cuConstRendererParams.position[3 * i]);
                    shadePixel(i, pixelCenterNorm, circlePosition, imgPtr);
                }
            }
        }
        __syncthreads();
    }
}

void
CudaRenderer::render()
{
    dim3 blockDim(BLOCKNUMX, BLOCKNUMY);
    dim3 gridDim((image->width + blockDim.x - 1) / blockDim.x, (image->height + blockDim.y - 1) / blockDim.y);
    kernelRenderPixels<<<gridDim, blockDim>>>();
}

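Note that the final __syncthreads() is required. Without it there is a race: a thread that finishes the shading loop early could start the next batch and overwrite isBoxInCircle while other threads in the block are still reading the current batch's flags. Now measure the performance again: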

--------------------------------------------------------------------------
| Scene Name      | Ref Time (T_ref) | Your Time (T)   | Score           |
--------------------------------------------------------------------------
| rgb             | 0.8281           | 1.5998          | 6               |
| rand10k         | 5.1915           | 37.4328         | 3               |
| rand100k        | 45.6692          | 345.5081        | 3               |
| pattern         | 1.0687           | 4.5626          | 4               |
| snowsingle      | 28.8797          | 295.1067        | 2               |
| biglittle       | 26.8013          | 60.9962         | 6               |
| rand1M          | 339.1135         | 3399.8742       | 2               |
| micro2M         | 609.6217         | 6839.2456       | 2               |
--------------------------------------------------------------------------
|                                    | Total score:    | 28/72           |
--------------------------------------------------------------------------

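Amazing! Running times roughly halved on most scenes, but they are still about 7 to 10 times slower than the reference on the large ones, so the score barely moves (26 to 28)......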

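Implementation 3

Out of ideas, I looked at other people's implementations and found that, besides tiling the image, they also use sharedMemExclusiveScan inside the first loop: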

__global__
void kernelRenderPixels()
{
    __shared__ uint isBoxInCircle[BLOCKSIZE];
    __shared__ uint prefixSumOutput[BLOCKSIZE];
    __shared__ uint prefixSumScratch[2 * BLOCKSIZE];
    __shared__ int inBoxCircleIndexes[BLOCKSIZE];

    // ...

    int linearThreadIndex =  threadIdx.y * blockDim.x + threadIdx.x;
    float2 pixelCenterNorm = make_float2(cuConstRendererParams.invWidth * (static_cast<float>(pixelX) + 0.5f),
                                         cuConstRendererParams.invHeight * (static_cast<float>(pixelY) + 0.5f));
    isBoxInCircle[linearThreadIndex] = 0;
    inBoxCircleIndexes[linearThreadIndex] = -1;

    for (int batchStartIndexForCircles = 0; batchStartIndexForCircles < cuConstRendererParams.numCircles; batchStartIndexForCircles += BLOCKSIZE)
    {
        int indexForCircles = batchStartIndexForCircles + linearThreadIndex;
        if (indexForCircles < cuConstRendererParams.numCircles)
        {
            float circleX = cuConstRendererParams.position[3 * indexForCircles];
            float circley = cuConstRendererParams.position[3 * indexForCircles + 1];
            float circleRadius = cuConstRendererParams.radius[indexForCircles];
            isBoxInCircle[linearThreadIndex] =  circleInBox(circleX, circley, circleRadius, boxLInv, boxRInv, boxTInv, boxBInv);
        }
        else
        {
            // past the last circle: clear the flag left over from the previous batch
            isBoxInCircle[linearThreadIndex] = 0;
        }
        __syncthreads();

        // should use sharedMemExclusiveScan to improve performance
        // but why is the performance **far far** better than the version without the exclusive scan?
        sharedMemExclusiveScan(linearThreadIndex, isBoxInCircle, prefixSumOutput, prefixSumScratch, BLOCKSIZE);
        if (isBoxInCircle[linearThreadIndex]) {
            inBoxCircleIndexes[prefixSumOutput[linearThreadIndex]] = indexForCircles;
        }
        __syncthreads();

        int numOfIntescetedCircles = prefixSumOutput[BLOCKSIZE - 1] + isBoxInCircle[BLOCKSIZE - 1];
        for (int i = 0; i < numOfIntescetedCircles; ++i) // only iterate over circles that actually intersect
        {
            float4* imgPtr = (float4*)(&cuConstRendererParams.imageData[4 * (pixelY * imageWidth + pixelX)]);
            float3 circlePosition = *(float3*)(&cuConstRendererParams.position[3 * inBoxCircleIndexes[i]]);
            shadePixel(inBoxCircleIndexes[i], pixelCenterNorm, circlePosition, imgPtr);
        }
        // keep fast threads from overwriting the shared arrays for the next
        // batch while others are still shading this one
        __syncthreads();
    }
}
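What the scan buys inside one batch is a compact, order-preserving list of the circles that touch the tile. The C++ sketch below mimics that step sequentially. The function name is hypothetical, the variable names follow the kernel, and a plain running sum stands in for sharedMemExclusiveScan.

```cpp
#include <cstddef>
#include <vector>

// Given the per-circle flags of one batch, return the global indices of the
// flagged circles, in ascending order, as the kernel's shading loop sees them.
std::vector<int> compactBatch(const std::vector<unsigned>& isBoxInCircle,
                              int batchStart) {
    const std::size_t n = isBoxInCircle.size();
    if (n == 0) return {};

    // Exclusive prefix sum: slot of each flagged circle in the compact list.
    std::vector<unsigned> prefixSumOutput(n, 0);
    unsigned running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        prefixSumOutput[i] = running;
        running += isBoxInCircle[i];
    }

    // Same count formula as the kernel: last prefix value plus last flag.
    const unsigned count = prefixSumOutput[n - 1] + isBoxInCircle[n - 1];

    std::vector<int> inBoxCircleIndexes(count);
    for (std::size_t i = 0; i < n; ++i)
        if (isBoxInCircle[i])
            inBoxCircleIndexes[prefixSumOutput[i]] = batchStart + static_cast<int>(i);
    return inBoxCircleIndexes;
}
```

For flags 0, 1, 1, 0, 0, 1, 0, 1 and a batch starting at circle 256, the list is 257, 258, 261, 263. Because slots are assigned by prefix sum, the circles stay in their original order, which the blending step depends on.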

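At first I thought this optimization would be negligible: the second-level for loop runs at most 256 iterations per batch, which I considered roughly O(1) next to the number of circles in the test scenes (100k), and the cost of calling sharedMemExclusiveScan might even outweigh the savings from the shorter loop. But the test results were truly amazing; it improved a lot: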

--------------------------------------------------------------------------
| Scene Name      | Ref Time (T_ref) | Your Time (T)   | Score           |
--------------------------------------------------------------------------
| rgb             | 0.6982           | 0.702           | 9               |
| rand10k         | 6.4562           | 7.4349          | 9               |
| rand100k        | 45.3627          | 55.6973         | 8               |
| pattern         | 1.0167           | 0.9699          | 9               |
| snowsingle      | 34.3391          | 37.9076         | 9               |
| biglittle       | 25.9874          | 53.0712         | 6               |
| rand1M          | 322.1914         | 325.6942        | 9               |
| micro2M         | 586.4424         | 597.0768        | 9               |
--------------------------------------------------------------------------
|                                    | Total score:    | 68/72           |
--------------------------------------------------------------------------

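But why? I can only guess that the for loop in implementation 3 no longer contains the if branch, which greatly reduces execution divergence. One caveat for this guess: the flag isBoxInCircle[i % BLOCKSIZE] tested in implementation 2 has the same value for every thread in the block, so that branch alone should not make the threads of a warp diverge. The two loops side by side: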

// Implementation 2
for (int i = batchStartIndexForCircles; i < batchStartIndexForCircles + BLOCKSIZE; ++i)
{
    if (isBoxInCircle[i % BLOCKSIZE])
    {
        float4* imgPtr = (float4*)(&cuConstRendererParams.imageData[4 * (pixelY * imageWidth + pixelX)]);
        float3 circlePosition = *(float3*)(&cuConstRendererParams.position[3 * i]);
        shadePixel(i, pixelCenterNorm, circlePosition, imgPtr);
    }
}

// Implementation 3
int numOfIntescetedCircles = prefixSumOutput[BLOCKSIZE - 1] + isBoxInCircle[BLOCKSIZE - 1];
for (int i = 0; i < numOfIntescetedCircles; ++i) // only iterate over circles that actually intersect
{
    float4* imgPtr = (float4*)(&cuConstRendererParams.imageData[4 * (pixelY * imageWidth + pixelX)]);
    float3 circlePosition = *(float3*)(&cuConstRendererParams.position[3 * inBoxCircleIndexes[i]]);
    shadePixel(inBoxCircleIndexes[i], pixelCenterNorm, circlePosition, imgPtr);
}

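I had hoped to verify this guess with a profiling tool such as ncu, but after fiddling for a whole evening I still could not get the versions of ncu, the driver, and the CUDA runtime to line up, so I will leave it at that for now.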

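In addition, calling circleInBoxConservative before circleInBox did not bring any measurable improvement, and removing the if from the first-level for loop did not help much either: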

    for (int batchStartIndexForCircles = 0; batchStartIndexForCircles < cuConstRendererParams.numCircles; batchStartIndexForCircles += BLOCKSIZE)
    {
        int indexForCircles = batchStartIndexForCircles + linearThreadIndex;
        // note: without the bounds check, threads in the last batch can read past numCircles
        float circleX = cuConstRendererParams.position[3 * indexForCircles];
        float circley = cuConstRendererParams.position[3 * indexForCircles + 1];
        float circleRadius = cuConstRendererParams.radius[indexForCircles];
        isBoxInCircle[linearThreadIndex] =  circleInBox(circleX, circley, circleRadius, boxLInv, boxRInv, boxTInv, boxBInv);
        __syncthreads();
        // ...