CANNBot Simulator V2 Reference

Simulator V2 Reference

[Free download link] cannbot-skills — CANNBot is a family of agents for CANN development aimed at improving development efficiency; this repository provides its reusable Skills modules. Project URL: https://gitcode.com/cann/cannbot-skills

Read this file when the question is specifically about how simulator execution works now. Do not use it as a replacement for kernel-authoring or general architecture docs.

Goal

Capture the current simulator execution path so future work does not rely on removed or stale easyasc/simulator/ assumptions.

1. Current default

The repository's simulator path is now the V2 runtime.

Current behavior:

  • OpExec(..., simulator=True) enables simulator execution
  • OpExec(..., simulator="v2") is an accepted spelling for the same path
  • OpExec(..., simulator="legacy") is still accepted by OpExec, but it does not select a separate old runtime anymore; it still routes to V2
  • KernelBase.run_sim() always calls _run_sim_v2()

Practical rule:

  • do not document or debug a separate easyasc/simulator/ runtime as if it were still active
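The accepted spellings above reduce to one normalization rule. A minimal sketch, assuming a hypothetical helper name (normalize_simulator_arg is not part of the repo, and this is not the real OpExec signature):

```python
# Hypothetical sketch of the documented rule: True, "v2", and "legacy"
# all select the same V2 simulator path.
def normalize_simulator_arg(simulator):
    """Map the accepted OpExec `simulator` spellings onto one runtime choice."""
    if simulator in (True, "v2", "legacy"):
        return "v2"      # "legacy" no longer selects a separate old runtime
    if simulator in (False, None):
        return None      # simulator execution disabled
    raise ValueError(f"unsupported simulator argument: {simulator!r}")
```

The point of the sketch is the middle branch: "legacy" is a surviving spelling, not a surviving runtime.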

2. How a kernel becomes a V2 program

The simulator build entry lives in easyasc/kernelbase/kernelbase.py.

The selection order is:

  1. custom builder via kernel._simulator_v2_program_builder
  2. prebuilt program via kernel._simulator_v2_program
  3. auto analysis + auto bridge selection
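The three-step fallback can be sketched as a simple chain. The attribute names mirror the ones listed above; select_program and analyze_and_bridge are illustrative helper names, not the real entry points:

```python
# Sketch of the documented selection order: custom builder first,
# then a prebuilt program, then auto analysis + bridge selection.
def select_program(kernel):
    builder = getattr(kernel, "_simulator_v2_program_builder", None)
    if builder is not None:
        return builder(kernel)            # 1. custom builder wins
    prebuilt = getattr(kernel, "_simulator_v2_program", None)
    if prebuilt is not None:
        return prebuilt                   # 2. prebuilt program
    return analyze_and_bridge(kernel)     # 3. auto analysis + auto bridge

def analyze_and_bridge(kernel):
    # Stand-in for the auto path; the real code analyzes the instruction
    # stream and picks a compat bridge.
    return ("auto", kernel)
```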

Auto bridge selection:

  • if the instruction stream contains control-flow, topology queries, call_micro, VarList, or cross-lane sync helpers, V2 uses easyasc/simulator_v2/compat/control_flow_bridge.py
  • otherwise V2 uses the narrow linear bridge in easyasc/simulator_v2/compat/kernel_bridge.py

Important difference:

  • control_flow_bridge.py preserves loops/conditionals and defers resolution to the runtime
  • kernel_bridge.py only covers a narrower linear lowered-instruction subset
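The auto selection amounts to a membership test over the instruction stream. A sketch under the assumption that instructions expose opcode names; the trigger-set contents are illustrative, not the real detection list:

```python
# Illustrative trigger set: any of these opcode families forces the
# control-flow bridge per the documented rule.
CONTROL_FLOW_TRIGGERS = {
    "if", "for_range", "topology_query", "call_micro", "var_list",
    "cross_lane_sync",
}

def pick_bridge(instruction_opnames):
    """Return which compat bridge the documented rule would choose."""
    if any(op in CONTROL_FLOW_TRIGGERS for op in instruction_opnames):
        return "compat/control_flow_bridge.py"  # preserves loops/conditionals
    return "compat/kernel_bridge.py"            # narrow linear subset
```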

3. Runtime stack

The runtime is split across these layers:

  • parent coordinator: easyasc/simulator_v2/runtime/global_runtime.py
  • core process wrapper: easyasc/simulator_v2/runtime/core_process.py
  • per-core runtime: easyasc/simulator_v2/runtime/core_runtime.py
  • lane-level control interpreter: easyasc/simulator_v2/runtime/control_actor.py
  • pipe worker threads: easyasc/simulator_v2/runtime/pipe_worker.py
  • pipe executors: easyasc/simulator_v2/ops/

Execution shape:

  • one parent GlobalRuntime
  • one child CoreProcess per simulated core
  • inside each active core, one ControlActor per active lane
  • inside each lane, one threaded PipeWorker per logical pipe

Launch rule:

  • start simulator repros from a real .py file, not from stdin entry points such as python - <<'PY' or piped scripts
  • V2 uses multiprocessing during startup, and Python spawn must be able to re-import __main__ from a real filesystem path; stdin entry points appear as <stdin> and break child startup
  • when the launcher lives outside the repo root, include the repo root in PYTHONPATH so child processes can import local modules consistently
  • safe pattern: PYTHONPATH=/abs/path/to/repo python /tmp/repro.py
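The spawn constraint can be checked before launching. A hypothetical pre-flight helper (not part of the repo) that detects stdin-style entry points:

```python
import sys

# Spawn-based multiprocessing re-imports __main__ in each child, which
# requires a real, importable file path. A <stdin> or missing __file__
# entry point cannot be re-imported, so child startup fails.
def main_is_spawn_safe():
    main_file = getattr(sys.modules["__main__"], "__file__", None)
    return bool(main_file) and main_file != "<stdin>"
```

Calling this at the top of a repro script and failing fast with a clear message is cheaper than debugging a broken child-process import later.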

Completion / shutdown facts:

  • pipe workers already stop through mailbox sentinels; the thread layer does not need a special end instruction
  • parent / child completion now uses a one-shot status channel that the parent polls while joining
  • GlobalRuntime.run() uses one global execution deadline across all active cores, not a full timeout budget per core in sequence
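The mailbox-sentinel shutdown can be sketched with a plain queue and thread; the task payload and worker body here are stand-ins for real pipe instructions:

```python
import queue
import threading

# The worker loop exits when it dequeues the sentinel, so the thread
# layer needs no special "end" instruction in the stream itself.
_SENTINEL = None

def pipe_worker(mailbox, results):
    while True:
        task = mailbox.get()
        if task is _SENTINEL:        # shutdown request from the coordinator
            break
        results.append(task * 2)     # stand-in for executing one pipe op

mailbox, results = queue.Queue(), []
worker = threading.Thread(target=pipe_worker, args=(mailbox, results))
worker.start()
for task in (1, 2, 3):
    mailbox.put(task)
mailbox.put(_SENTINEL)               # one sentinel per worker ends the loop
worker.join()
```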

4. Planning and activation

Core and lane activation are resolved by:

  • easyasc/simulator_v2/config.py
  • easyasc/simulator_v2/runtime/execution_plan.py
  • easyasc/simulator_v2/helpers.py

Key facts:

  • default core count follows the active device family (950 -> 32, b3 -> 20)
  • V2 can skip inactive lanes when a program only uses a subset of cube/vec lanes
  • collective ops (allcube_*, allvec_*) affect lane-activation planning
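The default core-count rule is a fixed device-family lookup. A sketch with a hypothetical helper name (the real resolution lives in config.py / execution_plan.py):

```python
# Documented defaults: device family "950" -> 32 cores, "b3" -> 20 cores.
_DEFAULT_CORES = {"950": 32, "b3": 20}

def default_core_count(device_family):
    try:
        return _DEFAULT_CORES[device_family]
    except KeyError:
        raise ValueError(f"unknown device family: {device_family!r}")
```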

5. Memory and tensor state

Shared tensor setup lives in:

  • easyasc/simulator_v2/memory/shared_tensor.py
  • easyasc/simulator_v2/memory/shared_tensor_store.py
  • easyasc/simulator_v2/memory/tensor_view.py
  • easyasc/simulator_v2/memory/workspace.py
  • easyasc/simulator_v2/memory/local_memory.py

Important facts:

  • OpExec clones input tensors into GMTensor.data
  • V2 copies that payload into the shared runtime tensor store before execution
  • after execution, V2 copies runtime tensors back into the bound GMTensor.data
  • workspaces and local buffers are represented as shared-tensor specs in program metadata
  • child-core local tensors now go through a bank-aware allocator (UB0/UB1/L1/L0A/L0B/L0C); over-capacity local allocations fail before pipe execution starts
  • runtime-created local slice snapshots must treat a root local tensor's SharedTensorSpec.storage_offset as allocator bookkeeping in bytes rather than as an extra in-storage element offset; only nested local views should re-apply a parent storage_offset when control_actor.py materializes a dynamic slice
  • simulator-side GM atomic_add/atomic_max/atomic_min now serialize their read-modify-write sections through a shared store-wide atomic lock, so cross-core atomic writebacks do not lose updates under contention
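The store-wide atomic lock can be modeled with threads standing in for cores; SharedStore and its atomic_add here are illustrative, not the real store API:

```python
import threading

# One lock for the whole store serializes every read-modify-write
# section, so concurrent atomic_add updates are never lost.
class SharedStore:
    def __init__(self):
        self._atomic_lock = threading.Lock()
        self.data = {"acc": 0}

    def atomic_add(self, key, value):
        with self._atomic_lock:              # serialize the RMW section
            self.data[key] = self.data[key] + value

store = SharedStore()

def hammer():
    # Stand-in for one core issuing many atomic writebacks.
    for _ in range(1000):
        store.atomic_add("acc", 1)

threads = [threading.Thread(target=hammer) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the lock, the read and the write of each update could interleave across threads and drop increments under contention.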

Regression note:

  • testcases/simulator/memory/test_simulator_v2_slice_tensor.py covers the sliced-UB vec-mul case where several prefix UB allocations push the sliced root tensor onto a non-zero local bank offset before runtime snapshotting

6. Sync and control

The main sync/control pieces are:

  • intra-core sync: easyasc/simulator_v2/sync/intra_core_sync.py
  • collective sync: easyasc/simulator_v2/sync/collective_sync.py
  • lane-local flags: easyasc/simulator_v2/sync/local_flags.py
  • lane-local events: easyasc/simulator_v2/sync/local_events.py
  • worker mailboxes: easyasc/simulator_v2/sync/mailbox.py

Important facts:

  • collective sync state is process-shared at runtime; GlobalRuntime snapshots the parent CollectiveSync, and each child core reloads that shared state instead of creating a private per-process coordinator
  • lane-local barrier(pipe=...) currently has special runtime behavior only for barrier(ALL); non-ALL barriers are preserved as control instructions but act as no-ops in the V2 runtime main loop
  • practical consequence for kernel debugging: bar_v()/bar_mte2()/other single-pipe barriers do not serialize cross-pipe edges such as V -> MTE2 on the simulator path; when a repro needs a simulator-visible local drain across pipe domains, use bar_all()
  • setflag/waitflag still use the phase-based LocalFlagTable, but local SEvent/DEvent no longer do: V2 now models them with a per-lane flag bank keyed by (src_pipe, dst_pipe, flag_id) and a bool value per flag
  • create_sevent allocates one flag_id from the lane-local pool for its (src_pipe, dst_pipe) pair; create_devent allocates two consecutive ids from that same pair-local pool
  • SEvent.set() sets its single flag to 1 and errors if it is already 1; SEvent.wait() blocks until that flag becomes 1, then clears it back to 0
  • DEvent keeps two independent bool flags plus separate set_count/wait_count cursors: the producer-side set path alternates flag0, flag1, flag0, ..., and the consumer-side wait path alternates on its own cursor over the same two flags
  • event_setall is modeled as repeated set() calls on the same event object rather than as a special bulk primitive; for DEvent that usually means setting both flags in rotation order, while SEvent.setall() will replay set() twice and therefore errors on the second call if the single flag is still set
  • event_release is modeled as repeated wait() calls: SEvent.release() performs one wait, while DEvent.release() performs one wait and then a second wait only when a second outstanding token is already pending on the other rotated flag
  • practical consequence for trace/timing work: local event blocking must now be reasoned about per real flag_id, not per event_name
  • regression coverage: testcases/simulator/bridge/test_simulator_v2_control_flow.py
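The SEvent/DEvent flag-bank model can be sketched as plain Python state machines. This is a non-blocking approximation: the real runtime blocks inside wait(), while this sketch raises when a flag is not ready so the rotation logic stays testable:

```python
# Sketch of the documented per-lane flag-bank model; class and method
# names mirror the doc, but the bodies are illustrative, not the real code.
class SEvent:
    def __init__(self):
        self.flag = 0                  # one flag_id from the lane-local pool

    def set(self):
        if self.flag == 1:
            raise RuntimeError("SEvent already set")
        self.flag = 1

    def wait(self):
        if self.flag != 1:             # real runtime blocks here instead
            raise RuntimeError("SEvent not set")
        self.flag = 0                  # clear back to 0 after waiting

class DEvent:
    def __init__(self):
        self.flags = [0, 0]            # two consecutive flag ids
        self.set_count = 0             # producer-side rotation cursor
        self.wait_count = 0            # independent consumer-side cursor

    def set(self):
        idx = self.set_count % 2       # flag0, flag1, flag0, ...
        if self.flags[idx] == 1:
            raise RuntimeError("DEvent flag already set")
        self.flags[idx] = 1
        self.set_count += 1

    def wait(self):
        idx = self.wait_count % 2      # consumer rotates on its own cursor
        if self.flags[idx] != 1:       # real runtime blocks here instead
            raise RuntimeError("DEvent flag not set")
        self.flags[idx] = 0
        self.wait_count += 1
```

Replaying set() twice on an SEvent reproduces the documented setall() failure mode: the second call errors because the single flag is still 1.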

When debugging a hang:

  • inspect the original failing lane error first
  • then inspect the sync state / timeout diagnostic
  • do not assume the timeout itself is the root cause

When a child core raises an exception:

  • GlobalRuntime.run() now raises the combined per-core traceback text directly
  • do not rely on a generic parent-side wrapper message; the actionable failure should already be in the thrown exception string
  • pipe-worker instruction failures now print an immediate stderr log with lane/pipe/opname/error, control-side wait_* paths poll worker failures while waiting, and CoreRuntime.join() prefers surfacing the more actionable worker/task failure over a secondary sync-timeout symptom when multiple lane actors fail

7. Trace path

Trace recording lives in:

  • easyasc/simulator_v2/trace/recorder.py
  • easyasc/simulator_v2/trace/merge.py
  • easyasc/simulator_v2/trace/chrome.py
  • a5 cycle-model profile and estimators: easyasc/simulator_v2/timing/

Runtime flow:

  • each core records its own events
  • parent runtime merges them after execution
  • dump_chrome_trace(...) exports Chrome/Perfetto-style JSON
  • runtime event timestamps originate from time.monotonic()
  • exported Chrome traces normalize those timestamps into a per-run relative axis instead of replacing them with event-order indices
  • exported dur now reflects measured task/wait spans when the runtime recorded them; zero-duration control markers still use a tiny fallback width only to stay visible in viewers
  • sync-heavy kernels may now emit explicit sync trace events for wait/ready phases in addition to pipe execution events
  • on a5 (device_type == "950"), the runtime can now switch trace timing to a cycle-model domain driven by the JSON profile under timing/; in that mode easyasc_time_domain == "cycle" is exported in the trace payload and task args include the modeling breakdown
  • current a5 cycle-model defaults treat one ordinary V-pipe instruction as 2 cycles
  • for call_micro/@vf() timing, register <-> UB shuffle instructions are counted as 0 cycles: micro_ub2reg, micro_reg2ub, micro_ub2regcont, micro_reg2ubcont
  • in cycle-model mode, direct control-side waits (event_wait, wait_vec, wait_cube, collective waits) now advance the control actor's cycle cursor, but event_set no longer acts as a lane-global block for later unrelated pipe dispatch; its ready time is derived from the completed source pipe, and unrelated pipes can start as soon as their own event dependencies are satisfied
  • lane-local event_wait/event_release can now be lowered into the destination pipe worker queue, so the blocking happens on that pipe thread instead of only on the control actor; event_set/event_setall intentionally stay control-side because their position in the instruction stream still defines autosync lifetime boundaries
  • trace export now consults globvars.trace_event (default False): when disabled, all sync-style trace markers are omitted from dispatch, pipe, and sync tracks, including lane-local event_*, local flag waits, intra-core handoff ops such as wait_vec/cube_ready, and collective all* sync ops; tests or debugging sessions that need those markers must enable the flag explicitly before running the simulator
  • when optimizing from the trace view, keep globvars.trace_event at its default False unless the specific goal is to inspect sync/event behavior; turning it on adds sync markers that are useful for debugging but can distract from the steady-state scheduling picture you usually want for optimization work
  • when optimizing cycle count from a trace, use the trace makespan as the objective: the cycle at which the last timed event finishes (max(ts + dur) over ph == "X" events); do not optimize for the sum of all timed durations or "total activated cycles", since those overcount parallel overlap and can rank kernels differently from the real end-to-end completion time
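The makespan objective from the last bullet can be computed directly from exported Chrome-trace events (ph, ts, and dur are standard Chrome Trace Event Format fields):

```python
# Makespan = finish time of the last timed ("X" phase) event, relative
# to the earliest timed start; metadata events are ignored.
def trace_makespan(events):
    timed = [e for e in events if e.get("ph") == "X"]
    if not timed:
        return 0
    start = min(e["ts"] for e in timed)
    end = max(e["ts"] + e.get("dur", 0) for e in timed)
    return end - start

# Two overlapping 10-cycle tasks: the duration sum is 20, but the
# makespan is 15 because the tasks overlap by 5 cycles.
events = [
    {"ph": "X", "ts": 0, "dur": 10},
    {"ph": "X", "ts": 5, "dur": 10},
    {"ph": "M", "ts": 0},            # metadata event, not timed work
]
```

This is why duration sums mis-rank kernels: adding overlap lowers the makespan while leaving the duration sum unchanged.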

8. Vec and micro execution

Key implementation files:

  • vec runtime entry: easyasc/simulator_v2/ops/vec/v.py
  • vec legacy-layout helper: easyasc/simulator_v2/ops/vec/_legacy_vpipe.py
  • vec MTE2 path: easyasc/simulator_v2/ops/vec/mte2.py
  • vec MTE3 path: easyasc/simulator_v2/ops/vec/mte3.py
  • micro runtime: easyasc/simulator_v2/ops/micro/runtime.py
  • pipe dispatch: easyasc/simulator_v2/ops/dispatch.py

Important facts:

  • several vec operations still reuse the legacy layout executor through ops/vec/_legacy_vpipe.py, but they run inside the V2 runtime
  • when gm_to_ub_pad or l0c_to_gm_nz2nd reports a source/destination view that is "too small" on an a2 workspace-mediated tail path, first inspect whether the workspace view was cropped in the column dimension; those bridge ops infer row stride from the parent GM shape, so a cropped workspace column span can fail even when the logical tail math is correct
  • all UB burst copy ops (gm_to_ub_pad in ops/vec/mte2.py, ub_to_gm_pad and ub_to_l1_nz in ops/vec/mte3.py) use _linear_view_from_pointer so that column-sliced UB views (ub[:, 0:valid_n] with valid_n < buffer_cols) round-trip through the underlying storage; any new burst-style op must mirror this pattern or it will falsely raise "view is too small" when the destination is non-contiguous
  • regression coverage: testcases/simulator/datamove/test_gm_to_ub_pad_column_slice.py
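The stride arithmetic behind the "view is too small" false positive can be illustrated at the element level; linear_span_elems is a hypothetical helper, not the real _linear_view_from_pointer:

```python
# For a non-contiguous view of `rows` rows, `valid_cols` valid columns,
# and a parent row stride of `row_stride` elements, the linear span in
# the underlying storage runs from the first element of row 0 to the
# last valid element of the final row.
def linear_span_elems(rows, valid_cols, row_stride):
    if rows == 0 or valid_cols == 0:
        return 0
    return (rows - 1) * row_stride + valid_cols

# 4x8 UB buffer, view keeps only 5 valid columns per row.
rows, buffer_cols, valid_cols = 4, 8, 5
span = linear_span_elems(rows, valid_cols, buffer_cols)  # 3*8 + 5 = 29
```

A naive size check that expects rows * valid_cols contiguous elements (20 here) disagrees with the real 29-element storage footprint, which is how stride-unaware burst ops either over- or under-reject column-sliced views.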

Scalar-semantics reminder:

  • control_flow_bridge.py preserves Var arithmetic as runtime scalar ops such as var_add, var_mul, and var_div
  • control_actor.py and ops/micro/runtime.py must preserve float Var semantics for those ops; do not silently coerce float scalar expressions to int on the runtime path
  • practical symptom of a broken float-scalar path: raw cube/UB data looks correct, but a later @vf() stage that multiplies by a computed scale suddenly collapses to 0
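The float-preserving requirement can be illustrated with minimal stand-ins for the named scalar ops (these are sketches, not the real runtime implementations):

```python
# Illustrative float-preserving scalar ops; the key property is that
# var_div is true division and nothing ever coerces results to int.
def var_add(a, b):
    return a + b

def var_mul(a, b):
    return a * b

def var_div(a, b):
    return a / b          # true division, never floor division

# The documented failure mode: a computed scale like 1/3 is a small
# float; coercing it to int on the runtime path turns it into 0, and
# every later multiply-by-scale stage collapses to 0.
scale = var_div(1.0, 3)
```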

9. Best first files for simulator debugging

  • easyasc/kernelbase/kernelbase.py
  • easyasc/simulator_v2/compat/control_flow_bridge.py
  • easyasc/simulator_v2/compat/kernel_bridge.py
  • easyasc/simulator_v2/runtime/control_actor.py
  • easyasc/simulator_v2/runtime/task_memory_validator.py
    • pre-dispatch memory-range checks now cover shared-tensor helpers, all current cube-pipe tensor ops, vec datamoves, V-pipe tensor ops including packed compare/select, repeat-layout vec instructions, sort32, mergesort*, gather, scatter, task-level micro shared-tensor ops, and call_micro dry-run validation
  • easyasc/simulator_v2/runtime/pipe_worker.py
  • easyasc/simulator_v2/runtime/global_runtime.py
  • testcases/simulator/

