[20260429] Questions about HugePages usage with pre_page_sga=true on 21c, part 3.txt
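--//Follow-up test. In recent tests on 21c with pre_page_sga=true and HugePages in use, I keep finding a small number of
--//huge pages that are reserved but never actually allocated. This differs from 11g, where all of them are allocated.
--//I asked kimi and, after some step-by-step prompting, it gave the following answer:
The real reason: the way SGA shared memory segments are allocated changed in 12c+
Starting with Oracle 11.2.0.3 / 12c, Oracle changed its SGA shared memory allocation strategy to use HugePages more flexibly:
11g: the SGA is usually allocated as a single contiguous shared memory segment. With pre_page_sga=true the whole segment is
touched in one pass at startup, every page gets mapped, and HugePages_Rsvd is therefore 0.
12c+: the SGA is split into several shared memory segments (visible with ipcs -m, typically one main segment plus a smaller
auxiliary one). Each segment is aligned independently on a HugePage (2MB) boundary, which leaves a few gap pages between
segments. The kernel has already reserved these gap pages (counted in Rsvd), but the pre_page_sga touch only covers real data
pages and never reads or writes the pure alignment gap pages, so they remain as HugePages_Rsvd.
The 7 pages (14MB) you see are exactly this multi-segment alignment overhead: it does not occur in 11g's single-segment mode
but is inevitable in 12c+'s multi-segment mode.
--//Yesterday's test showed the following: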
$ grep -i hugepages /proc/meminfo
AnonHugePages: 40960 kB
HugePages_Total: 530
HugePages_Free: 7
HugePages_Rsvd: 7
HugePages_Surp: 0
Hugepagesize: 2048 kB
--//HugePages_Rsvd=7: 7 huge pages are reserved but still untouched. Why?
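The counters above can be read mechanically. Below is a minimal sketch, assuming the usual meaning of the /proc/meminfo fields (HugePages_Free still includes pages that are reserved but not yet faulted in). The helper name hugepage_summary is hypothetical, not an existing tool.

```shell
# Sketch: summarise HugePages counters read from stdin in /proc/meminfo format.
# hugepage_summary is a hypothetical helper name used only for illustration.
hugepage_summary() {
    awk '$1 == "HugePages_Total:" { total = $2 }
         $1 == "HugePages_Free:"  { free = $2 }
         $1 == "HugePages_Rsvd:"  { rsvd = $2 }
         END { print "faulted_in=" (total - free), "reserved_not_faulted=" rsvd }'
}

# Values from the first test:
hugepage_summary <<'EOF'
HugePages_Total: 530
HugePages_Free: 7
HugePages_Rsvd: 7
EOF
```

On a live system the same function can be fed with `hugepage_summary < /proc/meminfo`.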
$ cat /proc/$(pgrep pmon)/maps | grep "rw-s"
60000000-60a00000 rw-s 00000000 00:0c 0 /SYSV00000000 (deleted)
61000000-a2000000 rw-s 00000000 00:0c 32769 /SYSV00000000 (deleted)
a2000000-a2800000 rw-s 00000000 00:0c 65538 /SYSV00000000 (deleted)
a3000000-a3200000 rw-s 00000000 00:0c 98307 /SYSVafa94c20 (deleted)
7f3764b20000-7f3764b21000 rw-s 00000000 08:11 18861347 /u01/app/oracle/dbs/hc_book.dat
--//If gap pages really exist between segments, how large are the gaps?
--//Gap between the shared memory segments on lines 1 and 2:
--//0x61000000-0x60a00000 = 0x600000 = 6291456
--//6291456/2/1024/1024 = 3
--//There is no gap between the shared memory segments on lines 2 and 3.
--//There is a gap between the shared memory segments on lines 3 and 4:
--//0xa3000000-0xa2800000 = 0x800000 = 8388608
--//8388608/2/1024/1024 = 4
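--//3+4 does equal 7, but right after the test I had the feeling that this was merely a coincidence.
--//If the gaps really consumed huge pages, HugePages_Total would have to be 530+7 = 537.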
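The gap arithmetic above can be repeated for any maps listing. This is only a sketch: gap_pages is a hypothetical helper name, and it assumes 2 MiB huge pages and input lines sorted by address in /proc/<pid>/maps format.

```shell
# Sketch: print the gap, in 2 MiB huge pages, between consecutive mappings.
# Input: lines in /proc/<pid>/maps format ("start-end perms ..."), sorted by address.
# gap_pages is a hypothetical helper name, not an Oracle or Linux tool.
gap_pages() {
    prev_end=""
    while read -r range _rest; do
        start=${range%-*}          # hex start address, without 0x
        end=${range#*-}            # hex end address, without 0x
        if [ -n "$prev_end" ]; then
            echo $(( (0x$start - 0x$prev_end) / (2 * 1024 * 1024) ))
        fi
        prev_end=$end
    done
}

# The four SysV segments from the first test; prints 3, 0 and 4 on separate lines.
gap_pages <<'EOF'
60000000-60a00000 rw-s 00000000 00:0c 0 /SYSV00000000 (deleted)
61000000-a2000000 rw-s 00000000 00:0c 32769 /SYSV00000000 (deleted)
a2000000-a2800000 rw-s 00000000 00:0c 65538 /SYSV00000000 (deleted)
a3000000-a3200000 rw-s 00000000 00:0c 98307 /SYSVafa94c20 (deleted)
EOF
```

Against a running instance, something like `grep SYSV /proc/$(pgrep pmon)/maps | gap_pages` should give the same figures. Filtering on SYSV rather than rw-s keeps the hc_book.dat mapping out of the calculation.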
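--//In short, too many questions remain. I also considered changing parameters such as sga_target to verify the explanation above.
--//Another approach: change the kernel parameter kernel.shmmax and see what happens.
1. Test preparation: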
# cat /etc/sysctl.d/98-oracle.conf
fs.file-max = 6815744
kernel.sem = 250 32000 100 128
kernel.shmmni = 4096
kernel.shmall = 1073741824
#kernel.shmmax = 4398046511104
kernel.shmmax = 268435456
kernel.panic_on_oops = 1
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.default.rp_filter = 2
fs.aio-max-nr = 1048576
net.ipv4.ip_local_port_range = 9000 65500
#vm.nr_hugepages = 530
#vm.nr_overcommit_hugepages = 512
vm.nr_hugepages = 530
vm.nr_overcommit_hugepages = 50
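--//Note: kernel.shmmax was originally 4398046511104 bytes, i.e. 4T. My test machine has nowhere near that much memory; the
--//value simply allows a single shared memory segment of up to 4T. It is now 256*1024*1024 = 268435456, i.e. 256M.
2. Test:
--//First, apply the kernel parameters.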
# sysctl -p /etc/sysctl.d/98-oracle.conf
fs.file-max = 6815744
kernel.sem = 250 32000 100 128
kernel.shmmni = 4096
kernel.shmall = 1073741824
kernel.shmmax = 268435456
kernel.panic_on_oops = 1
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048576
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.default.rp_filter = 2
fs.aio-max-nr = 1048576
net.ipv4.ip_local_port_range = 9000 65500
vm.nr_hugepages = 530
vm.nr_overcommit_hugepages = 50
--//Start the database:
SYS@book> startup
ORACLE instance started.
Total System Global Area 1107294056 bytes
Fixed Size 9684840 bytes
Variable Size 654311424 bytes
Database Buffers 436207616 bytes
Redo Buffers 7090176 bytes
Database mounted.
Database opened.
SYS@book> @ hidez ^pre_page_sga|^use_large_pages
NUM N_HEX CON_ID NAME DESCRIPTION DEFAULT_VALUE SESSION_VALUE SYSTEM_VALUE ISSES ISSYS_MOD
--- ----- ------ --------------- ---------------------------------------------- ------------- ------------- ------------ ----- ---------
180 B4 0 use_large_pages Use large pages if available (TRUE/FALSE/ONLY) FALSE ONLY ONLY FALSE FALSE
193 C1 0 pre_page_sga pre-page sga for process TRUE TRUE TRUE FALSE FALSE
--//Check HugePages usage:
# grep -i hugepage /proc/meminfo
AnonHugePages: 32768 kB
HugePages_Total: 530
HugePages_Free: 28
HugePages_Rsvd: 28
HugePages_Surp: 0
Hugepagesize: 2048 kB
--//This time the result differs from before: HugePages_Rsvd is 28.
# ipcs -m
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x00000000 0 oracle 600 10485760 57
0x00000000 32769 oracle 600 268435456 57
0x00000000 65538 oracle 600 268435456 57
0x00000000 98307 oracle 600 268435456 57
0x00000000 131076 oracle 600 268435456 57
0x00000000 163845 oracle 600 16777216 57
0x00000000 196614 oracle 600 8388608 57
0xafa94c20 229383 oracle 600 2097152 57
--//10485760 /2/1024/1024 = 5
--//268435456/2/1024/1024 = 128
--//268435456/2/1024/1024 = 128
--//268435456/2/1024/1024 = 128
--//268435456/2/1024/1024 = 128
--//16777216 /2/1024/1024 = 8
--//8388608 /2/1024/1024 = 4
--//2097152 /2/1024/1024 = 1
--//Sum = 530
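The per-segment division above can be done in one pass over the `ipcs -m` output. A sketch only: seg_hugepages is a hypothetical helper, and it assumes the column layout shown above (column 2 is shmid, column 5 is bytes, data rows start with a 0x key) and 2 MiB huge pages.

```shell
# Sketch: convert the "bytes" column of `ipcs -m` to 2 MiB huge pages and total them.
# seg_hugepages is a hypothetical helper; it assumes column 2 = shmid and column 5 = bytes.
seg_hugepages() {
    awk '$1 ~ /^0x/ { pages = $5 / (2 * 1024 * 1024); total += pages; print $2, pages }
         END { print "Sum", total }'
}

# The eight segments from this test; the last line printed is "Sum 530".
seg_hugepages <<'EOF'
0x00000000 0 oracle 600 10485760 57
0x00000000 32769 oracle 600 268435456 57
0x00000000 65538 oracle 600 268435456 57
0x00000000 98307 oracle 600 268435456 57
0x00000000 131076 oracle 600 268435456 57
0x00000000 163845 oracle 600 16777216 57
0x00000000 196614 oracle 600 8388608 57
0xafa94c20 229383 oracle 600 2097152 57
EOF
```

The total equals HugePages_Total (530), so the segments themselves account for the whole pool.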
# ipcs -mu
------ Shared Memory Status --------
segments allocated 8
pages allocated 271360
pages resident 257024
pages swapped 0
Swap performance: 0 attempts 0 successes
--//The SGA is split into 8 shared memory segments.
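The `ipcs -mu` counters can be cross-checked against /proc/meminfo. This sketch assumes the counters are in 4 KiB base pages (the usual x86-64 page size), so that 512 of them make one 2 MiB huge page. The helper name to_hugepages is hypothetical.

```shell
# Sketch: convert `ipcs -mu` page counts (assumed 4 KiB base pages) to 2 MiB huge pages.
# 512 base pages make up one huge page (2048 kB / 4 kB).
to_hugepages() {
    echo $(( $1 / 512 ))
}

allocated=$(to_hugepages 271360)   # "pages allocated"
resident=$(to_hugepages 257024)    # "pages resident"
echo "allocated=$allocated resident=$resident untouched=$(( allocated - resident ))"
```

Under that assumption the result is 530 allocated and 502 resident huge pages, and the difference of 28 matches HugePages_Rsvd.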
# cat /proc/$(pgrep pmon)/maps | grep rw-s | nl
1 60000000-60a00000 rw-s 00000000 00:0c 0 /SYSV00000000 (deleted)
2 61000000-71000000 rw-s 00000000 00:0c 32769 /SYSV00000000 (deleted)
3 71000000-81000000 rw-s 00000000 00:0c 65538 /SYSV00000000 (deleted)
4 81000000-91000000 rw-s 00000000 00:0c 98307 /SYSV00000000 (deleted)
5 91000000-a1000000 rw-s 00000000 00:0c 131076 /SYSV00000000 (deleted)
6 a1000000-a2000000 rw-s 00000000 00:0c 163845 /SYSV00000000 (deleted)
7 a2000000-a2800000 rw-s 00000000 00:0c 196614 /SYSV00000000 (deleted)
8 a3000000-a3200000 rw-s 00000000 00:0c 229383 /SYSVafa94c20 (deleted)
9 7f9f6222a000-7f9f6222b000 rw-s 00000000 08:11 18861347 /u01/app/oracle/dbs/hc_book.dat
--//Look closely: segments 2, 3, 4, 5, 6 and 7 are contiguous, with no gaps between them.
--//0x61000000-0x60a00000 = 6291456,6291456/2/1024/1024 = 3
--//0xa3000000-0xa2800000 = 8388608, 8388608/2/1024/1024 = 4
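--//The gaps are still 3+4 = 7 huge pages, yet HugePages_Rsvd is now 28, so kimi's explanation is badly wrong. HugePages_Rsvd=28
--//can only mean that Oracle 21c changed the way it touches memory.
3. Continuing:
--//After some digging I found another hidden parameter, _touch_sga_pages_during_allocation (it does not exist in 11g).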
SYS@book> @ hidez _touch_sga_pages_during_allocation
SYS@book> @ pr
==============================
NUM : 179
N_HEX : B3
CON_ID : 0
NAME : _touch_sga_pages_during_allocation
DESCRIPTION : touch SGA pages during allocation
DEFAULT_VALUE : TRUE
SESSION_VALUE : FALSE
SYSTEM_VALUE : FALSE
ISSES_MODIFIABLE : FALSE
ISSYS_MODIFIABLE : FALSE
PL/SQL procedure successfully completed.
$ cat /u01/app/oracle/dbs/initbook.ora
SPFILE='/u01/app/oracle/dbs/spfilebook.ora'
_touch_sga_pages_during_allocation=true
SYS@book> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SYS@book> startup pfile=/u01/app/oracle/dbs/initbook.ora
ORACLE instance started.
Total System Global Area 1107294056 bytes
Fixed Size 9684840 bytes
Variable Size 654311424 bytes
Database Buffers 436207616 bytes
Redo Buffers 7090176 bytes
Database mounted.
Database opened.
# grep -i hugepage /proc/meminfo
AnonHugePages: 120832 kB
HugePages_Total: 530
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
--//This time every huge page is allocated. It seems that all pages are touched only when
--//_touch_sga_pages_during_allocation=true.
# ipcs -m --human
------ Shared Memory Segments --------
key shmid owner perms size nattch status
0x00000000 262144 oracle 600 10M 56
0x00000000 294913 oracle 600 256M 56
0x00000000 327682 oracle 600 256M 56
0x00000000 360451 oracle 600 256M 56
0x00000000 393220 oracle 600 256M 56
0x00000000 425989 oracle 600 16M 56
0x00000000 458758 oracle 600 8M 56
0xafa94c20 491527 oracle 600 2M 56
# cat /proc/$(pgrep pmon)/maps | grep rw-s | nl
1 60000000-60a00000 rw-s 00000000 00:0c 262144 /SYSV00000000 (deleted)
2 61000000-71000000 rw-s 00000000 00:0c 294913 /SYSV00000000 (deleted)
3 71000000-81000000 rw-s 00000000 00:0c 327682 /SYSV00000000 (deleted)
4 81000000-91000000 rw-s 00000000 00:0c 360451 /SYSV00000000 (deleted)
5 91000000-a1000000 rw-s 00000000 00:0c 393220 /SYSV00000000 (deleted)
6 a1000000-a2000000 rw-s 00000000 00:0c 425989 /SYSV00000000 (deleted)
7 a2000000-a2800000 rw-s 00000000 00:0c 458758 /SYSV00000000 (deleted)
8 a3000000-a3200000 rw-s 00000000 00:0c 491527 /SYSVafa94c20 (deleted)
9 7f4ef6faf000-7f4ef6fb0000 rw-s 00000000 08:11 18861347 /u01/app/oracle/dbs/hc_book.dat
4. Summary:
--//Answers from kimi and deepseek are not very reliable: accurate for common questions, but not for complex ones.
--//I will not dig further into the pre_page_sga=true HugePages question; it is beyond my ability.
--//Following a link that kimi supplied: https://fritshoogland.wordpress.com/2016/05/27/oracle-sga-memory-allocation-on-startup/
--//It describes extremely slow startup with sga_target=10T, in a setup with PRE_PAGE_SGA=false and
--//_touch_sga_pages_during_allocation=true.
--//It also says the latter defaults to true in Oracle 12.1.0.2. That is a clue: perhaps 12.1.0.2 defaults to
--//PRE_PAGE_SGA=false (unverified) and _touch_sga_pages_during_allocation=true, while 21c has it the other way round.
--//Excerpt from https://fritshoogland.wordpress.com/2016/05/27/oracle-sga-memory-allocation-on-startup/
At this point the reason for having _TOUCH_SGA_PAGES_DURING_ALLOCATION should be clear. The question I had on this point
is: but how about PRE_PAGE_SGA? In essence, this parameter is supposed to more or less solve the same issue, having the
SGA pages being touched at startup to prevent paging for foreground sessions.
BTW, if you read about PRE_PAGE_SGA in the online documentation, it tells a reason for using PRE_PAGE_SGA, which is not
true (page table entries are prebuilt for the SGA pages), and it indicates the paging (=page faults) are done at
startup, which also is not true. It also claims 'every process that starts must access every page in the SGA', again
this is not true.
From what I can see, what happens when PRE_PAGE_SGA is set to true, is that a background process is started, that starts
touching all SGA pages AFTER the instance has started and is open for usage. The background process I witnessed is
'sa00'. When recording the backtraces of that process, I see:
--//In my tests on 21c, the sa00 background process is started even with PRE_PAGE_SGA=false.
The kernel paging functions are exactly the same as we have seen several times now. It's clear the functions executed by
this process are specifically for the prepage functionality. The pre-paging as done on behalf of
_TOUCH_SGA_PAGES_DURING_ALLOCATION=TRUE is done as part of the SGA creation and allocation (as can be seen by the Oracle
function names). PRE_PAGE_SGA seems to be a 'workaround' if you don't want to spend the long time paging on startup, but
still want to page the memory as soon as possible after startup. Needless to say, this is not the same as
_TOUCH_SGA_PAGES_DURING_ALLOCATION=TRUE, PRE_PAGE_SGA paging is done serially by a single process after startup when the
database is open for usage. So normal foreground process that encounter non-paged memory, which means they use it before
the sa00 process pages it, still need to do the paging.
Conclusion
If you want to allocate a large SGA with Oracle 12.1.0.2 (but may apply to earlier versions too), the startup time could
be significant. The reason for that is the bequeathing session pages the memory on startup. This can be turned off by
setting the undocumented parameter _TOUCH_SGA_PAGES_DURING_ALLOCATION to FALSE. As a result, foreground (normal user)
sessions need to do the paging. You can set PRE_PAGE_SGA parameter to TRUE to do paging, however the paging is done by a
single process (sa00) that serially pages the memory after startup. Foreground processes that encounter non-paged
memory, which means they use it before the sa00 process could page it, need to page it theirselves.
--//Again: in my tests on 21c, sa00 is started even with PRE_PAGE_SGA=false.
5. Wrap-up:
--//Edited /etc/sysctl.d/98-oracle.conf again; steps omitted.
