
Regular Expressions in Depth (C++20)

news 2026/6/29 23:22:29


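1. What Is a Regular Expression

A regular expression (regex for short) is a compact notation for describing string-matching patterns. It is essentially a small domain-specific language whose syntax rules define a set of strings. Regular expressions are widely used for:

Input validation (email addresses, phone numbers, URLs, password strength, and so on)
Text search and extraction (log analysis, data scraping)
Find and replace (sensitive-word filtering, format cleanup)
Lexical analysis in compilers, syntax highlighting, and similar tasks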

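The C++ standard library has provided full regular expression support through the <regex> header since C++11, covering matching, searching, replacing, and iteration. The library itself is essentially unchanged in C++20, but newer facilities such as std::format make the surrounding code cleaner.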

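2. Basic Regex Syntax at a Glance

This section uses the ECMAScript grammar (JavaScript style), which is the default grammar of std::regex. Note that std::regex implements a modified version of an older ECMAScript specification, so some features of modern JavaScript regexes are unavailable. Most general regex knowledge still applies.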

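2.1 Ordinary Characters and Metacharacters

Ordinary characters (letters, digits, spaces, and so on) match themselves.
Metacharacters have special meanings: . ^ $ * + ? { } [ ] \ | ( )

To match a metacharacter literally, escape it with a backslash \. Inside an ordinary C++ string literal the backslash itself must also be escaped, so raw string literals R"(...)" are recommended to avoid escape hell.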
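As a minimal sketch of the escaping rule above, the hypothetical helper below compiles the same pattern twice, once with doubled backslashes and once as a raw string literal. The function name is an illustration only and is not part of <regex>.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `text` has the form "digits.digits", e.g. "3.14".
bool is_decimal(const std::string& text) {
    // Escaped form: every regex backslash is doubled for the C++ compiler.
    static const std::regex escaped("\\d+\\.\\d+");
    // Raw form: the pattern appears exactly as the regex engine sees it.
    static const std::regex raw(R"(\d+\.\d+)");
    // Both objects hold the same pattern, so they always agree.
    return std::regex_match(text, escaped) && std::regex_match(text, raw);
}
```

Here `\.` matches a literal dot, so is_decimal("3.14") holds while is_decimal("3x14") does not.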

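2.2 Character Classes

Pattern	Description
`[abc]`	Matches any one of a, b, or c
`[^abc]`	Matches any one character other than a, b, or c (negation)
`[a-z]`	Matches any lowercase letter from a to z
`.`	Matches any single character except a line terminator
`\d`	Matches a digit; equivalent to `[0-9]`
`\D`	Matches a non-digit; equivalent to `[^0-9]`
`\w`	Matches a word character (letter, digit, or underscore); equivalent to `[A-Za-z0-9_]`
`\W`	Matches a non-word character
`\s`	Matches a whitespace character (space, tab, newline, and so on)
`\S`	Matches a non-whitespace character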
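The classes in the table can be tried out with a short sketch like the one below. Both function names are hypothetical and chosen only for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical validator: a letter or underscore followed by word characters.
bool is_identifier(const std::string& s) {
    static const std::regex re(R"([A-Za-z_]\w*)");
    return std::regex_match(s, re);  // the whole string must match
}

// Hypothetical check: true if `s` contains at least one whitespace character.
bool contains_whitespace(const std::string& s) {
    static const std::regex re(R"(\s)");
    return std::regex_search(s, re);  // any position may match
}
```

For instance, is_identifier("_tmp1") holds, while is_identifier("1abc") fails because the first character must come from `[A-Za-z_]`.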

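2.3 Quantifiers (Repetition)

Pattern	Description
`*`	The preceding expression occurs 0 or more times
`+`	The preceding expression occurs 1 or more times
`?`	The preceding expression occurs 0 or 1 time
`{n}`	The preceding expression occurs exactly n times
`{n,}`	The preceding expression occurs at least n times
`{n,m}`	The preceding expression occurs between n and m times

Quantifiers are greedy by default. Appending ? makes them non-greedy (for example *?, +?, ??).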
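The difference between greedy and non-greedy matching is easiest to see on a concrete input. The sketch below uses a hypothetical helper, first_match, that returns the first match of a pattern or an empty string.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first match of `pattern` in `text`,
// or an empty string when nothing matches.
std::string first_match(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str() : std::string{};
}
```

On the input "<b>bold</b>", the greedy pattern `<.+>` consumes as much as possible and yields the whole string, whereas the non-greedy `<.+?>` stops at the first `>` and yields "<b>".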

2.4 Anchors

Pattern	Description
`^`	Matches the start of the string
`$`	Matches the end of the string
`\b`	Matches a word boundary
`\B`	Matches a non-word boundary
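Anchors match positions rather than characters. A small sketch, with hypothetical function names, shows `\b` and `^` in use:

```cpp
#include <regex>
#include <string>

// Hypothetical check: true if "cat" occurs as a whole word in `text`.
bool contains_word_cat(const std::string& text) {
    static const std::regex re(R"(\bcat\b)");
    return std::regex_search(text, re);
}

// Hypothetical check: true if `text` begins with "ERROR".
bool starts_with_error(const std::string& text) {
    static const std::regex re(R"(^ERROR)");
    return std::regex_search(text, re);
}
```

contains_word_cat("a cat sat") holds, but contains_word_cat("concatenate") does not, because the "cat" inside it has word characters on both sides and therefore no boundary.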

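2.5 Grouping and Capturing

(pattern): capturing group; matches and captures the content, which can then be accessed by number.
(?:pattern): non-capturing group; matches without capturing and produces no backreference.
\1, \2, ...: backreference; matches the same text as the n-th capturing group.
(?'name'pattern) or (?<name>pattern): named capturing groups as found in other engines such as .NET and PCRE. std::regex does not support named groups, and a pattern that contains one is rejected with std::regex_error; use numbered groups instead.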
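Numbered groups and backreferences can be combined, as in this minimal sketch that looks for an accidentally doubled word. The helper name find_doubled_word is hypothetical.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper: returns the first immediately repeated word
// (as in "is is"), or std::nullopt if there is none.
std::optional<std::string> find_doubled_word(const std::string& text) {
    // Group 1 captures a word; \1 then requires the same text again.
    static const std::regex re(R"(\b(\w+)\s+\1\b)");
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        return m[1].str();  // m[0] is the whole match, m[1] is group 1
    }
    return std::nullopt;
}
```

For "this is is a test" the function returns "is". The leading `\b` prevents the "is" inside "this" from being treated as the first word.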

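2.6 Zero-Width Assertions

Pattern	Description
`(?=p)`	Positive lookahead: requires that p follows, without consuming characters
`(?!p)`	Negative lookahead: requires that p does not follow
`(?<=p)`	Positive lookbehind: requires that p precedes (not supported by `std::regex`)
`(?<!p)`	Negative lookbehind: requires that p does not precede (not supported by `std::regex`)

std::regex supports lookahead only. Lookbehind is not part of its ECMAScript grammar, and a pattern that contains it is rejected with std::regex_error.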
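The sketch below shows lookahead in a password check, plus a common workaround for cases where one would reach for lookbehind in std::regex: match the prefix as ordinary text and capture only the part that is needed. Both function names and the password rule are assumptions made for illustration.

```cpp
#include <optional>
#include <regex>
#include <string>

// Hypothetical rule: at least 8 characters, containing a digit and a letter.
// Each lookahead checks a condition without consuming any input.
bool is_strong_password(const std::string& s) {
    static const std::regex re(R"((?=.*\d)(?=.*[A-Za-z]).{8,})");
    return std::regex_match(s, re);
}

// Stand-in for a lookbehind such as (?<=\$)\d+ : match the "$" normally
// and read the digits from capture group 1.
std::optional<std::string> dollar_amount(const std::string& s) {
    static const std::regex re(R"(\$(\d+))");
    std::smatch m;
    if (std::regex_search(s, m, re)) {
        return m[1].str();
    }
    return std::nullopt;
}
```

dollar_amount("cost: $42") yields "42", which is what a lookbehind-based pattern would have matched in an engine that supports it.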

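3. Core Components of the C++20 Regex Library

3.1 Header and Main Classes

    #include <regex>

std::regex: holds a compiled regular expression (an alias for std::basic_regex<char>).
std::wregex: the wide-character counterpart (std::basic_regex<wchar_t>).
std::cmatch / std::smatch: match result containers for C-style strings and std::string, respectively.
std::sub_match: a sub-match result that represents one capturing group.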
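To show how these types fit together, here is a minimal sketch that parses a "key = value" line. KeyValue and parse_key_value are hypothetical names introduced for this example only.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for this example.
struct KeyValue {
    std::string key;
    std::string value;
    bool ok = false;
};

KeyValue parse_key_value(const std::string& line) {
    static const std::regex re(R"((\w+)\s*=\s*(\w+))");
    std::smatch m;  // match results over std::string iterators
    KeyValue kv;
    if (std::regex_match(line, m, re)) {
        const std::ssub_match& key = m[1];  // sub_match for capture group 1
        kv.key = key.str();
        kv.value = m[2].str();
        kv.ok = true;
    }
    return kv;
}
```

std::smatch is indexed by group number: index 0 is the whole match, and each further index returns the std::ssub_match of the corresponding capturing group.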

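3.2 Common Matching Functions

Function	Purpose
`std::regex_match`	Checks whether the entire string matches the regular expression.
`std::regex_search`	Searches the string for any substring that matches the regular expression.
`std::regex_replace`	Replaces matched substrings according to a format string.

All three functions accept std::regex_constants::match_flag_type flags that control their behavior.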
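The three functions behave quite differently on the same pattern. The following sketch applies `\d+` through each of them; the helper names are hypothetical.

```cpp
#include <regex>
#include <string>

// Shared pattern for the hypothetical helpers below: one or more digits.
const std::regex& digits() {
    static const std::regex re(R"(\d+)");
    return re;
}

// regex_match: the entire string must consist of digits.
bool is_all_digits(const std::string& s) {
    return std::regex_match(s, digits());
}

// regex_search: a run of digits anywhere in the string is enough.
bool has_digits(const std::string& s) {
    return std::regex_search(s, digits());
}

// regex_replace: every run of digits is replaced by "#".
std::string mask_digits(const std::string& s) {
    return std::regex_replace(s, digits(), "#");
}
```

For "abc123", is_all_digits is false but has_digits is true, and mask_digits("a1b22c") produces "a#b#c".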

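3.3 Compilation Flags

Syntax options and optimization flags can be passed when constructing a std::regex. The common ones are:

    std::regex pattern("...", std::regex_constants::ECMAScript | std::regex_constants::optimize);

ECMAScript: the default grammar, similar to JavaScript.
basic, extended, awk, grep, egrep: alternative grammars.
icase: ignore case.
optimize: hints that the engine should favor matching speed over construction speed; suited to patterns that are matched many times.
multiline (since C++17): makes ^ and $ match at the start and end of each line rather than only of the whole input. It applies to the ECMAScript grammar only, and support varies between standard library implementations.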
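As a small illustration of combining flags, the sketch below accepts a yes/no style answer regardless of case. The function name is_yes is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical check: accepts "y" or "yes" in any letter case.
bool is_yes(const std::string& answer) {
    static const std::regex re(
        "yes|y",
        std::regex_constants::ECMAScript | std::regex_constants::icase);
    return std::regex_match(answer, re);
}
```

Flags are combined with the bitwise OR operator. When any flag is passed explicitly, it is clearest to name the grammar as well, as done here with ECMAScript.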

3.4 Iterators

std::regex_iterator: iterates over all matches in a string.
std::regex_token_iterator: iterates over matches or over specific capturing groups; often used to split strings.
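String splitting with std::regex_token_iterator can be sketched as follows. The helper split_csv is a hypothetical name, and the separator pattern (a comma with optional surrounding whitespace) is an assumption for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` on commas, trimming whitespace
// around each comma.
std::vector<std::string> split_csv(const std::string& text) {
    static const std::regex sep(R"(\s*,\s*)");
    // Submatch index -1 selects the text between matches
    // rather than the matches themselves.
    std::sregex_token_iterator first(text.begin(), text.end(), sep, -1);
    std::sregex_token_iterator last;  // default-constructed end iterator
    return std::vector<std::string>(first, last);
}
```

split_csv("a, b ,c") returns the three tokens "a", "b", and "c". The iterators refer into `text`, so the string must outlive them; copying the tokens into the vector before returning takes care of that.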

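4. Guidelines for Safe and Elegant C++20 Code

4.1 Write Patterns as Raw String Literals

Regex patterns contain many backslashes. With an ordinary string literal you have to write "\\d{3}", which is error-prone and hard to maintain. Prefer R"(...)":

    auto phone_pattern = std::regex(R"(\d{3}-\d{4})"); // clear and direct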

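4.2 Avoid Recompiling Regex Objects

Constructing a std::regex compiles the pattern, which is relatively expensive. For a fixed pattern, declare the regex object static const so that it is compiled only once. Initialization of function-local statics has been thread-safe since C++11.

    static const std::regex email_regex(R"(^[\w.+-]+@[\w-]+\.[\w.-]+$)");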

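4.3 Exception Handling

Syntax errors, unsupported features, and resource problems such as insufficient memory make std::regex throw std::regex_error. Matching can also throw it, for example when complexity or stack limits are exceeded. Robust code should catch this exception:

    try {
        static const std::regex re(R"(\d+)");
        // ... use re here ...
    } catch (const std::regex_error& e) {
        // code() returns an enumeration value, so cast it before printing
        std::cerr << "Regex error: " << e.what()
                  << " (code: " << static_cast<int>(e.code()) << ")\n";
        // handle the error as appropriate
    }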

4.4 Use `std::format` (C++20) for Output

std::format makes printing results more readable and avoids long chains of stream operators.

    #include <format>
    #include <iostream>
    // ...
    std::cout << std::format("Match found at position {}: {}\n", match.position(), match.str());

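4.5 Wrap Common Operations in Reusable Functions

For example, wrap validation in a function that returns bool, or wrap extraction in a function that returns std::optional or std::vector. This keeps regex details in one place and gives callers a safe, simple interface.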

5. Complete Code Examples

5.1 Validating an Email Address

    #include <format>
    #include <iostream>
    #include <regex>
    #include <string>
    #include <string_view>

    bool is_valid_email(std::string_view email) {
        // A common, simplified email pattern
        static const std::regex pattern(
            R"(^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$)",
            std::regex_constants::ECMAScript | std::regex_constants::optimize
        );
        try {
            return std::regex_match(email.begin(), email.end(), pattern);
        } catch (const std::regex_error&) {
            // Matching can still throw (for example on complexity limits),
            // so treat any failure as "not valid"
            return false;
        }
    }

    int main() {
        std::string test = "user@example.com";
        std::cout << std::format("'{}' is valid: {}\n", test, is_valid_email(test));
        test = "not-an-email";
        std::cout << std::format("'{}' is valid: {}\n", test, is_valid_email(test));
    }

5.2 Extracting the Date from a Log Line

Suppose a log line has the form [2026-06-03 14:30:00] ERROR: message and we want to extract the date part.

    #include <format>
    #include <iostream>
    #include <optional>
    #include <regex>
    #include <string>

    std::optional<std::string> extract_date(const std::string& log_line) {
        // Capturing group: the parenthesized part is the date, YYYY-MM-DD
        static const std::regex date_regex(
            R"(\[(\d{4}-\d{2}-\d{2})\s)",
            std::regex_constants::optimize
        );
        std::smatch match;
        if (std::regex_search(log_line, match, date_regex) && match.size() > 1) {
            return match[1].str(); // first capturing group
        }
        return std::nullopt;
    }

    int main() {
        std::string log = "[2026-06-03 14:30:00] ERROR: Disk full";
        if (auto date = extract_date(log)) {
            std::cout << std::format("Extracted date: {}\n", *date);
        } else {
            std::cout << "No date found.\n";
        }
    }

5.3 Replacing Sensitive Words

Replace every occurrence of a sensitive word with ***, ignoring case.

    #include <format>
    #include <iostream>
    #include <regex>
    #include <string>

    std::string censor_text(std::string text, const std::string& forbidden_word) {
        // The regex is built at run time here for demonstration; prefer a
        // static object when the pattern is fixed. Note that forbidden_word
        // is interpreted as a regex pattern, so metacharacters in it are not
        // matched literally.
        try {
            std::regex word_regex(forbidden_word,
                                  std::regex_constants::ECMAScript |
                                  std::regex_constants::icase);
            return std::regex_replace(text, word_regex, "***");
        } catch (const std::regex_error& e) {
            std::cerr << std::format("Regex error: {}\n", e.what());
            return text; // return the original string on failure
        }
    }

    int main() {
        std::string message = "You are an idiot, IDIOT!";
        std::string clean = censor_text(message, "idiot");
        std::cout << std::format("Censored: {}\n", clean);
    }
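When the forbidden word comes from user input, it may contain regex metacharacters such as + or ., which would be interpreted as pattern syntax. One possible approach, sketched below under the assumption that the default ECMAScript grammar is in use, is to backslash-escape those characters first. The helper escape_regex is hypothetical and not part of the standard library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: backslash-escapes ECMAScript metacharacters so that
// `literal` can be embedded in a pattern and match itself.
std::string escape_regex(const std::string& literal) {
    // A character class that lists every metacharacter.
    static const std::regex meta(R"([\\^$.|?*+()\[\]{}])");
    // In the replacement format, $& stands for the matched text, so each
    // metacharacter is rewritten as a backslash followed by itself.
    return std::regex_replace(literal, meta, R"(\$&)");
}
```

With this in place, a word such as "c++" becomes the pattern `c\+\+`, which can be passed to the std::regex constructor and matches the literal text.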

5.4 Iterating over All Matches (Extracting Every Number)

    #include <cstddef>
    #include <format>
    #include <iostream>
    #include <regex>
    #include <string>
    #include <vector>

    std::vector<int> extract_all_numbers(const std::string& input) {
        static const std::regex number_regex(R"(\d+)", std::regex_constants::optimize);
        std::vector<int> numbers;
        // regex_iterator visits every match in turn
        auto begin = std::sregex_iterator(input.begin(), input.end(), number_regex);
        auto end = std::sregex_iterator();
        for (auto it = begin; it != end; ++it) {
            // std::stoi throws std::out_of_range if the value does not fit in int
            numbers.push_back(std::stoi(it->str()));
        }
        return numbers;
    }

    int main() {
        std::string data = "Price: 42, Discount: 15, Items: 3.";
        auto nums = extract_all_numbers(data);
        for (std::size_t i = 0; i < nums.size(); ++i) {
            std::cout << std::format("Number {}: {}\n", i + 1, nums[i]);
        }
    }