Python Regular Expressions in Practice: A Complete Guide to Extracting Information Efficiently

news 2026/3/28 23:20:23

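Data processing and analysis often require extracting specific information from unstructured text. Regular expressions are a powerful text-processing tool for locating and extracting content that matches a given pattern. This article walks through practical examples of using regular expressions in Python for information extraction.

Regular Expression Basics

What Is a Regular Expression

A regular expression is a pattern-matching tool for text: it uses specific combinations of characters to describe a search pattern. Python supports regular expressions through the built-in re module.

Basic Syntax Quick Reference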

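Metacharacter	Description
`.`	Matches any single character except a newline
`^`	Matches the start of the string
`$`	Matches the end of the string
`*`	Matches the preceding element zero or more times
`+`	Matches the preceding element one or more times
`?`	Matches the preceding element zero or one time
`\d`	Matches a digit
`\w`	Matches a letter, digit, or underscore
`\s`	Matches a whitespace character
`[]`	Matches any one character listed inside the brackets
`|`	Matches either the expression on its left or the one on its right
`()`	Groups a subpattern and captures its match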
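To make the table concrete, here is a minimal sketch that applies several of these metacharacters to one string. The sample string and variable names are made up for illustration only.

```python
import re

# Hypothetical sample string, used only to illustrate the metacharacters.
sample = "Order 66: 3 apples, 12 pears"

numbers = re.findall(r"\d+", sample)                         # \d and +: runs of digits
words = re.findall(r"[ap]\w+", sample)                       # [] and \w: words starting with 'a' or 'p'
starts_with_order = re.search(r"^Order", sample) is not None  # ^: anchored at the start
ends_with_pears = re.search(r"pears$", sample) is not None    # $: anchored at the end
fruits = re.findall(r"apples?|pears?", sample)               # | and ?: alternation, optional 's'
grouped = re.search(r"(\d+) (apples)", sample).groups()      # (): two captured groups

print(numbers, words, fruits, grouped)
```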

Steps for Using Regular Expressions in Python

1. Import the re module

python

import re

2. Compile the regular expression

python

pattern = re.compile(r'\d+')  # matches one or more digits

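3. Use a matching method

match(): matches only at the beginning of the string
search(): scans the whole string and returns the first match
findall(): returns a list of all non-overlapping matches
finditer(): returns an iterator of match objects
sub(): replaces the matched content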
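The differences between these methods are easiest to see side by side. The sketch below runs all five against the same compiled pattern; the sample text is an assumption made for illustration.

```python
import re

pattern = re.compile(r"\d+")
text = "room 101, floor 3, desk 42"  # hypothetical sample text

at_start = pattern.match(text)       # None: the text does not start with a digit
first = pattern.search(text)         # match object for the first run of digits
all_numbers = pattern.findall(text)  # list of matched strings
# finditer yields match objects, so positions are available as well
spans = [(m.group(), m.start()) for m in pattern.finditer(text)]
masked = pattern.sub("#", text)      # every match replaced with '#'

print(at_start, first.group(), all_numbers, spans, masked)
```

As a rule of thumb, findall() is enough when only the matched strings are needed, while finditer() is the better fit when positions or groups matter.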

Practical Examples

Example 1: Extracting Email Addresses

python

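import re

text = """
Please send feedback to support@example.com or sales@company.org.
For emergencies, contact admin@test.net.
"""

# Method 1: findall returns every matching address.
# The domain part must contain a dot and cannot end with one,
# so sentence punctuation after an address is not captured.
emails = re.findall(r'[\w.-]+@[\w-]+(?:\.[\w-]+)+', text)
print(emails)  # Output: ['support@example.com', 'sales@company.org', 'admin@test.net']

# Method 2: search with capture groups (returns only the first match)
pattern = re.compile(r'([\w.-]+)@([\w-]+(?:\.[\w-]+)+)')
match = pattern.search(text)
if match:
    print(f"Full address: {match.group(0)}")
    print(f"Username: {match.group(1)}")
    print(f"Domain: {match.group(2)}")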
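Since search() stops at the first match, splitting every address into user name and domain calls for a loop. One possible sketch uses finditer(); the sample text and the name email_pattern are assumptions, and the pattern is a deliberately simple approximation rather than a full RFC 5322 validator.

```python
import re

# Hypothetical sample text for illustration.
text = (
    "Send feedback to support@example.com or sales@company.org; "
    "urgent issues go to admin@test.net."
)

# Simplified pattern: the domain needs at least one dot and cannot end with one.
email_pattern = re.compile(r"([\w.-]+)@([\w-]+(?:\.[\w-]+)+)")

# Each match object exposes the two capture groups.
accounts = [(m.group(1), m.group(2)) for m in email_pattern.finditer(text)]

for user, domain in accounts:
    print(f"user={user}  domain={domain}")
```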

Example 2: Extracting Dates

python

import re

log = """
System start time: 2023-05-15 08:30:22
Last update: 2023/06/20 14:15:00
Error occurred at: 15.07.2023 09:45:10
"""

# Match three date formats
date_patterns = [
    r'\d{4}-\d{2}-\d{2}',   # YYYY-MM-DD
    r'\d{4}/\d{2}/\d{2}',   # YYYY/MM/DD
    r'\d{2}\.\d{2}\.\d{4}'  # DD.MM.YYYY
]

for pattern in date_patterns:
    dates = re.findall(pattern, log)
    print(f"Found {len(dates)} date(s) matching {pattern}:", dates)
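A regex only confirms that text looks like a date. A possible follow-up step, sketched below, pairs each pattern with a strptime format so the matches become real date objects and impossible values are discarded. DATE_FORMATS and extract_dates are hypothetical names introduced for this sketch.

```python
import re
from datetime import datetime

log = """
System start time: 2023-05-15 08:30:22
Last update: 2023/06/20 14:15:00
Error occurred at: 15.07.2023 09:45:10
"""

# Hypothetical table pairing each regex with the matching strptime format.
DATE_FORMATS = [
    (r"\d{4}-\d{2}-\d{2}", "%Y-%m-%d"),
    (r"\d{4}/\d{2}/\d{2}", "%Y/%m/%d"),
    (r"\d{2}\.\d{2}\.\d{4}", "%d.%m.%Y"),
]

def extract_dates(text):
    """Return the dates found in text, in order of appearance."""
    found = []
    for regex, fmt in DATE_FORMATS:
        for m in re.finditer(regex, text):
            try:
                parsed = datetime.strptime(m.group(), fmt).date()
            except ValueError:
                continue  # looked like a date but is not valid (e.g. month 13)
            found.append((m.start(), parsed))
    return [d for _, d in sorted(found)]

dates = extract_dates(log)
print([d.isoformat() for d in dates])
```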

Example 3: Extracting HTML Tag Content

python

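import re

html = """
<div class="content">
  <h1>Title</h1>
  <p>First paragraph</p>
  <p>Second paragraph</p>
  <a href="https://example.com">Link</a>
</div>
"""

# Extract the content of every <p> tag
p_contents = re.findall(r'<p>(.*?)</p>', html, re.DOTALL)
print("Paragraph contents:", p_contents)

# Extract tags and their content.
# findall returns non-overlapping matches, so the outer <div> consumes its
# children and is the only element reported here.
# [a-z][a-z0-9]* also allows tag names that contain digits, such as h1.
tags = re.findall(r'<([a-z][a-z0-9]*)[^>]*>(.*?)</\1>', html, re.DOTALL)
print("\nTags and their content:")
for tag, content in tags:
    print(f"<{tag}>: {content.strip()}")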
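Regular expressions struggle with nested markup, because a match on an outer element swallows its children. As an alternative technique (a parser, not a regex), the sketch below uses the standard library's html.parser to collect the text of each element. TagTextCollector is a hypothetical helper, and it assumes well-nested tags with no void elements such as a bare `<br>`.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect (tag, text) pairs for each element that directly contains text.

    Simplification: assumes well-nested tags and no void elements like <br>.
    """

    def __init__(self):
        super().__init__()
        self._stack = []  # currently open tags, innermost last
        self.items = []   # collected (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.items.append((self._stack[-1], text))

html = """
<div class="content">
  <h1>Title</h1>
  <p>First paragraph</p>
  <p>Second paragraph</p>
  <a href="https://example.com">Link</a>
</div>
"""

collector = TagTextCollector()
collector.feed(html)
collector.close()
print(collector.items)
```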

Example 4: Analyzing Complex Logs

python

import re

log_entry = """
[2023-07-20 14:30:45] ERROR: [User:12345] Database connection failed - Timeout expired
[2023-07-20 14:31:10] WARNING: [User:67890] Low disk space (5% remaining)
"""

# Define the pattern. re.MULTILINE makes $ match at the end of every line;
# without it, only the last entry would match.
pattern = re.compile(
    r'\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] '  # timestamp
    r'(\w+): '                                      # log level
    r'\[User:(\d+)\] '                              # user ID
    r'(.*?)$',                                      # log message
    re.MULTILINE,
)

# Extract every log entry
for match in pattern.finditer(log_entry):
    timestamp, level, user_id, message = match.groups()
    print(f"Time: {timestamp}")
    print(f"Level: {level}")
    print(f"User: {user_id}")
    print(f"Message: {message}\n")
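Once each line is parsed, the fields are usually aggregated. The sketch below is one way to do that: it turns every entry into a dictionary and counts entries per level. The third log line, LINE_RE, and the group names are assumptions added for illustration.

```python
import re
from collections import Counter

# Sample log; the third line is hypothetical and added only for the count.
log_text = """\
[2023-07-20 14:30:45] ERROR: [User:12345] Database connection failed - Timeout expired
[2023-07-20 14:31:10] WARNING: [User:67890] Low disk space (5% remaining)
[2023-07-20 14:32:02] ERROR: [User:12345] Retry failed
"""

# re.MULTILINE lets ^ and $ anchor at every line, not just the whole string.
LINE_RE = re.compile(
    r"^\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] "
    r"(?P<level>\w+): "
    r"\[User:(?P<user_id>\d+)\] "
    r"(?P<message>.*)$",
    re.MULTILINE,
)

# groupdict() maps each named group to its matched text.
records = [m.groupdict() for m in LINE_RE.finditer(log_text)]
level_counts = Counter(record["level"] for record in records)

print(level_counts)
```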

Advanced Techniques

1. Using named groups

python

import re

text = "John Doe, 35 years old, lives in New York"
pattern = re.compile(
    r'(?P<name>[\w\s]+), '
    r'(?P<age>\d+) years old, '
    r'lives in (?P<city>[\w\s]+)'
)

match = pattern.search(text)
if match:
    print(f"Name: {match.group('name')}")
    print(f"Age: {match.group('age')}")
    print(f"City: {match.group('city')}")
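Named groups are also available in replacement strings through the `\g<name>` syntax, and groupdict() returns all of them at once. A small sketch, with a made-up sample string and the hypothetical name date_re:

```python
import re

date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
text = "Released 2023-07-20, patched 2023-08-01"  # hypothetical sample text

# \g<name> refers to a named group inside the replacement string.
reordered = date_re.sub(r"\g<day>/\g<month>/\g<year>", text)

# groupdict() returns every named group of a match as a dictionary.
info = date_re.search(text).groupdict()

print(reordered)
print(info)
```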

2. Non-greedy matching

python

import re

text = "<div>First</div><div>Second</div>"

# Greedy matching (the default)
greedy = re.findall(r'<div>(.*)</div>', text)
print("Greedy:", greedy)  # Output: ['First</div><div>Second']

# Non-greedy matching
non_greedy = re.findall(r'<div>(.*?)</div>', text)
print("Non-greedy:", non_greedy)  # Output: ['First', 'Second']
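A negated character class is a common alternative to the lazy quantifier. The sketch below compares the two; the multi-line sample string is an assumption added to show where they differ.

```python
import re

text = "<div>First</div><div>Second</div>"

lazy = re.findall(r"<div>(.*?)</div>", text)      # lazy quantifier
negated = re.findall(r"<div>([^<]*)</div>", text)  # anything except '<'

# Hypothetical content that spans two lines.
multiline = "<div>line one\nline two</div>"

# '.' does not match a newline unless re.DOTALL is set, so this finds nothing.
lazy_multiline = re.findall(r"<div>(.*?)</div>", multiline)
# A negated class matches newlines too, so this finds the content.
negated_multiline = re.findall(r"<div>([^<]*)</div>", multiline)

print(lazy, negated, lazy_multiline, negated_multiline)
```

The trade-off is that `[^<]*` cannot match content that itself contains a tag, whereas the lazy form can.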

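3. Precompiling regular expressions

For a regular expression that is used many times, precompiling it with re.compile() avoids repeated pattern lookups and keeps the pattern in one named place. The re module also caches recently used patterns, so the speed gain is usually modest: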

python

import re

# Precompile the regular expression
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

# Reuse it several times
text1 = "My phone number is 123-456-7890"
text2 = "Contact us: 987-654-3210"

for text in [text1, text2]:
    match = phone_pattern.search(text)
    if match:
        print(f"Full number: {match.group(0)}")
        print(f"Area code: {match.group(1)}")
        print(f"First three digits: {match.group(2)}")
        print(f"Last four digits: {match.group(3)}\n")
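A compiled pattern can be shared by several helpers. The sketch below reuses one pattern for reformatting with numbered backreferences and for validating a whole string with fullmatch(). PHONE_RE, format_phones, and is_phone are hypothetical names, and the output format is an arbitrary choice.

```python
import re

# Compiled once at module level and shared by both helpers.
PHONE_RE = re.compile(r"(\d{3})-(\d{3})-(\d{4})")

def format_phones(text):
    """Rewrite every 123-456-7890 style number as (123) 456-7890."""
    return PHONE_RE.sub(r"(\1) \2-\3", text)

def is_phone(value):
    """Return True only if the whole string is a phone number."""
    return PHONE_RE.fullmatch(value) is not None

formatted = format_phones("Call 123-456-7890 or 987-654-3210")
print(formatted)
print(is_phone("123-456-7890"), is_phone("x123-456-7890"))
```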