
Kaggle Heart Disease Prediction in Practice: The Complete Python Workflow from EDA to Model Deployment (with Code Pitfalls to Avoid)

news 2026/6/3 8:12:56

Kaggle Heart Disease Prediction in Practice: A Complete Python Guide from Data Exploration to Model Deployment

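The first time you open the heart disease dataset on Kaggle, the medical terminology and the 14 columns can be bewildering. The dataset contains 303 samples, each recording health indicators that range from age and sex to electrocardiogram measurements. As data science enthusiasts we need to understand the medical meaning behind these values and, more importantly, learn how to turn the raw data into a usable predictive model. This article walks through the whole process, from data cleaning and exploratory analysis to model training and deployment, with particular attention to practical details that are easy to overlook but matter a great deal.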

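1. Understanding and Preprocessing the Data: Building a Solid Foundation for Modeling

1.1 Loading the Data and First Checks

The first step in any data analysis project is to get to know your data thoroughly. After loading it with pandas, give it a full "check-up":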

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Load the dataset
    heart_df = pd.read_csv("heart.csv")

    # Basic checklist
    print(f"Dataset shape: {heart_df.shape}")
    print("\nFirst 5 rows:")
    print(heart_df.head())
    print("\nData types and missing values:")
    heart_df.info()  # info() prints its report directly and returns None
    print("\nDescriptive statistics:")
    print(heart_df.describe())

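Key findings and how to handle them:

The ca and thal columns are stored as numbers but are actually categorical variables.
Check oldpeak (ST depression) for negative values; any that appear should be confirmed as genuine measurements or data-entry errors before modeling.
Although isnull().sum() reports no missing values, a 0 in some columns may be a placeholder for a missing value.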
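The checks in the list above can be automated with a small helper. The following is a minimal sketch, assuming you decide yourself which columns should never be zero and which should never be negative; the function name `audit_suspicious_values` and the toy frame are hypothetical and are not rows from heart.csv.

```python
import pandas as pd


def audit_suspicious_values(df, zero_suspect_cols, non_negative_cols):
    """Count values that may be placeholders or data-entry errors."""
    report = {}
    for col in zero_suspect_cols:
        # A zero here is physiologically implausible, so it may stand for "missing"
        report[f"{col}_zeros"] = int((df[col] == 0).sum())
    for col in non_negative_cols:
        # Negative values need a manual look before modeling
        report[f"{col}_negatives"] = int((df[col] < 0).sum())
    return report


# Made-up values, for illustration only
sample = pd.DataFrame({
    "chol": [233, 0, 250, 204],
    "trestbps": [145, 130, 0, 120],
    "oldpeak": [2.3, -0.5, 1.4, 0.0],
})
report = audit_suspicious_values(sample, ["chol", "trestbps"], ["oldpeak"])
print(report)
```

On the real data you would pass `heart_df` and the column lists that make sense for your copy of the file.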

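1.2 Special Handling for Categorical Variables

Categorical variables in medical data often carry important information, but they need careful encoding: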

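    # Mapping dictionaries for the categorical variables.
    # Verify these code-to-label assignments against the documentation of the
    # heart.csv you downloaded; encodings vary between published versions.
    category_maps = {
        'sex': {0: 'female', 1: 'male'},
        'cp': {0: 'asymptomatic', 1: 'typical', 2: 'atypical', 3: 'non-anginal'},
        'fbs': {0: '<=120', 1: '>120'},
        'exang': {0: 'no', 1: 'yes'},
        'thal': {0: 'missing', 1: 'normal', 2: 'fixed', 3: 'reversible'}
    }

    # Apply the mappings (a code that is absent from its dictionary becomes NaN)
    for col, mapping in category_maps.items():
        heart_df[col] = heart_df[col].map(mapping)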

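Note: with medical data it is essential to preserve the meaning of the original category labels, because this supports later interpretation of the results and clinical validation.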
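One way to protect label semantics is to refuse to map a column when it contains a code the dictionary does not cover, since `Series.map` would otherwise turn that code into NaN without any warning. This is a minimal sketch; `map_checked` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def map_checked(series, mapping):
    """Map codes to labels; raise if any non-missing code has no label."""
    unmapped = sorted(set(series.dropna().unique()) - set(mapping))
    if unmapped:
        raise ValueError(f"{series.name}: unmapped codes {unmapped}")
    # A categorical dtype keeps the label set explicit for later analysis
    return series.map(mapping).astype("category")


sex = pd.Series([0, 1, 1, 0], name="sex")
labeled = map_checked(sex, {0: "female", 1: "male"})
print(labeled.tolist())
```

Swapping `heart_df[col].map(mapping)` for a checked variant like this makes an unexpected code fail loudly instead of becoming a hidden missing value.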

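1.3 Feature Engineering: From Raw Data to More Meaningful Features

Drawing on medical knowledge, we can create derived features that may carry more predictive signal: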

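    # Age bands
    heart_df['age_group'] = pd.cut(
        heart_df['age'], bins=[0, 40, 55, 65, 100],
        labels=['young', 'middle-aged', 'late middle-aged', 'elderly'])

    # Blood pressure categories based on resting blood pressure (trestbps)
    heart_df['bp_category'] = pd.cut(
        heart_df['trestbps'], bins=[0, 90, 120, 140, 200],
        labels=['hypotension', 'normal', 'prehypertension', 'hypertension'])

    # Cholesterol-to-age ratio (an exploratory feature, not a standard clinical index)
    heart_df['chol_ratio'] = heart_df['chol'] / heart_df['age']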
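A detail worth knowing about `pd.cut`: a value that falls outside the outermost bin edges is returned as NaN rather than being placed in the nearest bin. The sketch below compares a closed top edge with an open-ended one, using made-up readings and the same cut points as the blood pressure example.

```python
import pandas as pd

labels = ["hypotension", "normal", "prehypertension", "hypertension"]
bp = pd.Series([85, 120, 135, 200, 210], name="trestbps")  # made-up readings

# Top edge fixed at 200: the reading of 210 falls outside every bin
closed = pd.cut(bp, bins=[0, 90, 120, 140, 200], labels=labels)

# Open-ended top bin: anything above 140 counts as "hypertension"
open_ended = pd.cut(bp, bins=[0, 90, 120, 140, float("inf")], labels=labels)

print(closed.isna().sum(), open_ended.isna().sum())
```

Whether an open-ended bin is appropriate depends on whether out-of-range readings are plausible values or errors that should be reviewed.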

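2. Exploratory Data Analysis (EDA): Finding the Story Behind the Data

2.1 Target Distribution and Class Balance

Heart disease prediction is a typical binary classification problem, so start by checking the class distribution: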

    plt.figure(figsize=(10, 5))
    ax = sns.countplot(x='target', data=heart_df)
    plt.title('Distribution of heart disease diagnoses')

    # Annotate each bar with its share of all samples
    for p in ax.patches:
        ax.annotate(f'{p.get_height()/len(heart_df):.1%}',
                    (p.get_x() + p.get_width()/2., p.get_height()),
                    ha='center', va='center',
                    xytext=(0, 10), textcoords='offset points')
    plt.show()

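Results:

Positive samples (disease): 165 cases (54.5%)
Negative samples (no disease): 138 cases (45.5%)
The classes are roughly balanced, so neither oversampling nor undersampling is needed.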
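The percentages above follow directly from the class counts. A minimal sketch using only the standard library, with the 165/138 split taken from the text; the helper name `class_balance` is hypothetical.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share and the majority/minority count ratio."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {cls: n / total for cls, n in counts.items()}
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return shares, imbalance_ratio


labels = [1] * 165 + [0] * 138  # counts reported above
shares, ratio = class_balance(labels)
print({cls: f"{share:.1%}" for cls, share in shares.items()}, f"ratio={ratio:.2f}")
```

A ratio this close to 1 is why resampling is skipped here; a much larger ratio would be a reason to revisit that choice.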

2.2 How Key Features Relate to the Target

Use visualizations to explore how each feature relates to the heart disease diagnosis:

    # Plot style ('seaborn' is no longer a valid matplotlib style name in recent releases)
    sns.set_theme()

    # Grid of feature-vs-target plots
    fig, axes = plt.subplots(3, 3, figsize=(18, 15))

    # Age vs diagnosis
    sns.boxplot(x='target', y='age', data=heart_df, ax=axes[0,0])
    axes[0,0].set_title('Age vs diagnosis')

    # Sex vs diagnosis
    sns.countplot(x='sex', hue='target', data=heart_df, ax=axes[0,1])
    axes[0,1].set_title('Sex vs diagnosis')

    # Chest pain type vs diagnosis
    sns.countplot(x='cp', hue='target', data=heart_df, ax=axes[0,2])
    axes[0,2].set_title('Chest pain type vs diagnosis')
    axes[0,2].tick_params(axis='x', rotation=15)

    # Maximum heart rate vs diagnosis
    sns.violinplot(x='target', y='thalach', data=heart_df, ax=axes[1,0])
    axes[1,0].set_title('Maximum heart rate vs diagnosis')

    # ST depression vs diagnosis
    sns.kdeplot(data=heart_df, x='oldpeak', hue='target', ax=axes[1,1])
    axes[1,1].set_title('ST depression vs diagnosis')

    # Exercise-induced angina vs diagnosis
    sns.countplot(x='exang', hue='target', data=heart_df, ax=axes[1,2])
    axes[1,2].set_title('Exercise-induced angina vs diagnosis')

    # ST segment slope vs diagnosis
    sns.countplot(x='slope', hue='target', data=heart_df, ax=axes[2,0])
    axes[2,0].set_title('ST segment slope vs diagnosis')

    # Number of major vessels vs diagnosis
    sns.countplot(x='ca', hue='target', data=heart_df, ax=axes[2,1])
    axes[2,1].set_title('Number of major vessels vs diagnosis')

    # thal vs diagnosis
    sns.countplot(x='thal', hue='target', data=heart_df, ax=axes[2,2])
    axes[2,2].set_title('thal vs diagnosis')
    axes[2,2].tick_params(axis='x', rotation=15)

    plt.tight_layout()
    plt.show()

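Key findings:

Patients with atypical chest pain (cp=atypical) show a markedly higher rate of heart disease.
People with a lower maximum heart rate (thalach) are at higher risk.
Exercise-induced angina (exang=yes) is strongly associated with heart disease.
The more major vessels (ca) are recorded, the higher the risk.

Verify the direction of each relationship against your own plots, because it depends on how target and the categorical codes are encoded in your copy of heart.csv.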
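Findings like these are easier to check when the plots are backed by numbers. The sketch below computes the positive rate per category with a groupby; the helper name `positive_rate_by` is hypothetical and the toy rows are made up for illustration, so they say nothing about the real dataset.

```python
import pandas as pd


def positive_rate_by(df, feature, target="target"):
    """Share of positive cases and sample count for each category of a feature."""
    grouped = df.groupby(feature)[target].agg(["mean", "size"])
    return grouped.rename(columns={"mean": "positive_rate", "size": "n"})


# Made-up rows, for illustration only
toy = pd.DataFrame({
    "exang": ["no", "no", "no", "yes", "yes", "yes", "yes", "no"],
    "target": [0, 0, 1, 1, 1, 0, 1, 0],
})
rates = positive_rate_by(toy, "exang")
print(rates)
```

Calling it as `positive_rate_by(heart_df, 'cp')` or with `'exang'`, `'ca'` and so on gives a table to compare against each count plot. Keeping `n` next to the rate helps avoid over-reading categories with only a handful of patients.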

2.3 Feature Correlation Analysis

Use a heatmap to examine the correlations between features:

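    import numpy as np  # needed for the triangular mask below

    # Correlation matrix of the numeric columns
    corr = heart_df.select_dtypes(include=['float64', 'int64']).corr()

    # Heatmap; the mask hides the redundant upper triangle
    plt.figure(figsize=(12, 10))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap='coolwarm',
                mask=np.triu(np.ones_like(corr, dtype=bool)))
    plt.title('Feature correlation heatmap')
    plt.show()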

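Tip: for highly correlated feature pairs (such as age and chol_ratio), consider keeping only one of the two or combining them into a single feature, to avoid multicollinearity.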
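Reading pairs off a heatmap by eye gets tedious as the number of features grows. The following sketch lists every pair whose absolute correlation exceeds a threshold; the function name `correlated_pairs` and the 0.8 default are assumptions to adjust, and the toy columns are made up.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Made-up columns: b is almost a multiple of a, c is unrelated to both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
    "c": [5, 1, 4, 2, 6, 3],
})
pairs = correlated_pairs(toy)
print(pairs)
```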

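3. Building and Evaluating Models: From Basic to Advanced

3.1 Data Preparation and Feature Processing

Before modeling, we need to finish the last feature processing steps: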

    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import train_test_split

    # Numeric and categorical feature lists
    numeric_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
    categorical_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']

    # Preprocessing: scale the numeric columns, one-hot encode the categorical ones.
    # Columns not listed here (including the derived ones from section 1.3) are dropped.
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numeric_features),
            ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
        ])

    # Split into training and test sets
    X = heart_df.drop('target', axis=1)
    y = heart_df['target']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

3.2 Comparing Several Models

We compare the performance of five common classification algorithms:

    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # Candidate models
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'KNN': KNeighborsClassifier(),
        'Decision Tree': DecisionTreeClassifier(max_depth=5),
        'Random Forest': RandomForestClassifier(n_estimators=100),
        'SVM': SVC(probability=True)
    }

    # Cross-validate each model inside the full preprocessing pipeline
    def evaluate_models(models, X, y):
        results = {}
        for name, model in models.items():
            pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                       ('classifier', model)])
            scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
            results[name] = scores.mean()
        return pd.DataFrame(list(results.items()), columns=['Model', 'Accuracy'])

    # Run the evaluation
    model_results = evaluate_models(models, X_train, y_train)
    print(model_results.sort_values('Accuracy', ascending=False))

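Model performance comparison (the tree-based models are not seeded above, so their scores vary slightly between runs):

Model                 Mean accuracy   Training time   Interpretability
Random Forest         0.85            Medium          Medium
Logistic Regression   0.83            Fast            High
SVM                   0.82            Slow            Low
Decision Tree         0.81            Fast            High
KNN                   0.79            Fast            Low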
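The table ranks models by accuracy alone. For a screening-style task it can also be useful to look at sensitivity (the share of true cases that are caught) and specificity (the share of healthy patients correctly cleared). This is a minimal pure-Python sketch; the helper name `binary_metrics` and the example labels are made up and are not results from the models above.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    def ratio(num, den):
        # Undefined when a class is absent from y_true
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Made-up labels and predictions, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In scikit-learn the same idea is available through the `scoring` argument of `cross_val_score`, for example `scoring='recall'`.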

3.3 Tuning the Random Forest

Based on the initial results, we focus on tuning the random forest model:

    from sklearn.model_selection import GridSearchCV

    # Parameter grid (the 'classifier__' prefix targets the pipeline step of that name)
    param_grid = {
        'classifier__n_estimators': [50, 100, 200],
        'classifier__max_depth': [None, 5, 10],
        'classifier__min_samples_split': [2, 5, 10],
        'classifier__min_samples_leaf': [1, 2, 4]
    }

    # Full pipeline: preprocessing plus classifier
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', RandomForestClassifier(random_state=42))])

    # Grid search with 5-fold cross-validation
    grid_search = GridSearchCV(pipeline, param_grid, cv=5,
                               scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_train, y_train)

    # Best parameters and best cross-validated score
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best score: {grid_search.best_score_:.4f}")

Feature importance analysis:

    # Best pipeline found by the grid search
    best_model = grid_search.best_estimator_

    # Feature importances from the fitted random forest
    importances = best_model.named_steps['classifier'].feature_importances_

    # Feature names: numeric columns first, then the one-hot encoded columns
    feature_names = (numeric_features +
                     list(best_model.named_steps['preprocessor']
                          .named_transformers_['cat']
                          .get_feature_names_out(categorical_features)))

    # Collect into a DataFrame sorted by importance
    importance_df = pd.DataFrame({'Feature': feature_names,
                                  'Importance': importances})
    importance_df = importance_df.sort_values('Importance', ascending=False)

    # Plot the top 15
    plt.figure(figsize=(12, 8))
    sns.barplot(x='Importance', y='Feature', data=importance_df.head(15))
    plt.title('Top 15 most important features')
    plt.tight_layout()
    plt.show()
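The split in section 3.1 sets aside a test set, but the scores reported so far are all cross-validation scores on the training portion. A final check on the held-out data gives a less optimistic estimate. The sketch below is self-contained and uses synthetic data from `make_classification` as a stand-in for heart.csv, with a deliberately small grid; none of the numbers it prints relate to the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: 300 rows, 13 features
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Small grid so the example runs quickly
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"max_depth": [3, 5], "n_estimators": [25, 50]},
    cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

# Score the refitted best model on data it has never seen
y_hat = search.predict(X_te)
test_accuracy = accuracy_score(y_te, y_hat)
print(f"CV accuracy: {search.best_score_:.3f}  test accuracy: {test_accuracy:.3f}")
print(classification_report(y_te, y_hat))
```

With the article's variables, the equivalent call would be `best_model.predict(X_test)` compared against `y_test`.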

4. Model Deployment: From Jupyter Notebook to Production

4.1 Saving and Loading the Model

A trained model needs to be persisted to disk:

    import joblib
    from datetime import datetime

    # Save the best pipeline (preprocessing and classifier together)
    model_filename = f"heart_disease_model_{datetime.now().strftime('%Y%m%d')}.pkl"
    joblib.dump(best_model, model_filename)

    # Example of loading it back
    loaded_model = joblib.load(model_filename)
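A date in the file name says when a model was saved but not what it needs in order to load correctly. A pickled scikit-learn estimator is tied to the library version that produced it, so recording that version and the expected input columns next to the file can save debugging time later. This is a minimal sketch; `save_with_metadata` and the JSON layout are hypothetical, and a plain dict stands in for the trained pipeline.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

import joblib
import sklearn


def save_with_metadata(model, directory, feature_names):
    """Save a model with joblib and write a JSON sidecar describing it."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    model_path = directory / f"heart_disease_model_{stamp}.pkl"
    joblib.dump(model, model_path)

    meta = {
        "model_file": model_path.name,
        "saved_at_utc": stamp,
        "sklearn_version": sklearn.__version__,
        "features": list(feature_names),
    }
    meta_path = model_path.with_suffix(".json")
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return model_path, meta_path


# Placeholder object instead of a real pipeline, for illustration only
with tempfile.TemporaryDirectory() as tmp:
    model_path, meta_path = save_with_metadata(
        {"placeholder": "model"}, tmp, ["age", "chol"])
    restored = joblib.load(model_path)
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
print(meta["features"])
```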

4.2 Creating a Prediction API

Use Flask to create a simple prediction endpoint:

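    from flask import Flask, request, jsonify
    import joblib
    import pandas as pd

    app = Flask(__name__)

    # Load the model saved in section 4.1.
    # Placeholder name: replace YYYYMMDD with the date in your saved file name.
    MODEL_PATH = "heart_disease_model_YYYYMMDD.pkl"
    model = joblib.load(MODEL_PATH)

    @app.route('/predict', methods=['POST'])
    def predict():
        try:
            # Read the JSON request body
            input_data = request.get_json()

            # Convert to a one-row DataFrame
            input_df = pd.DataFrame([input_data])

            # Predict the class and the class probabilities
            prediction = model.predict(input_df)
            probability = model.predict_proba(input_df)

            # Return the result
            return jsonify({
                'prediction': int(prediction[0]),
                'probability': float(probability[0][1]),
                'status': 'success'
            })
        except Exception as e:
            # Report failures with an error status code instead of 200
            return jsonify({'status': 'error', 'message': str(e)}), 400

    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=5000)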

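4.3 Testing and Using the API

Test the endpoint with curl. The restecg and slope fields are sent as raw integer codes, because section 1.2 does not map those two columns to labels and string values would not match any category seen in training; the codes shown here are illustrative.

    curl -X POST http://localhost:5000/predict \
      -H "Content-Type: application/json" \
      -d '{
        "age": 58,
        "sex": "male",
        "cp": "atypical",
        "trestbps": 140,
        "chol": 289,
        "fbs": "<=120",
        "restecg": 0,
        "thalach": 172,
        "exang": "no",
        "oldpeak": 0.0,
        "slope": 2,
        "ca": 0,
        "thal": "normal"
      }'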

Example response (the exact values depend on the trained model):

    {
      "prediction": 1,
      "probability": 0.92,
      "status": "success"
    }

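4.4 Deployment Recommendations

For a production environment, consider the following improvements:

Input validation: make sure every required field is present and its value falls within a plausible range
Performance monitoring: record prediction latency and system resource usage
Model versioning: support several model versions side by side, as well as A/B testing
Autoscaling: use Kubernetes or a similar technology to adjust resources to the load
Security: implement authentication and rate limiting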
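The first item in the list, input validation, can be sketched without any framework. The field names below follow the dataset's columns, but the numeric ranges and the function name `validate_payload` are illustrative assumptions, not clinical limits, and only two of the categorical fields are shown.

```python
# Illustrative ranges only; choose limits with domain experts
NUMERIC_RANGES = {
    "age": (18, 100),
    "trestbps": (60, 250),
    "chol": (80, 700),
    "thalach": (50, 250),
    "oldpeak": (0.0, 10.0),
}
ALLOWED_VALUES = {
    "sex": {"female", "male"},
    "exang": {"no", "yes"},
}


def validate_payload(payload):
    """Return a list of problems; an empty list means the payload passed."""
    errors = []
    for field, (low, high) in NUMERIC_RANGES.items():
        value = payload.get(field)
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{field}: expected a number")
        elif not low <= value <= high:
            errors.append(f"{field}: {value} is outside [{low}, {high}]")
    for field, allowed in ALLOWED_VALUES.items():
        if payload.get(field) not in allowed:
            errors.append(f"{field}: expected one of {sorted(allowed)}")
    return errors


good = {"age": 58, "trestbps": 140, "chol": 289, "thalach": 172,
        "oldpeak": 0.0, "sex": "male", "exang": "no"}
bad = dict(good, age=500)
del bad["sex"]

print(validate_payload(good))
print(validate_payload(bad))
```

In the Flask handler, a check like this would run before the DataFrame is built, returning the error list with a 400 status when it is not empty.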

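5. Project Summary and Next Steps

After finishing this heart disease prediction project, a few key lessons are worth sharing:

Data quality sets the ceiling: outliers and encoding problems in medical data can significantly affect model performance, and data cleaning usually takes more time than modeling.
The art of feature engineering: features built from medical knowledge (such as the cholesterol-to-age ratio) can be more predictive than the raw features, though each one should be validated against the model before it is kept.
The importance of interpretability: in medicine, being able to explain why a particular prediction was made often matters more than absolute accuracy.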

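Directions for further exploration:

Try deep learning models (such as neural networks) and compare their performance
Combine several models into a stronger ensemble
Build a front end so that clinicians can use the prediction tool more easily
Add model explanation features (such as SHAP values) to help explain what drives each prediction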
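As a starting point for the ensemble idea, scikit-learn's `VotingClassifier` can average the predicted probabilities of several models (soft voting). The sketch below is self-contained and runs on synthetic data from `make_classification`, so its scores say nothing about heart.csv; the choice of three base models is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 300 rows, 13 features
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Soft voting averages each model's class probabilities
ensemble = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ],
    voting="soft",
)

scores = cross_val_score(ensemble, X, y, cv=5, scoring="accuracy")
ensemble.fit(X, y)
proba = ensemble.predict_proba(X[:3])
print(f"mean CV accuracy: {scores.mean():.3f}")
```

An ensemble is harder to explain than any of its members, which matters given the emphasis on interpretability above.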
