
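Stop Treating CART as Just a Classification Tree: A Hands-On Guide to Predicting House Prices with Regression Trees in Python (Full Code Included)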

news 2026/7/23 17:41:34


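House price prediction is a classic problem in data analysis. Linear regression often struggles when the relationship between features and price is non-linear, whereas decision trees can capture complex interactions between features. This article walks through CART regression trees in practice, from the underlying idea to working code, and builds a house price prediction model step by step.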

1. Why choose a CART regression tree?

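Among decision tree algorithms, CART (Classification and Regression Trees) is one of the few that supports both classification and regression. Unlike ID3 and C4.5, which can produce multiway splits, CART always builds a binary tree, so every node asks a single yes/no question. For regression, it recursively splits the feature space in two, choosing each split to minimize the squared error of the two resulting regions, and predicts the mean of the training targets that fall into each leaf.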
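To make the "recursive binary split" idea concrete, here is a minimal sketch of how a single split could be chosen on one feature. The helper names `sse` and `best_split` and the toy numbers are made up for illustration; this is not how scikit-learn implements it internally, only the same idea in plain Python.

```python
# A sketch of one CART regression split on a single feature.
# `sse` and `best_split` are hypothetical helpers, not scikit-learn APIs.

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) for the best binary split on one feature."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        # Candidate threshold: midpoint between two neighbouring values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost


# Toy data (invented): average number of rooms vs. price
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
prices = [15.0, 17.0, 21.0, 35.0, 37.0]

threshold, cost = best_split(rooms, prices)
left_prices = [p for r, p in zip(rooms, prices) if r <= threshold]
right_prices = [p for r, p in zip(rooms, prices) if r > threshold]
left_mean = sum(left_prices) / len(left_prices)
right_mean = sum(right_prices) / len(right_prices)

print(f"split at rooms <= {threshold}: left mean {left_mean:.2f}, right mean {right_mean:.2f}")
```

On this toy data the best threshold is 6.5, and the two leaf predictions are simply the means of each side. A full tree repeats the same search inside each side, over every feature, until a stopping rule is hit.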

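Three main advantages of regression trees:

Captures non-linear relationships and feature interactions automatically, with no scaling or manual transformation of the inputs
Relatively robust to outliers in the features, because splits depend only on the ordering of values; classic CART can also handle missing values through surrogate splits, although support varies by implementation
Produces interpretable rules, which makes it well suited to business analysis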

from sklearn.tree import DecisionTreeRegressor

# A basic model takes just two lines (X_train and y_train are prepared in Section 3.1)
regressor = DecisionTreeRegressor(max_depth=3)
regressor.fit(X_train, y_train)

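2. Data preparation and feature engineering

We use the Boston housing dataset for the demonstration. It contains 13 feature variables and the median home value as the label. We start by loading and exploring the data: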

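import numpy as np
import pandas as pd

# load_boston() was removed in scikit-learn 1.2, so read the original file instead
# (this loading recipe follows scikit-learn's deprecation notice)
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']

df = pd.DataFrame(data, columns=feature_names)
df['PRICE'] = raw_df.values[1::2, 2]  # median home value in $1000s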

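Key features:

Feature	Description	Correlation with price
RM	Average number of rooms per dwelling	0.70
LSTAT	Percentage of lower-status population	-0.74
PTRATIO	Pupil-teacher ratio	-0.51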

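Note: decision trees do not need feature standardization, but highly correlated features share importance between them, which can make each one look less important than it really is.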
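The effect of correlated features on importance scores is easy to see on synthetic data. The sketch below is an illustration under made-up assumptions (a single informative feature, an exact copy of it, and a pure-noise column); real correlated features are rarely exact copies, so the pattern will be less clean in practice.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (invented for illustration): one informative feature plus noise
rng = np.random.default_rng(0)
signal = rng.uniform(0, 10, size=300)
noise_feature = rng.normal(size=300)
target = 3.0 * signal + rng.normal(scale=0.5, size=300)

# Same data, once with the informative feature alone and once with an exact copy added
X_single = np.column_stack([signal, noise_feature])
X_duplicated = np.column_stack([signal, signal, noise_feature])

tree_single = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_single, target)
tree_duplicated = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_duplicated, target)

imp_single = tree_single.feature_importances_
imp_duplicated = tree_duplicated.feature_importances_

print("importances without the copy:", imp_single.round(3))
print("importances with the copy:   ", imp_duplicated.round(3))
print("combined importance of the two copies:", round(imp_duplicated[0] + imp_duplicated[1], 3))
```

The two copies together carry the importance that the single feature had on its own, but how it is divided between them depends on tie-breaking inside the tree. When reading an importance chart, it therefore helps to look at groups of correlated features together rather than ranking them individually.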

3. The full regression tree modeling workflow

3.1 Building a basic model

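from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X = df.drop('PRICE', axis=1)
y = df['PRICE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = DecisionTreeRegressor(
    # Mean squared error as the split criterion
    # ('mse' was renamed to 'squared_error' in scikit-learn 1.0 and removed in 1.2)
    criterion='squared_error',
    max_depth=5,
    min_samples_split=10
)
model.fit(X_train, y_train)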

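3.2 Evaluation metrics

Unlike classification, regression needs its own evaluation metrics. Here we use mean squared error (MSE) and the coefficient of determination (R²):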

from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test)
print(f'MSE: {mean_squared_error(y_test, y_pred):.2f}')
print(f'R²: {r2_score(y_test, y_pred):.2f}')
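For readers who want to see what the two metrics actually compute, here is a small hand-rolled sketch. The function names `mse_manual` and `r2_manual` and the four data points are invented for illustration; in real code the scikit-learn functions above are the ones to use.

```python
# Hand-written versions of MSE and R², for illustration only.
# `mse_manual` and `r2_manual` are hypothetical names, not library functions.

def mse_manual(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values (invented): prices in $1000s
actual = [20.0, 25.0, 30.0, 35.0]
predicted = [22.0, 24.0, 29.0, 37.0]

mse_value = mse_manual(actual, predicted)
r2_value = r2_manual(actual, predicted)
print(f"MSE: {mse_value:.2f}")  # squared errors 4 + 1 + 1 + 4 = 10, divided by 4
print(f"R²: {r2_value:.2f}")    # 1 - 10 / 125
```

MSE is expressed in squared target units, so it is hard to compare across datasets, while R² measures how much better the model is than always predicting the mean: 1 is a perfect fit, 0 is no better than the mean.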

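Comparison of results (exact values vary with the random train/test split):

Model	MSE	R²
Linear regression	24.3	0.72
Regression tree (max_depth=3)	18.7	0.78
Regression tree (max_depth=5)	15.2	0.82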

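4. Tuning the key parameters in practice

Decision trees overfit easily, so their complexity has to be controlled through hyperparameters:

4.1 Main parameters to tune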

params = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

4.2 Grid search

from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    estimator=DecisionTreeRegressor(),
    param_grid=params,
    cv=5,
    scoring='neg_mean_squared_error'
)
grid_search.fit(X_train, y_train)

Best parameter combination:

print(grid_search.best_params_)
# Example output: {'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 10}
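One detail that is easy to trip over: with scoring='neg_mean_squared_error', the reported score is the negated MSE, so larger (closer to zero) is better. The sketch below shows how the score could be turned back into an error. It runs on synthetic data from make_regression and a smaller grid so that it is self-contained; the names `X_demo`, `y_demo`, and `search` are placeholders, not part of the article's pipeline.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data so the example runs without the housing dataset
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [3, 5, None], "min_samples_leaf": [1, 4]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated *negative* MSE of the best candidate
best_mse = -search.best_score_
best_rmse = float(np.sqrt(best_mse))  # back in the units of the target

print("best params:", search.best_params_)
print(f"cross-validated MSE: {best_mse:.2f}, RMSE: {best_rmse:.2f}")
```

By default GridSearchCV refits the best configuration on the whole training set, so `search.best_estimator_` can be used directly for prediction afterwards.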

5. Model interpretation and business use

5.1 Feature importance analysis

import matplotlib.pyplot as plt

features = X.columns
importances = model.feature_importances_
plt.barh(features, importances)
plt.xlabel('Feature Importance')
plt.show()

5.2 Reading the decision paths

The sklearn.tree module can print the decision rules the model has learned:

from sklearn.tree import export_text

tree_rules = export_text(model, feature_names=list(X.columns))
print(tree_rules[:500])  # print the first 500 characters

A typical decision path looks like this:

If LSTAT ≤ 14.8
  and RM ≤ 7.04 → predicted price = $23.5k
  otherwise (RM > 7.04) → predicted price = $45.2k
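To check how a single prediction comes about, one can trace a sample through the tree and compare the prediction with the mean of the training targets in the same leaf. The sketch below does this on synthetic data so that it runs on its own; `X_demo`, `y_demo`, and `tree` are placeholder names, and the same calls should work on the fitted `model` from this article.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

sample = X_demo[:1]                       # one row, kept 2-D for scikit-learn
leaf_id = tree.apply(sample)[0]           # index of the leaf the sample lands in
visited_nodes = tree.decision_path(sample).indices  # node ids visited on the way down

# Mean of the training targets that fall into the same leaf
training_leaves = tree.apply(X_demo)
leaf_mean = y_demo[training_leaves == leaf_id].mean()
prediction = tree.predict(sample)[0]

print("visited nodes:", visited_nodes.tolist())
print(f"leaf {leaf_id}: mean of training targets = {leaf_mean:.3f}, prediction = {prediction:.3f}")
```

With the squared-error criterion and no sample weights, the two numbers agree, which is exactly the "leaf mean as prediction" rule from Section 1.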

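6. Advanced tips and caveats

6.1 Dealing with overfitting

Use the ccp_alpha parameter for cost-complexity pruning
Set max_leaf_nodes to cap the number of leaves
Stop growth early (pre-pruning) with constraints such as max_depth, min_samples_leaf, or min_impurity_decrease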
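The first two options can be sketched as follows. The data is synthetic and the choice of alpha (simply the middle value of the pruning path) is arbitrary and only for illustration; in practice alpha would normally be chosen by cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# An unconstrained tree grows until the leaves are pure
full_tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# Candidate alphas for cost-complexity pruning, from weakest to strongest pruning
path = full_tree.cost_complexity_pruning_path(X_demo, y_demo)
middle_alpha = float(path.ccp_alphas[len(path.ccp_alphas) // 2])
alpha = max(middle_alpha, 0.0)  # guard against tiny negative values from rounding

pruned_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_demo, y_demo)
capped_tree = DecisionTreeRegressor(random_state=0, max_leaf_nodes=8).fit(X_demo, y_demo)

print("leaves, unconstrained:  ", full_tree.get_n_leaves())
print("leaves, ccp_alpha:      ", pruned_tree.get_n_leaves())
print("leaves, max_leaf_nodes: ", capped_tree.get_n_leaves())
```

Cost-complexity pruning grows the full tree first and then removes the branches that contribute least per leaf, whereas max_leaf_nodes limits growth up front. Both end up with a smaller tree than the unconstrained one.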

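6.2 Handling categorical features

All features in the Boston housing dataset are numeric, but real projects often include categorical variables: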

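# Encode an ordered category with OrdinalEncoder
# ('renovation_grade' is a hypothetical column; the Boston dataset does not contain it)
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
X_train['renovation_grade'] = encoder.fit_transform(X_train[['renovation_grade']]).ravel()
# Reuse the fitted encoder on the test set instead of fitting again
X_test['renovation_grade'] = encoder.transform(X_test[['renovation_grade']]).ravel()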

7. Complete project code

# Complete Boston house price prediction workflow
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load the data
# (load_boston() was removed in scikit-learn 1.2; this recipe follows its deprecation notice)
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
df = pd.DataFrame(data, columns=feature_names)
df['PRICE'] = raw_df.values[1::2, 2]  # median home value in $1000s

# Split features and target
X = df.drop('PRICE', axis=1)
y = df['PRICE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = DecisionTreeRegressor(
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=2,
    random_state=42
)
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
print(f'MSE: {mean_squared_error(y_test, y_pred):.2f}')
print(f'R²: {r2_score(y_test, y_pred):.2f}')

# Plot feature importances
plt.figure(figsize=(10, 6))
pd.Series(model.feature_importances_, index=X.columns).sort_values().plot.barh()
plt.title('Feature Importance')
plt.show()

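In real projects, I have found that keeping a regression tree's max_depth between 5 and 7 gives good predictive performance without making the model overly complex. When higher accuracy is needed, ensemble methods such as random forests or gradient-boosted trees are worth trying: they use regression trees as base learners and often improve predictive performance substantially.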
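As a starting point for that comparison, the sketch below cross-validates a single tree against the two ensembles on a synthetic benchmark (make_friedman1). The dataset, the settings, and the name `candidates` are assumptions made for illustration; whether and by how much the ensembles win depends on the data, so the numbers should be checked on your own problem.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear regression benchmark (stand-in for real housing data)
X_demo, y_demo = make_friedman1(n_samples=300, noise=1.0, random_state=0)

candidates = {
    "single tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}

# Mean 5-fold cross-validated R² for each model
mean_r2 = {
    name: float(cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="r2").mean())
    for name, estimator in candidates.items()
}

for name, score in mean_r2.items():
    print(f"{name}: mean R² = {score:.3f}")
```

All three models share the same fit/predict interface, so swapping an ensemble into the workflow from Section 7 only requires changing the estimator.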


http://www.jsqmd.com/news/676503/