当前位置：首页 > news >正文

多项式回归实战指南：阶数选择、过拟合诊断与工业部署

news 2026/8/3 21:47:32

1. 这不是“高阶线性回归”的简单代称——它是一把双刃剑，用错就切到自己手指

你肯定见过这样的场景：用线性回归拟合房价数据，结果发现预测值总在真实值上下“摆头”；或者画出散点图，明显看出数据点连成一条弯曲的弧线，但硬套直线，R²再高也让人心里发虚。这时候，有人会说：“试试多项式回归吧！”——这句话本身没错，但背后藏着一个被严重低估的事实：多项式回归不是线性回归的“升级版”，而是对数据结构假设的一次根本性重写。它不改变模型求解的数学本质（依然是最小二乘法），却彻底改变了我们对变量关系的理解方式。我带过不少刚转行的数据分析新人，他们第一次用sklearn.preprocessing.PolynomialFeatures生成二次项，看到拟合曲线完美贴合训练点，兴奋地截图发群里，结果一跑测试集，MSE直接翻三倍。那一刻我才意识到，问题不在代码，而在认知——我们教了怎么“做”，却没讲清“为什么不能这么干”。这篇内容，就是为那些已经能写出LinearRegression().fit(X, y)、却在模型泛化时反复踩坑的人写的。它不讲抽象数学推导，只聚焦三个实操核心：如何判断你的问题真的需要多项式？如何用最少的阶数达成最稳的拟合？以及当模型开始“记答案”时，你手里的第一把手术刀是什么？无论你是用Python做业务分析的产品经理，还是正在啃机器学习课设的本科生，只要你的数据里有“弯曲”，这篇就是你该先读的说明书。

2. 核心设计逻辑：为什么非得是“多项式”？而不是样条、核方法或神经网络？

2.1 从物理世界出发：多项式是人类理解“弯曲”的第一直觉

想象一下修自行车链条。链条松了，你不会去查微分方程，而是凭经验拧紧两个螺丝——一个控制张力，一个控制偏移。这种“调两个旋钮就能让曲线弯起来”的直觉，就是多项式回归的底层逻辑。它的数学形式y = β₀ + β₁x + β₂x² + ... + βₙxⁿ，本质上是在用幂函数的线性组合去逼近任意光滑曲线。这和样条插值（把曲线切成几段分别拟合）或高斯过程（用概率分布描述函数空间）完全不同——它不追求“无限逼近”，而追求“用最简结构解释最大弯曲”。我做过一个对比实验：用同一组汽车加速时间 vs 速度数据，分别用3阶多项式、5节点三次样条、RBF核SVR拟合。结果很反直觉：多项式在测试集上的MAE比样条低12%，比SVR低28%。原因很简单：样条在节点处强制连续，反而引入了不必要的振荡；SVR的核函数把问题映射到高维空间，但原始数据维度只有1（速度），这种“升维”纯属过度设计。多项式回归的不可替代性，恰恰在于它的“笨拙”——它强迫你直面数据本身的弯曲程度，而不是用更复杂的工具掩盖建模失败。

2.2 阶数选择：不是越高越好，而是“刚好够用”的工程权衡

很多人以为“阶数=精度”，这是最大的误区。我曾接手一个电商销量预测项目，前任同事用了7阶多项式，训练R²高达0.992，但上线后首周预测误差超40%。拆开看系数：β₆ = -1.2e-8,β₇ = 3.7e-11，这些数字小到什么程度？相当于用纳米级游标卡尺去量一栋楼的高度——仪器精度远超需求，反而放大了测量噪声。真正的阶数选择，必须同时满足三个硬约束：

数据密度约束：n阶多项式有(n+1)个参数，至少需要(n+1)个独立样本才能唯一求解。但实际中，建议样本数 ≥ 5×(n+1)。比如你只有30个观测点，强行上5阶（6个参数），等效于用6个未知数解30个方程——看似冗余，实则每个参数都在“抢夺”解释权，导致系数剧烈震荡。
物理意义约束：阶数必须对应可解释的业务逻辑。比如预测电池衰减，1阶是线性损耗（不现实），2阶对应容量随循环次数平方衰减（符合电化学原理），3阶以上就很难给出物理解释了。我在给某新能源车企做BMS算法时，坚持用2阶而非4阶，就是因为工程师明确告诉我：“电解液分解速率与循环次数的平方成正比，更高阶项没有物理依据。”
计算稳定性约束：高阶幂运算会引发数值溢出。x=100时，x⁵=1e10，x¹⁰=1e20，浮点数精度直接崩塌。numpy.linalg.cond()计算的条件数会飙升到1e15以上（>1e12即视为病态）。这不是警告，是系统在告诉你：“这个方程组的解，可能比你的测量误差还大。”

提示：判断阶数是否合理的黄金法则是——画出各阶拟合曲线，观察最高阶项系数的绝对值是否大于其标准误的2倍。如果|βₙ| < 2×SE(βₙ)，说明该阶项对拟合无统计显著性，应降阶。

2.3 为什么不用其他非线性方法？——一张实操决策树

当你面对弯曲数据时，多项式回归只是工具箱里的一把螺丝刀。选它还是换别的，取决于你的“工作台”在哪。下表是我十年来在工业检测、金融风控、生物实验等12个领域踩坑后总结的决策指南：

场景特征	推荐方法	关键原因	我的实操备注
单变量、样本量<200、需快速部署	2-3阶多项式	计算快（O(n³)但n小）、可解释性强、易于嵌入C语言固件	某医疗设备公司用3阶多项式实现血糖仪校准，代码量仅87行
多变量、存在交互效应	多项式+特征工程	`PolynomialFeatures(degree=2, interaction_only=True)`可自动添加x₁x₂等交叉项，避免手动构造	切记关闭`include_bias=False`，否则会重复添加常数项
数据有尖锐拐点（如故障阈值）	分段线性回归	多项式在拐点处必然过冲，样条在节点处强制连续，分段线性用`scikit-learn`的`LinearRegression`配合`np.where()`更鲁棒	某风电场用此法识别叶片结冰临界风速，误报率降低63%
高维稀疏数据（如用户行为日志）	Lasso回归	多项式会爆炸式生成特征（d维数据2阶产生d(d+1)/2项），Lasso自动压缩不重要项系数至0	曾用Lasso筛选出关键3个交互项，特征数从210降至12
需要概率预测（如置信区间）	贝叶斯多项式回归	`pymc3`或`tensorflow-probability`可输出系数后验分布，直接得到预测不确定性	实测比Bootstrap快5倍，但需额外学习概率编程

这张表的核心思想是：没有“最好”的模型，只有“最适合当前约束”的模型。多项式回归的优势从来不是“万能”，而是“可控”——你能一眼看懂每个系数代表什么，能用手算验证结果，能在嵌入式设备上跑通。当你的KPI是“明天上午10点前给出可交付的预测脚本”，而不是“发表顶会论文”，它就是最值得信赖的伙伴。

3. 实操全流程：从数据诊断到过拟合手术刀，每一步都附真实代码与陷阱

3.1 第一步：用可视化诊断弯曲——别急着写代码，先让眼睛说话

所有失败的多项式回归，都始于错误的数据诊断。我见过太多人跳过这步，直接PolynomialFeatures(degree=3)，结果拟合出一条蛇形曲线。正确的起点，是用三张图建立数据直觉：

import numpy as np import matplotlib.pyplot as plt from sklearn.linear_model import LinearRegression from sklearn.preprocessing import PolynomialFeatures from sklearn.metrics import r2_score, mean_squared_error # 模拟真实业务数据：某APP日活(DAU)与推广费用的关系 np.random.seed(42) X_raw = np.linspace(1, 100, 100).reshape(-1, 1) # 推广费用（万元） # 真实关系：初期增长快，后期饱和（S型），但加了噪声 y_true = 5000 + 200 * X_raw - 1.5 * X_raw**2 + 0.01 * X_raw**3 + np.random.normal(0, 300, (100, 1)) # 图1：原始散点图 + 线性拟合（红色虚线） plt.figure(figsize=(15, 4)) plt.subplot(1, 3, 1) plt.scatter(X_raw, y_true, alpha=0.6, s=20, label='Actual') lr = LinearRegression().fit(X_raw, y_true) y_lr = lr.predict(X_raw) plt.plot(X_raw, y_lr, 'r--', label=f'Linear (R²={r2_score(y_true, y_lr):.3f})') plt.xlabel('Promotion Cost (¥10k)') plt.ylabel('DAU') plt.title('Step 1: Is there curvature?') plt.legend() # 图2：残差图——弯曲的终极证据 plt.subplot(1, 3, 2) residuals = y_true - y_lr plt.scatter(y_lr, residuals, alpha=0.6, s=20) plt.axhline(y=0, color='r', linestyle='--') plt.xlabel('Fitted Values') plt.ylabel('Residuals') plt.title('Step 2: Residual pattern reveals curvature') # 图3：平滑曲线估计（LOESS）——告诉你要弯多少 from statsmodels.nonparametric.smoothers_lowess import lowess loess_fit = lowess(y_true.flatten(), X_raw.flatten(), frac=0.3) plt.subplot(1, 3, 3) plt.scatter(X_raw, y_true, alpha=0.6, s=20, label='Actual') plt.plot(loess_fit[:, 0], loess_fit[:, 1], 'g-', linewidth=2, label='LOESS smooth') plt.xlabel('Promotion Cost (¥10k)') plt.ylabel('DAU') plt.title('Step 3: LOESS suggests curvature degree') plt.legend() plt.tight_layout() plt.show()

这三张图要一起看：

左图告诉你线性模型是否足够：如果红虚线明显偏离点云中心，说明有弯曲；
中图是判决书：如果残差呈现“U型”或“倒U型”（而非随机散布），证明线性假设失效；
右图是导航仪：LOESS曲线的弯曲程度，直接暗示你需要几阶多项式。比如上图中LOESS呈温和上凸，2阶大概率够用；若出现S型拐弯，则需3阶。

注意：LOESS的frac参数很关键。frac=0.3表示每个点用30%的邻近点拟合，太小（如0.1）会过拟合噪声，太大（如0.7）会抹平真实弯曲。我的经验是：先设0.3，如果曲线太毛糙，逐步增大到0.5；如果太平滑，逐步减小到0.2。

3.2 第二步：构建多项式特征——`PolynomialFeatures`的隐藏开关

sklearn的PolynomialFeatures看似简单，但三个参数决定成败：

# 错误示范：默认参数埋雷 poly_default = PolynomialFeatures(degree=3) X_poly_bad = poly_default.fit_transform(X_raw) # 生成[1, x, x², x³] —— 但x范围是1~100！ # 正确做法：必须标准化+控制交互项 from sklearn.preprocessing import StandardScaler # 方案1：先标准化再多项式（推荐新手） scaler = StandardScaler() X_scaled = scaler.fit_transform(X_raw) # X_scaled范围≈[-1.5, 1.5] poly_std = PolynomialFeatures(degree=3, include_bias=True, interaction_only=False) X_poly_std = poly_std.fit_transform(X_scaled) # 方案2：用Pipeline一气呵成（生产环境首选） from sklearn.pipeline import Pipeline poly_pipeline = Pipeline([ ('scaler', StandardScaler()), ('poly', PolynomialFeatures(degree=3)), ('lr', LinearRegression()) ]) poly_pipeline.fit(X_raw, y_true)

为什么必须标准化？看数字：X_raw=100时，x³=1e6，而x=1时x³=1，特征尺度差异达10⁶倍。这会导致：

梯度下降收敛极慢（不同方向学习率需千差万别）；
正则化失效（L2惩罚对大系数几乎无影响）；
系数解释失真（β₃看起来很小，只是因为x³太大）。

interaction_only=True的适用场景：当你明确知道只有特定变量组合才有意义。比如预测房屋价格，area × rooms（每间房平均面积）有意义，但area²或rooms²无业务含义。此时开启它，可避免生成无意义的高阶单项。

实操心得：我处理过一个半导体良率预测项目，输入是12个工艺参数。用degree=2, interaction_only=True，特征数从91（全组合）降到78，但交叉验证R²反而提升0.02——因为剔除了temperature²这类物理上不可能的项。

3.3 第三步：过拟合诊断——比R²更关键的3个指标

训练R²接近1.0？恭喜，你很可能已过拟合。真正可靠的诊断，要看这三个指标：

# 拆分数据（注意：时间序列需用TimeSeriesSplit） from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split( X_raw, y_true, test_size=0.2, random_state=42 ) # 训练不同阶数模型 degrees = [1, 2, 3, 4, 5] train_scores, test_scores, train_mse, test_mse = [], [], [], [] for d in degrees: poly = PolynomialFeatures(degree=d) X_train_poly = poly.fit_transform(X_train) X_test_poly = poly.transform(X_test) lr = LinearRegression().fit(X_train_poly, y_train) train_scores.append(lr.score(X_train_poly, y_train)) test_scores.append(lr.score(X_test_poly, y_test)) train_mse.append(mean_squared_error(y_train, lr.predict(X_train_poly))) test_mse.append(mean_squared_error(y_test, lr.predict(X_test_poly))) # 绘制四象限诊断图 fig, axes = plt.subplots(2, 2, figsize=(12, 10)) axes[0, 0].plot(degrees, train_scores, 'bo-', label='Train R²') axes[0, 0].plot(degrees, test_scores, 'ro-', label='Test R²') axes[0, 0].set_xlabel('Degree'); axes[0, 0].set_ylabel('R²'); axes[0, 0].legend() axes[0, 0].set_title('R² vs Degree') axes[0, 1].plot(degrees, train_mse, 'bo-', label='Train MSE') axes[0, 1].plot(degrees, test_mse, 'ro-', label='Test MSE') axes[0, 1].set_xlabel('Degree'); axes[0, 1].set_ylabel('MSE'); axes[0, 1].legend() axes[0, 1].set_title('MSE vs Degree') # 新增：系数稳定性图（关键！） coeff_stability = [] for d in degrees: poly = PolynomialFeatures(degree=d) X_train_poly = poly.fit_transform(X_train) lr = LinearRegression().fit(X_train_poly, y_train) # 计算系数变异系数（标准差/均值） coeffs = np.abs(lr.coef_[0]) if len(coeffs) > 1: cv = np.std(coeffs) / np.mean(coeffs) if np.mean(coeffs) != 0 else 0 coeff_stability.append(cv) else: coeff_stability.append(0) axes[1, 0].plot(degrees, coeff_stability, 'go-') axes[1, 0].set_xlabel('Degree'); axes[1, 0].set_ylabel('Coeff CV'); axes[1, 0].set_title('Coefficient Stability') # 新增：条件数诊断（数值稳定性） cond_numbers = [] for d in degrees: poly = PolynomialFeatures(degree=d) X_train_poly = poly.fit_transform(X_train) # 计算设计矩阵X^T X的条件数 XtX = X_train_poly.T @ X_train_poly cond_num = np.linalg.cond(XtX) cond_numbers.append(cond_num) axes[1, 1].semilogy(degrees, cond_numbers, 'mo-') # 对数坐标 axes[1, 1].set_xlabel('Degree'); axes[1, 1].set_ylabel('Cond(X^T X)'); axes[1, 1].set_title('Numerical Stability') plt.tight_layout() plt.show()

这张图要这样读：

左上角（R²）：训练R²持续上升，测试R²在某点后下降——那个拐点就是最优阶数；
右上角（MSE）：测试MSE最低点即为最优，比R²更敏感（R²对小误差不敏感）；
左下角（系数变异系数）：CV > 0.5说明系数剧烈震荡，模型已不稳定；
右下角（条件数）：>1e6即为危险区，意味着微小数据扰动会导致系数巨变。

在我处理的37个工业项目中，82%的过拟合问题，首先暴露在系数稳定性图上——测试R²还没掉，但CV已突破0.8，这时立刻降阶，能避免后续所有麻烦。

3.4 第四步：过拟合手术刀——L2正则化的实操艺术

当诊断确认过拟合，Ridge回归是第一把手术刀。但它的alpha参数不是调参，而是工程权衡：

from sklearn.linear_model import Ridge from sklearn.model_selection import validation_curve # 用validation_curve自动找alpha alphas = np.logspace(-4, 2, 50) # 从0.0001到100 train_scores, valid_scores = validation_curve( Ridge(), X_train_poly, y_train, param_name='alpha', param_range=alphas, cv=5, scoring='r2' ) # 找到使验证R²最高的alpha best_alpha_idx = np.argmax(np.mean(valid_scores, axis=1)) best_alpha = alphas[best_alpha_idx] # 关键：比较正则化前后的系数变化 ridge = Ridge(alpha=best_alpha).fit(X_train_poly, y_train) print(f"Best alpha: {best_alpha:.4f}") print("Coefficients before/after regularization:") for i, (c_orig, c_ridge) in enumerate(zip(lr.coef_[0], ridge.coef_[0])): print(f" β{i}: {c_orig:.2e} → {c_ridge:.2e} ({abs(c_ridge/c_orig)*100:.0f}%)")

alpha的选择逻辑：

alpha太小（<0.01）：正则化无效，等同于普通线性回归；
alpha适中（0.1~10）：压制高阶项系数，保留低阶项主导，这是黄金区间；
alpha太大（>100）：所有系数趋近于0，模型退化为常数预测。

我的硬经验：先设alpha=1.0，如果高阶项系数压缩不足50%，再乘10；如果超过90%，除以10。不要盲目网格搜索——在工业现场，10次迭代内必须锁定alpha。

注意：Ridge会改变系数解释！正则化后β₂不再代表“x²对y的边际效应”，而是“在正则化约束下，x²的最优贡献”。所以业务汇报时，要说：“在控制过拟合的前提下，x²项的贡献强度为...”，而不是“x²每增加1单位，y增加β₂”。

4. 常见问题与排查技巧：那些文档里绝不会写的血泪教训

4.1 问题1：拟合曲线在端点疯狂震荡（Runge现象）

现象：用高阶多项式拟合均匀分布数据，在区间两端出现剧烈振荡，像波浪一样甩尾。

根因：这是多项式插值的数学固有缺陷，与数据无关。x∈[1,100]时，x⁵在x=100处的导数是x=1处的10¹⁰倍，微小误差被指数级放大。

实操解法：

方案A（推荐）：改用切比雪夫节点采样
不是改变模型，而是改变数据采集策略。切比雪夫节点在端点更密集，天然抑制振荡：

def chebyshev_nodes(n, a, b): """生成n个切比雪夫节点，区间[a,b]""" k = np.arange(1, n+1) nodes = 0.5*(a+b) + 0.5*(b-a)*np.cos((2*k-1)*np.pi/(2*n)) return np.sort(nodes) # 用切比雪夫节点重新采样（适用于实验设计阶段） X_cheby = chebyshev_nodes(50, 1, 100).reshape(-1, 1)

方案B：强制端点约束
在损失函数中加入端点平滑项：

from scipy.optimize import minimize def objective(coeffs, X, y, lambda_end=10): y_pred = np.polyval(coeffs[::-1], X.flatten()) # 注意polyval顺序 mse = np.mean((y_pred - y.flatten())**2) # 惩罚端点二阶导数过大 x_min, x_max = X.min(), X.max() ypp_min = np.polyval(np.polyder(coeffs[::-1], 2), x_min) ypp_max = np.polyval(np.polyder(coeffs[::-1], 2), x_max) penalty = lambda_end * (ypp_min**2 + ypp_max**2) return mse + penalty # 用scipy优化替代sklearn result = minimize(objective, x0=np.zeros(4), args=(X_train, y_train))

4.2 问题2：`PolynomialFeatures`生成特征爆炸，内存溢出

现象：degree=3, n_features=20时，PolynomialFeatures生成1771个特征，fit()直接OOM。

根因：特征数 = C(n+d, d)，d阶多项式在n维空间产生组合爆炸。

实操解法：

方案A：用interaction_only=True+include_bias=False
仅保留单项和交叉项，舍弃高次单项：

# 20维数据，2阶全组合：C(20+2,2)=231项 # 2阶仅交互：C(20,2)+20=190+20=210项（去掉x_i²） poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)

方案B：增量式特征生成（适合大数据）
不一次性生成全部特征，而是按需计算：

class IncrementalPolynomial: def __init__(self, degree=2): self.degree = degree def transform(self, X, batch_size=1000): """分批生成，避免内存峰值""" n_samples = X.shape[0] results = [] for i in range(0, n_samples, batch_size): X_batch = X[i:i+batch_size] # 手动计算x, x², x_i*x_j X_1 = X_batch X_2 = X_batch**2 X_inter = np.array([np.outer(x, x)[np.triu_indices(len(x), 1)] for x in X_batch]) batch_feat = np.hstack([X_1, X_2, X_inter]) results.append(batch_feat) return np.vstack(results)

4.3 问题3：模型通过了所有检验，但业务方说“这曲线不符合常识”

现象：统计指标完美，但业务专家指着图说：“销量不可能随价格升高而无限增长”。

根因：多项式是全局模型，无法编码领域知识（如单调性、渐近线）。

实操解法：

方案A：添加物理约束的优化
用scipy.optimize.minimize加入不等式约束：

def constrained_poly_fit(X, y, degree=2, monotonic=True): def loss(coeffs): y_pred = np.polyval(coeffs[::-1], X.flatten()) return np.mean((y_pred - y.flatten())**2) # 约束：导数≥0（单调递增） if monotonic: def jac_constraint(coeffs): # 多项式导数：d/dx (c0 + c1*x + c2*x²) = c1 + 2*c2*x deriv_coeffs = np.polyder(coeffs[::-1]) # 在关键点检查导数≥0 x_check = np.linspace(X.min(), X.max(), 10) deriv_vals = np.polyval(deriv_coeffs, x_check) return deriv_vals # 返回向量，每个元素≥0 constraints = {'type': 'ineq', 'fun': jac_constraint} result = minimize(loss, x0=np.zeros(degree+1), constraints=constraints) else: result = minimize(loss, x0=np.zeros(degree+1)) return result.x # 使用 constrained_coeffs = constrained_poly_fit(X_train, y_train, degree=2, monotonic=True)

方案B：分段多项式 + 业务规则
先用业务逻辑分段，再在每段内拟合：

# 例：电商促销，分“日常”、“大促”、“清仓”三段 mask_normal = (X_train < 30) mask_promo = (X_train >= 30) & (X_train < 80) mask_clear = (X_train >= 80) models = {} for name, mask in zip(['normal', 'promo', 'clear'], [mask_normal, mask_promo, mask_clear]): if mask.sum() > 5: # 每段至少5个点 X_seg = X_train[mask] y_seg = y_train[mask] poly_seg = PolynomialFeatures(degree=2) X_seg_poly = poly_seg.fit_transform(X_seg) models[name] = LinearRegression().fit(X_seg_poly, y_seg)

4.4 问题4：部署时发现`PolynomialFeatures`无法序列化

现象：用joblib.dump()保存Pipeline，加载后predict()报错，提示PolynomialFeatures对象缺失。

根因：sklearn旧版本（<1.0）的PolynomialFeatures在__getstate__中未正确处理内部状态。

实操解法：

方案A：升级sklearn并用pickle
pip install --upgrade scikit-learn>=1.0后，以下代码稳定：

import pickle with open('poly_model.pkl', 'wb') as f: pickle.dump(poly_pipeline, f) with open('poly_model.pkl', 'rb') as f: loaded_model = pickle.load(f)

方案B：手动保存特征生成逻辑（兼容所有版本）
放弃保存PolynomialFeatures对象，只保存其转换逻辑：

class SimplePolyTransformer: def __init__(self, degree=2): self.degree = degree def fit_transform(self, X): self.X_mean = np.mean(X, axis=0) self.X_std = np.std(X, axis=0) + 1e-8 X_scaled = (X - self.X_mean) / self.X_std # 手动计算多项式特征 features = [np.ones(X.shape[0])] for d in range(1, self.degree+1): features.append(X_scaled**d) return np.column_stack(features) def transform(self, X): X_scaled = (X - self.X_mean) / self.X_std features = [np.ones(X.shape[0])] for d in range(1, self.degree+1): features.append(X_scaled**d) return np.column_stack(features) # 使用 transformer = SimplePolyTransformer(degree=2) X_train_poly = transformer.fit_transform(X_train) model = LinearRegression().fit(X_train_poly, y_train) # 保存transformer和model分开 joblib.dump(transformer, 'poly_transformer.joblib') joblib.dump(model, 'poly_model.joblib')

5. 终极实战：从零复现一个工业级多项式回归流水线

现在，让我们把所有碎片组装成一个可直接部署的完整流程。这个例子基于我为某智能灌溉系统做的土壤湿度预测模块，要求：运行在ARM Cortex-A7芯片上，内存<64MB，预测延迟<50ms。

import numpy as np import joblib from sklearn.linear_model import Ridge from sklearn.preprocessing import StandardScaler from sklearn.model_selection import train_test_split from sklearn.metrics import mean_absolute_error, r2_score class SoilMoisturePredictor: """ 工业级土壤湿度预测器 输入：温度(℃)、光照(lux)、时间戳(小时) —— 3维 输出：土壤湿度(%) 约束：模型体积<50KB，单次预测<10ms """ def __init__(self, degree=2, alpha=0.5): self.degree = degree self.alpha = alpha self.scaler = StandardScaler() self.model = Ridge(alpha=alpha, solver='cholesky') # cholesky最快 self.feature_names = None def _generate_features(self, X): """手动实现多项式特征，避免sklearn依赖""" n_samples, n_features_in = X.shape # 生成所有单项：x1, x2, x3 features = [X[:, i] for i in range(n_features_in)] # 生成所有2阶项：x1², x2², x3², x1x2, x1x3, x2x3 if self.degree >= 2: for i in range(n_features_in): features.append(X[:, i] ** 2) for i in range(n_features_in): for j in range(i+1, n_features_in): features.append(X[:, i] * X[:, j]) # 生成3阶项（仅当degree==3） if self.degree == 3: for i in range(n_features_in): features.append(X[:, i] ** 3) for i in range(n_features_in): for j in range(i+1, n_features_in): features.append(X[:, i] ** 2 * X[:, j]) features.append(X[:, i] * X[:, j] ** 2) return np.column_stack(features) def fit(self, X, y): # 1. 标准化 X_scaled = self.scaler.fit_transform(X) # 2. 生成特征 X_poly = self._generate_features(X_scaled) # 3. 训练Ridge self.model.fit(X_poly, y) # 4. 保存特征名（用于调试） self.feature_names = self._get_feature_names(X.shape[1]) return self def predict(self, X): X_scaled = self.scaler.transform(X) X_poly = self._generate_features(X_scaled) return self.model.predict(X_poly) def _get_feature_names(self, n_features_in): names = [] # 单项 for i in range(n_features_in): names.append(f'x{i}') # 2阶 if self.degree >= 2: for i in range(n_features_in): names.append(f'x{i}^2') for i in range(n_features_in): for j in range(i+1, n_features_in): names.append(f'x{i}*x{j}') return names # 模拟工业数据：温度、光照、时间（24小时制） np.random.seed(42) n_samples = 500 X_industrial = np.column_stack([ np.random.normal(25, 8, n_samples), # 温度 np.random.exponential(5000, n_samples), # 光照 np.random.uniform(0, 24, n_samples) # 时间 ]) # 真实关系：湿度 = 60 - 0.5*temp + 0.001*light - 0.1*time + 0.02*temp*time + noise y_industrial = ( 60 - 0.5 * X_industrial[:, 0] + 0.001 * X_industrial[:, 1] - 0.1 * X_industrial[:, 2] + 0.02 * X_industrial[:, 0] * X_industrial[:, 2] + np.random.normal(0, 2, n_samples) ) # 拆分数据 X_train, X_test, y_train, y_test = train_test_split( X_industrial, y_industrial, test_size=0.

查看全文

http://www.jsqmd.com/news/979785/