1 Introduction
- 定义 1.1 (Convex sets and convex functions)
- 我们关心的问题是定义在凸集上的凸函数的最小值问题
1.1 机器学习里的凸优化问题
$$\min_{x\in\mathbb{R}^n}\sum_{i=1}^mf_i(x) + \gamma\mathcal{R}(x)$$ 最小化表示代价+模型复杂性.
- 分类问题
- SVM: $f_i(x) = \max(0, 1-y_ix^Tw_i) \text{(the hinge loss)}, \mathcal{R}(x) = |x|_2^2$
- logistic regression: $f_i(x) = \log(1+\exp(-y_ix^Tw_i)), \mathcal{R}(x) = |x|_2^2$
- 回归问题: $f_i(x) = (x^Tw_i-y_i)^2, \mathcal{R}(x) = 0$
- $\mathcal{R}(x) = |x|_2^2$: the ridge regression problem
- $\mathcal{R}(x) = |x|_1$: the LASSO problem
- 稀疏逆协方差估计问题 (*) (书里提到的变形优化目标的文章叫 Exact matrix completion via convex optimization)
1.2 基本的凸性质
定理 1.1 (Separation Theorem) (*, 证明)
定理 1.2 (Supporting Hyperplane Theorem) (*, 证明)
定义 1.2 (Subgradients, 次梯度) $\mathcal{X}\subset \mathbb{R}, f: \mathcal{X}\rightarrow \mathbb{R}, g\in\mathcal{R}$是$f$在$x\in\mathcal{X}$处的次梯度, 如果对于任意$y\in\mathcal{X}$, 有 $$f(x)-f(y)\leq g^T(x-y)$$ $x$处的次梯度构成的集合被标记为$\partial f(x)$.
另一种说法是, 对于任意$x\in\mathcal{X}, g\in\partial f(x)$, $f$在线性函数$y\rightarrow f(x)+g^T(y-x)$
命题 1.1 (Existence of subgradients) 凸函数在凸集上的任意一点处均存在次梯度; 相反, 若一个函数在凸集上的任意一点处均存在次梯度, 则该函数为凸函数. 可微凸函数在某一点处的导数为该点处的次梯度.
命题 1.2 (Local minima are global minima) 凸函数的局部最优就是全局最优. 且只发生在 $0\in \partial f(x)$.
命题 1.3 (First order optimality condition) $f$为凸函数且在闭凸集$\mathcal{X}$上可微, 则$x^$是最小值当且仅当对任意$y\in\mathcal{X}$, 有$\nabla f(x^)^T(x^*-y)\leq 0$
1.4 Black-box model
The set of constraint $\mathcal{X}$ is known, and the objective function $f: \mathcal{X} \rightarrow \mathbf{R}$ is unknown but can be accessed through queries to oracles:
- A zeroth order oracle takes as input a point $x \in \mathcal{X}$ and outputs the value of $f$ at $x$.
- A first order oracle takes as input a point $x \in \mathcal{X}$ and outputs a subgradient of $f$ at $x$.
2 Convex optimization in finite dimension
本章讲解几个黑盒模型用于解决以下问题 $$\min. f(x) s.t. x\in\mathcal{X}$$ 其中$\mathcal{X}$是一个凸体, $f: \mathcal{X}\rightarrow[-B,B]$是连续凸函数. $r, R > 0, \mathcal{X}$包含于一个半径为$R$的欧式球.
2.1 The center of gravity method
定理 2.1 重心法满足 $$f(x_t)=\min_{x\in\mathcal{X}}f(x)\leq 2B\left(1 - \dfrac{1}{e}\right)^{t/n}$$
To attain an “$\epsilon$-optimal point the center of gravity method requires $O(n\log(2B/\epsilon))$ queries to both the first and zeroth order oracles.