Learning from Multiple Teacher Networks http://library.usc.edu.ph/ACM/KKD%202017/pdfs/p1285.pdf

  • loss:
    • cross-entropy between the average of the teachers' softmax outputs and the student's prediction (the output-level terms are sketched in code after this entry)
    • relative dissimilarity of intermediate-layer representations (specific to MTKD): triplets $(q_i, q_i^+, q_i^-)$, where $q$ is an intermediate-layer output; the order relation $q_i^+ > q_i^-$ is determined by the distances $d$ of the two to $q_i$, and a parameter $w_s$ decides which layers are selected. Different teachers may induce different order relations on their intermediate outputs, so a vote decides which order relation the student should follow, and a loss on the student's corresponding layer encourages the student to exhibit relative similarity (dissimilarity) relations like those of the teachers' intermediate layers
    • cross-entropy between the student and the ground truth
  • Experimental setup: experiments on CIFAR-10, CIFAR-100, MNIST, SVHN
    • CIFAR-10: compare students of different depth and parameter count (11 layers/250K, 11/862K, 13/1.6M, 19/2.5M) in terms of compression rate, acceleration rate and classification accuracy (against FitNets)
    • CIFAR-10: with 11-layer students of 250K and 862K parameters and 1, 3 or 5 teachers, compare the accuracy of Teacher, RDL, FitNets and KD
    • CIFAR-10, CIFAR-100: compare the accuracy of different methods (Teacher (5 layers), FitNets, KD, Maxout Networks, Network in Network, Deeply-Supervised Networks and this method (19 layers)) on both datasets
    • MNIST: compare the accuracy of different methods (Teacher (4 layers), FitNets, KD, Maxout Networks, Network in Network, Deeply-Supervised Networks and this method (7 layers))
    • SVHN: compare the accuracy of different methods (Teacher (5 layers), FitNets, KD, Maxout Networks, Network in Network, Deeply-Supervised Networks and this method (19 layers))
  • @inproceedings{you2017learning, author={You, Shan and Xu, Chang and Xu, Chao and Tao, Dacheng}, booktitle={Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining}, title={Learning from multiple teacher networks}, pages={1285–1294}, year={2017} }
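  • A minimal PyTorch-style sketch of the output-level terms above (hard-label cross-entropy plus cross-entropy against the averaged teacher softmax); the intermediate-layer relative-dissimilarity term is omitted, and the temperature T and weight lam are assumed hyperparameters, not the paper's values:

    ```python
    import torch
    import torch.nn.functional as F

    def mtkd_output_loss(student_logits, teacher_logits_list, labels, T=2.0, lam=1.0):
        """Averaged teacher soft targets + ground-truth CE (sketch, not the exact paper loss)."""
        avg_teacher = torch.stack(
            [F.softmax(t / T, dim=-1) for t in teacher_logits_list]).mean(dim=0)
        kd = -(avg_teacher.detach() * F.log_softmax(student_logits / T, dim=-1)).sum(-1).mean()
        return F.cross_entropy(student_logits, labels) + lam * kd
    ```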

Semi-Supervised Knowledge Transfer for Deep Learning from Private Training Data. ICLR 2017 https://openreview.net/pdf?id=HkwoSDPgg

  • Learning algorithms should protect users' private data, but models memorize training data and are vulnerable to attacks (black-/white-box attacks).
  • Split the training data into n disjoint partitions, train a teacher model on each, and aggregate the teachers:
    • if most teachers agree on the output, the prediction does not depend on any single partition's training data
    • if two classes receive similar vote counts, the disagreement may leak private information, since a near-tie could be flipped by the data of a single partition
    • inject random noise into the vote counts (the noisy aggregation is sketched after this entry)
  • The student is trained semi-supervised: part of the public data is labeled by querying the teachers (which were trained on the private data), and the remaining public data stays unlabeled. The student is trained as a GAN, with the discriminator given one extra class (the (m+1)-th, for generator samples); after training only the discriminator is used.
  • Experimental setup:
    • MNIST: teacher – 2 conv + 1 relu; student – GANs (6 fc layers); public data for the student – test[:1000]; testing data – test[1000:]
    • SVHN: teacher – 2 conv + 2 relu; student – GANs (7 conv + 2 NIN); public data for the student – test[:1000]; testing data – test[1000:]
    • UCI Adult: teacher – RF (100 trees); student – RF (100 trees); public data for the student – test[:500]; testing data – test[500:]
    • UCI Diabetes: teacher – RF (100 trees); student – RF (100 trees); public data for the student – test[:500]; testing data – test[500:]
  • @article{Papernot2017SemisupervisedKT, title={Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data}, author={Nicolas Papernot and Mart{\'i}n Abadi and {\'U}lfar Erlingsson and Ian J. Goodfellow and Kunal Talwar}, journal={arXiv preprint}, year={2016} }
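  • A minimal sketch of the noisy vote aggregation, assuming each teacher contributes one hard-label vote per query; the Laplace scale parameter gamma is an assumed knob trading accuracy against privacy:

    ```python
    import numpy as np

    def noisy_aggregate(teacher_preds, num_classes, gamma=0.05, rng=None):
        """Count teacher votes, perturb the counts with Laplace noise, return the noisy argmax."""
        rng = np.random.default_rng() if rng is None else rng
        votes = np.bincount(teacher_preds, minlength=num_classes).astype(float)
        votes += rng.laplace(loc=0.0, scale=1.0 / gamma, size=num_classes)
        return int(np.argmax(votes))
    ```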

Knowledge Adaptation: Teaching to Adapt. ICLR, 2017 https://openreview.net/pdf?id=rJRhzzKxl

  • The teachers' domains and the student's domain, $\mathcal{D}_{T_i}$ and $\mathcal{D}_S$, do not fully coincide, so the student's trust in each teacher depends on the similarity of their representation spaces: $\mathcal{L} = \mathcal{H}\big(\sum_{i} \mathrm{sim}(\mathcal{D}_{T_i}, \mathcal{D}_S)\cdot D_{T_i},\; D_S\big)$ (a sketch of this weighted combination follows this entry)
  • An MCD measure is defined; a larger MCD means the sample lies farther from the decision boundary, so the teachers' predictions on it are more confident. Taking the n samples with the largest MCD (i.e., highest confidence) and using the teachers' predictions on them as pseudo-labels to train the student improves unsupervised performance.
  • Experimental setup: based on the Amazon product reviews sentiment analysis dataset, with four domains: Book, DVD, Electronics, Kitchen.
    • Compare, on the four domains, the teacher, a student taught by a teacher trained on same-domain samples, a student taught by a teacher trained on all samples, a student taught by a combination of both kinds of teachers, and many other models (SCL, SFA, SCL-com, SFA-com, SST, IDDIWP, DWHC, DAM, CP-MDA, SDAMS-SVM, SDAMS-Log).
    • Compare, under different methods, a student whose domain is one held-out category, taught by teachers trained on the other three domains (B$\rightarrow$D, E$\rightarrow$D, K$\rightarrow$D, rotated in turn).
  • @article{ruder2017knowledge, title={Knowledge adaptation: Teaching to adapt}, author={Ruder, Sebastian and Ghaffari, Parsa and Breslin, John G}, journal={arXiv preprint arXiv:1702.02052}, year={2017} }
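  • A hypothetical sketch of the similarity-weighted teacher combination above; the function names are illustrative and the similarity measure is not specified in these notes:

    ```python
    import torch
    import torch.nn.functional as F

    def similarity_weighted_kd_loss(student_logits, teacher_probs, domain_sims):
        """Cross-entropy between the student and a similarity-weighted mixture of teacher outputs.
        teacher_probs: (n_teachers, batch, n_classes); domain_sims: (n_teachers,) tensor."""
        w = domain_sims / domain_sims.sum()                   # trust weight per teacher
        target = torch.einsum('t,tbc->bc', w, teacher_probs)  # weighted soft target
        return -(target * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
    ```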

Deep Model Compression: Distilling Knowledge from Noisy Teachers. Sau, Bharat Bhusan et al. arXiv:1610.09650 https://arxiv.org/pdf/1610.09650.pdf

  • Perturb the teacher's outputs (the student's targets), which is equivalent to noise-based regularization; usable for model compression (a sketch follows this entry).
  • Experimental setup: based on MNIST, SVHN, CIFAR-10.
    • MNIST: teacher – a modified network of LeNet([C5(S1P0)@20-MP2(S2)]- [C5(S1P0)@50-MP2(S2)]- FC500- FC10); student – FC800-FC800-FC10
    • SVHN: Network-in-Network([C5(S1P2)@192]- [C1(S1P0)@160]- [C1(S1P0)@96-MP3(S2)]- D0.5- [C5(S1P2)@192]- [C1(S1P0)@192]- [C1(S1P0)@192- AP3(S2)]- D0.5- [C3(S1P1)@192]- [C1(S1P0)@192]- [C1(S1P0)@10]- AP8(S1)); student: LeNet([C5(S1P2)@32-MP3(S2)]- [C5(S1P2)@64-MP3(S2)]- FC1024-FC10)
    • CIFAR-10: teacher: same as SVHN; student: a modified version of the LeNet([C5(S1P2)@64-MP2(S2)]- [C5(S1P2)@128- MP2(S2)]-FC1024-FC10).
  • @article{sau2016deep, title={Deep model compression: Distilling knowledge from noisy teachers}, author={Sau, Bharat Bhusan and Balasubramanian, Vineeth N}, journal={CoRR}, year={2016} }
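  • A minimal sketch under the assumption of a standard softened distillation loss with additive Gaussian perturbation applied to a random fraction of the teacher targets; the paper's exact perturbation and loss form may differ:

    ```python
    import torch
    import torch.nn.functional as F

    def noisy_teacher_kd_loss(student_logits, teacher_logits, T=4.0, noise_std=0.1, noise_prob=0.5):
        """Perturb the teacher targets for a random subset of samples, then apply soft-target KD."""
        noise = torch.randn_like(teacher_logits) * noise_std
        mask = (torch.rand(teacher_logits.size(0), 1, device=teacher_logits.device) < noise_prob).float()
        soft_target = F.softmax((teacher_logits + mask * noise) / T, dim=-1)
        return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        soft_target, reduction='batchmean') * T * T
    ```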

Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Tarvainen, Antti and Valpola, Harri. NeurIPS 2017 https://papers.nips.cc/paper/2017/file/68053af2923e00204c3ca7c6a3150cf7-Paper.pdf

  • Algorithm:
    • build an ordinary supervised model;
    • make a copy of it; the original is called the student, the copy the teacher;
    • at every step feed the same minibatch to both student and teacher, but apply independent random augmentation or noise to the inputs;
    • add a consistency loss between the student's and the teacher's outputs;
    • the optimizer updates only the student's weights;
    • after each step, update the teacher's weights with an exponential moving average (EMA) of the student's weights (see the sketch after this entry);
  • Core idea: the model acts as both student and teacher. As a teacher it produces the targets the student learns from; as a student it learns from those targets. The teacher's parameters are a weighted average of the student's parameters over the preceding steps.
  • It can be viewed as the Π-model with the two forward passes done by two different models, one called teacher and one called student; it can also be viewed as an improved Temporal Ensembling: Temporal Ensembling aggregates history with a per-epoch exponential moving average of predictions, whereas Mean Teacher takes an exponential moving average of the student's weights at every training step.
  • Experimental setup: based on SVHN and CIFAR-10
    • All the methods in the comparison use a similar 13-layer ConvNet architecture.
  • @inproceedings{10.5555/3294771.3294885, title = {Mean Teachers Are Better Role Models: Weight-Averaged Consistency Targets Improve Semi-Supervised Deep Learning Results}, author = {Tarvainen, Antti and Valpola, Harri}, booktitle = {Proceedings of the 31st International Conference on Neural Information Processing Systems}, pages = {1195–1204}, year = {2017} }
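  • A minimal sketch of the two core pieces (EMA weight update and consistency cost); the consistency-weight ramp-up and the choice between MSE and KL for the consistency cost are omitted:

    ```python
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def ema_update(teacher, student, alpha=0.99):
        """Teacher weights become an exponential moving average of the student weights."""
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)

    def consistency_loss(student_logits, teacher_logits):
        """MSE between the two softmax outputs, one common choice of consistency cost."""
        return F.mse_loss(F.softmax(student_logits, dim=-1),
                          F.softmax(teacher_logits.detach(), dim=-1))
    ```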

Born-Again Neural Networks. Furlanello, Tommaso et al. ICML 2018 https://proceedings.mlr.press/v80/furlanello18a/furlanello18a.pdf

  • The original loss is replaced/regularized with a KD term based on the cross-entropy between the new model's outputs and the original model's outputs.
  • Training order of the sequence of born-again "selves": $\mathcal{L}\big(f(x, \arg\min_{\theta_{k-1}}\mathcal{L}(f(x, \theta_{k-1}))), f(x,\theta_k)\big)$; the knowledge learned by the previous student serves as the supervision that teaches the next student (a per-generation loss is sketched after this entry).
  • Experimental setup:
    • CIFAR-10: Wide-ResNet with different depth and width (28-1, 28-2, 28-5, 28-10) and DenseNet of different depth and growth factor (112-33, 90-60, 80-80, 80-120)
    • CIFAR-100: same as above.
  • @inproceedings{Furlanello2018BornAN, title={Born Again Neural Networks}, author={Tommaso Furlanello and Zachary Chase Lipton and Michael Tschannen and Laurent Itti and Anima Anandkumar}, booktitle={ICML}, year={2018} }
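  • A sketch of a plausible generation-k objective (hard-label CE plus distillation against the frozen generation-(k-1) student); the temperature T and weight lam are assumptions:

    ```python
    import torch.nn.functional as F

    def ban_loss(student_logits, prev_student_logits, labels, T=1.0, lam=1.0):
        """First generation: plain CE; later generations add CE against the previous
        student's softened outputs (the previous model is frozen/detached)."""
        ce = F.cross_entropy(student_logits, labels)
        if prev_student_logits is None:
            return ce
        soft_target = F.softmax(prev_student_logits.detach() / T, dim=-1)
        kd = -(soft_target * F.log_softmax(student_logits / T, dim=-1)).sum(dim=-1).mean()
        return ce + lam * kd
    ```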

Deep Mutual Learning. Zhang, Ying et al. CVPR 2018 https://openaccess.thecvf.com/content_cvpr_2018/papers/Zhang_Deep_Mutual_Learning_CVPR_2018_paper.pdf

  • Two (or more) students learn from each other; for each student the loss is the cross-entropy with the ground truth plus the KL divergence from the other student's softmax output: $\mathcal{L}_{\theta_1} = \mathcal{L}_{C_1} + D_{KL}(p_2\|p_1)$ and $\mathcal{L}_{\theta_2} = \mathcal{L}_{C_2} + D_{KL}(p_1\|p_2)$; each network is updated by gradient descent on its own loss with learning rate $\gamma_t$ (a sketch follows this entry).
  • Advantages:
    • performance improves as more student networks are added
    • it applies to a wide variety of networks, including networks of different sizes
    • even very large networks gain from the mutual learning strategy
  • Experimental setup:
    • datasets: ImageNet, CIFAR-10, CIFAR-100, Market-1501
    • Networks (parameter counts): ResNet-32 (0.5M), MobileNet (3.3M), InceptionV1 (7.8M), WRN-28-10 (36.5M)
  • @inproceedings{8578552, title = {Deep Mutual Learning}, author = {Y. Zhang and T. Xiang and T. M. Hospedales and H. Lu}, booktitle = {2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, pages = {4320-4328}, year = {2018} }
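  • A sketch of the pair of mutual-learning objectives (each student's peer distribution is detached so its KL term only trains that student); the paper updates the networks alternately with fresh forward passes, which is simplified here:

    ```python
    import torch.nn.functional as F

    def dml_losses(logits1, logits2, labels):
        """CE with the ground truth plus KL toward the other student's softmax output."""
        p1, p2 = F.softmax(logits1, dim=-1), F.softmax(logits2, dim=-1)
        loss1 = F.cross_entropy(logits1, labels) + \
                F.kl_div(F.log_softmax(logits1, dim=-1), p2.detach(), reduction='batchmean')
        loss2 = F.cross_entropy(logits2, labels) + \
                F.kl_div(F.log_softmax(logits2, dim=-1), p1.detach(), reduction='batchmean')
        return loss1, loss2  # step each student's optimizer with its own loss
    ```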

Data Distillation: Towards Omni-Supervised Learning. Radosavovic, Ilija et al. CVPR 2018 https://openaccess.thecvf.com/content_cvpr_2018/papers/Radosavovic_Data_Distillation_Towards_CVPR_2018_paper.pdf

  • Model distillation vs. data distillation: the former ensembles the outputs of different models on the same sample; the latter ensembles the outputs of a single model on different transformations of the same sample.
  • Method:
    • train model A on manually labeled data
    • apply model A to data-augmented (here: scaled and horizontally flipped) unlabeled data
    • ensemble the multiple predictions for each unlabeled sample into labels (a classification-style sketch follows this entry)
    • retrain the model on the union of the manually and automatically labeled data
  • Experiments: validated on COCO keypoint detection and object detection; both teacher and student are a Mask R-CNN keypoint detection variant
  • @inproceedings{inproceedings, title = {Data Distillation: Towards Omni-Supervised Learning}, author = {Radosavovic, Ilija and Dollar, Piotr and Girshick, Ross and Gkioxari, Georgia and He, Kaiming}, year = {2018}, doi = {10.1109/CVPR.2018.00433}, pages = {4119-4128} }
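  • An illustrative, classification-style sketch of the multi-transform ensembling (the paper operates on keypoint/box predictions, where the geometric transforms must also be inverted); `transforms` is an assumed list of batch-level augmentations and `thresh` an assumed confidence cutoff:

    ```python
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def data_distillation_labels(model, images, transforms, thresh=0.9):
        """Average the model's predictions over several input transforms and keep only
        confident ensembled predictions as automatically generated labels."""
        probs = torch.stack([F.softmax(model(t(images)), dim=-1) for t in transforms]).mean(0)
        conf, pseudo = probs.max(dim=-1)
        keep = conf >= thresh
        return pseudo[keep], keep
    ```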

Multilingual Neural Machine Translation with Knowledge Distillation. ICLR 2019 https://openreview.net/pdf?id=S1gUsoR9YX

  • The method is straightforward: first train a separate translation model for each language pair as a teacher, then train a single multilingual student with multi-teacher KD; the loss is the student's cross-entropy with the labels plus the cross-entropy with the teachers' softmax outputs.
  • Experimental setup: datasets – IWSLT, WMT, Ted Talk; both student and teachers use the Transformer
    • IWSLT and Ted Talk tasks: model hidden size $d_{model}=256$, feed-forward hidden size $d_{ff}=1024$, 2 layers
    • WMT task: $d_{model}=512$, $d_{ff}=2048$, 6 layers
  • @article{Tan2019MultilingualNM, title={Multilingual Neural Machine Translation with Knowledge Distillation}, author={Xu Tan and Yi Ren and Di He and Tao Qin and Zhou Zhao and Tie-Yan Liu}, journal={ICLR}, year={2019}, volume={abs/1902.10461} }

Unifying Heterogeneous Classifiers with Distillation. Vongkulbhisal et al. CVPR 2019 https://openaccess.thecvf.com/content_CVPR_2019/papers/Vongkulbhisal_Unifying_Heterogeneous_Classifiers_With_Distillation_CVPR_2019_paper.pdf

  • N models $\mathcal{C} = \{C_i\}_{i=1}^N$ have different architectures and different target class sets; each $C_i$ is trained to predict $p_i(Y=l_j)$ over its own classes, and these are combined into a probability $q(Y=l_j)$ over the union of all classes; $q$ is then used to train the student.
  • The authors propose a method based on cross-entropy minimization and matrix factorization to estimate soft labels over all classes from unlabeled samples.
  • Experimental setup:
    • datasets: ImageNet, LSUN, Places365
    • each $C_i$ is chosen at random from AlexNet, VGG16, ResNet18, ResNet34
  • @article{Vongkulbhisal2019UnifyingHC, title={Unifying Heterogeneous Classifiers With Distillation}, author={Jayakorn Vongkulbhisal and Phongtharin Vinayavekhin and Marco Visentini Scarzanella}, journal={2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, year={2019}, pages={3170-3179} }

Distilled Person Re-Identification: Towards a More Scalable System. Wu, Ancong et al. CVPR 2019 https://openaccess.thecvf.com/content_CVPR_2019/papers/Wu_Distilled_Person_Re-Identification_Towards_a_More_Scalable_System_CVPR_2019_paper.pdf

  • Addresses three costs: labeling cost (fewer labels needed), cross-dataset cost (exploit prior knowledge), and testing cost (use a lightweight network).
  • Suppose the target domain contains images of 10 identities. Several teacher models are first trained on several source domains, and the source domains are not used afterwards (exploiting prior knowledge lowers the cross-dataset cost). The target domain may contain only 10 labelled samples (one per identity), the rest being unlabelled. For N unlabelled inputs, define a similarity matrix $A$ whose entry $(i,j)$ is the similarity between image $i$ and image $j$ under one model's outputs. To transfer knowledge from teacher to student, minimize the distance between the teacher's similarity matrix $A_T$ and the student's similarity matrix $A_S$ (this describes learning from a single teacher).
  • For every $x$ in the target domain, compute feature vectors with each teacher and the corresponding similarity matrices $A$; use $L_{ver}$ to update each teacher's weight $a$ (intuitively, a larger weight means that teacher's source domain is more similar to the target).
  • Compute the discrepancy between each teacher's similarity matrix and the student's, weight these discrepancies by the weights above to obtain $L_{ta}$, update the student with $L_{ta}$, and iterate (a sketch of the similarity-matrix matching follows this entry).
  • Experimental setup: datasets – Market-1501, DukeMTMC. Five teacher models $T_1, T_2, T_3, T_4, T_5$ are trained on MSMT17, CUHK03, ViPER, DukeMTMC and Market-1501, respectively; teacher – the advanced Re-ID model PCB, student – the lightweight model MobileNetV2.
  • @InProceedings{Wu_2019_CVPR, author = {Wu, Ancong and Zheng, Wei-Shi and Guo, Xiaowei and Lai, Jian-Huang}, title = {Distilled Person Re-Identification: Towards a More Scalable System}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2019} }
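  • A hypothetical sketch of the weighted similarity-matrix matching ($L_{ta}$); cosine similarity, the softmax over teacher weights and the MSE distance are assumptions, and the verification loss $L_{ver}$ that learns the weights is omitted:

    ```python
    import torch
    import torch.nn.functional as F

    def similarity_matrix(features):
        """Pairwise cosine-similarity matrix A for a batch of feature vectors."""
        f = F.normalize(features, dim=-1)
        return f @ f.t()

    def weighted_similarity_kd(student_feats, teacher_feats_list, teacher_weights):
        """Weighted distance between the student's similarity matrix and each teacher's."""
        A_s = similarity_matrix(student_feats)
        w = torch.softmax(torch.as_tensor(teacher_weights, dtype=A_s.dtype), dim=0)
        loss = 0.0
        for a_i, t_feats in zip(w, teacher_feats_list):
            A_t = similarity_matrix(t_feats).detach()
            loss = loss + a_i * F.mse_loss(A_s, A_t)
        return loss
    ```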

Diversity with Cooperation: Ensemble Methods for Few-Shot Classification. Dvornik, Nikita et al. ICCV 2019 https://openaccess.thecvf.com/content_ICCV_2019/papers/Dvornik_Diversity_With_Cooperation_Ensemble_Methods_for_Few-Shot_Classification_ICCV_2019_paper.pdf

  • meta learning – learning to learn
  • Three possible relations between models: cooperation (their predicted probabilities largely agree, on the correct class as well as on the incorrect ones), independence (no obvious relation between their predictions), and diversity (beyond the correct class, their probabilities on the other classes differ markedly).
  • Besides the cross-entropy loss, the paper designs additional loss terms $\psi(y_i, f_{\theta_j}(x_i), f_{\theta_l}(x_i))$, e.g. based on cosine similarity or KL divergence, to push the relations among the models in different directions.
  • Experimental setup:
    • datasets: mini-ImageNet, tiered-ImageNet, Caltech-UCSD Birds (CUB) 200-2011.
    • ensemble of ResNet18 and WideResNet28
  • @INPROCEEDINGS{9010380, title={Diversity With Cooperation: Ensemble Methods for Few-Shot Classification}, author={Dvornik, Nikita and Mairal, Julien and Schmid, Cordelia}, booktitle={2019 IEEE/CVF International Conference on Computer Vision (ICCV)}, pages={3722-3730}, year={2019} }

Model Compression with Two-stage Multi-teacher Knowledge Distillation for Web Question Answering System. Yang, Ze et al. WSDM 2020 https://arxiv.org/pdf/1910.08381.pdf

  • A two-stage multi-teacher knowledge distillation (TMKD) method for web question answering: the student is first pre-trained on a general Q&A distillation task (apparently also multi-teacher), and this pre-trained student is then fine-tuned with multi-teacher KD on downstream tasks (e.g. the Web Q&A task, MNLI, SNLI, and RTE from GLUE).
  • The "early calibration" effect mitigates the over-fitting bias introduced by any single teacher.
  • Experimental setup:
    • datasets – DeepQA, CommQA-Unlabeled, CommQA-Labeled, MNLI, SNLI, QNLI, RTE.
    • baselines: teachers – BERT-3, BERT_{large}, BERT_{large} ensemble; students (traditional distillation models) – Bi-LSTM (1-o-1, 1_{avg}-o-1, m-o-m), BERT-3 (1-o-1, 1_{avg}-o-1, m-o-m); students (TMKD) – Bi-LSTM (TMKD), TMKD_{base}, TMKD_{large} (the latter two are BERT-3 models).
  • @inproceedings{inproceedings, author = {Yang, Ze and Shou, Linjun and Gong, Ming and Lin, Wutao and Jiang, Daxin}, title = {Model Compression with Two-stage Multi-teacher Knowledge Distillation for Web Question Answering System}, publisher = {Association for Computing Machinery}, doi = {10.1145/3336191.3371792}, pages = {690-698}, year = {2020} }

FEED: Feature-level Ensemble for Knowledge Distillation. Park, SeongUk and Kwak, Nojun. AAAI 2020 https://openreview.net/pdf?id=BJxYEsAqY7

  • The method simply has the student regress the teacher's features directly (a minimal sketch follows this entry).
  • Experimental setup: dataset – CIFAR-100; models: student – ResNet-56, ResNet-110, WRN28-10, ResNeXt29-16x64d; it is not stated which network serves as the teacher.
  • @article{Park2019FEEDFE, title={FEED: Feature-level Ensemble for Knowledge Distillation}, author={Seonguk Park and Nojun Kwak}, journal={ECAI}, year={2019}, volume={abs/1909.10754} }
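  • An illustrative sketch of feature-level distillation (a student feature map regressed onto one or more teacher feature maps through a small adapter); the 1x1-conv adapter, the choice of layer and the L2 distance are assumptions rather than the paper's exact design:

    ```python
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureKD(nn.Module):
        """Map the student's feature map to the teachers' channel width and regress onto each teacher."""
        def __init__(self, s_channels, t_channels):
            super().__init__()
            self.adapter = nn.Conv2d(s_channels, t_channels, kernel_size=1)

        def forward(self, student_feat, teacher_feats):
            s = self.adapter(student_feat)
            return sum(F.mse_loss(s, t.detach()) for t in teacher_feats) / len(teacher_feats)
    ```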

Stochasticity and Skip Connection Improve Knowledge Transfer. Lee, Kwangjin et al. ICLR 2020 https://openreview.net/pdf?id=HklA93NYwS

  • Generate multiple teacher networks from a single teacher (by inserting stochastic blocks and skip connections) and use them to train the student; a block-structured network with skip connections can be viewed as a tree-like network with multiple paths from input to output.
  • Experimental setup: datasets – CIFAR-100 and Tiny ImageNet; the method is applied to KD, AT (attention transfer) and ML. Teachers used in the experiments: ResNet 32, ResNet 110, WRN 28-10, MobileNet, WRN 40-4; students: VGG 13, ResNet 20, ResNet 32, WRN 40-4.
  • @INPROCEEDINGS{9287227, author={Nguyen, Luong Trung and Lee, Kwangjin and Shim, Byonghyo}, title={Stochasticity and Skip Connection Improve Knowledge Transfer}, booktitle={2020 28th European Signal Processing Conference (EUSIPCO)}, pages={1537-1541}, year={2021} }

Hydra: Preserving Ensemble Diversity for Model Distillation. Tran, Linh et al. arXiv:2001.04694 http://www.gatsby.ucl.ac.uk/~balaji/udl2020/accepted-papers/UDL2020-paper-026.pdf

  • Ordinary multi-teacher KD averages the teachers' predictions, which discards the uncertainty information carried by the ensemble of teacher outputs (?). This paper splits the student into a shared body and multiple heads, one head per teacher, to preserve the diversity of the teachers' outputs (a sketch follows this entry).
  • With M teachers, first train a single head until it converges to the teacher average, then add the other M-1 heads and train all M heads jointly; the experiments show it is hard to converge without that first head. The authors define a model uncertainty composed of the data uncertainty and the total uncertainty (I do not understand why it is in this order; the usual decomposition is model/epistemic uncertainty = total uncertainty minus expected data/aleatoric uncertainty).
  • Experimental setup:
    • datasets: a spiral toy dataset (used to visualize and explain model uncertainty), MNIST (tested on its test set and on Fashion-MNIST), CIFAR-10 (tested on its test set, a cyclically translated test set, 80 different corrupted test sets, and SVHN).
    • models: toy dataset – a two-layer MLP with 100 units per layer; MNIST – MLP; CIFAR-10 – ResNet-20 V1. For the regression problems an MLP is used on all datasets.
  • @article{DBLP:journals/corr/abs-2001-04694, author = {Linh Tran and Bastiaan S. Veeling and Kevin Roth and Jakub Swiatkowski and Joshua V. Dillon and Jasper Snoek and Stephan Mandt and Tim Salimans and Sebastian Nowozin and Rodolphe Jenatton}, title = {Hydra: Preserving Ensemble Diversity for Model Distillation}, journal = {CoRR}, year = {2020} }
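  • A sketch of a Hydra-style student with a shared body and one small head per teacher; the linear heads and the per-head KL objective are illustrative assumptions:

    ```python
    import torch.nn as nn
    import torch.nn.functional as F

    class HydraStudent(nn.Module):
        """Shared body plus one head per teacher; each head mimics its own teacher."""
        def __init__(self, body, feat_dim, n_classes, n_teachers):
            super().__init__()
            self.body = body
            self.heads = nn.ModuleList(nn.Linear(feat_dim, n_classes) for _ in range(n_teachers))

        def forward(self, x):
            z = self.body(x)
            return [head(z) for head in self.heads]  # one logit vector per teacher

    def hydra_loss(head_logits, teacher_probs):
        """Each head matches the predictive distribution of the corresponding teacher."""
        loss = 0.0
        for logits, p_t in zip(head_logits, teacher_probs):
            loss = loss + F.kl_div(F.log_softmax(logits, dim=-1), p_t.detach(),
                                   reduction='batchmean')
        return loss / len(head_logits)
    ```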

Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition. Gao, Yan et al. arXiv:2005.09310 https://arxiv.org/pdf/2005.09310v1.pdf

  • @article{DBLP:journals/corr/abs-2005-09310, author = {Yan Gao and Titouan Parcollet and Nicholas D. Lane}, title = {Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition}, journal = {CoRR}, year = {2020} }

Temporal Self-Ensembling Teacher for Semi-Supervised Object Detection. Chen, Cong et al. IEEE 2020 [code]

Dual-Teacher: Integrating Intra-domain and Inter-domain Teachers for Annotation-efficient Cardiac Segmentation. MICCAI 2020

Knowledge Distillation for Multi-task Learning. Li, WeiHong & Bilen, Hakan. arXiv:2007.06889 [project]

Adaptive Multi-Teacher Multi-level Knowledge Distillation. Liu, Yuang et al. Neurocomputing 2020 [code] https://arxiv.org/pdf/2103.04062.pdf

  • loss: $\mathcal{L} = \mathcal{L}_{KD}+\alpha\mathcal{L}_{angle}+\beta\mathcal{L}_{HT}$
    • $\mathcal{L}_{KD}$: for each sample the student assigns a different weight to each teacher's output; the student's pre-fc representation, after max pooling, is dotted with each teacher's pre-fc representation to obtain the weights, and the weighted sum of the teacher outputs becomes the weighted target. $\mathcal{L}_{KD}$ is the KL divergence between this weighted target and the student's soft target, plus the cross-entropy between the student's output and the ground truth (the instance-level weighting is sketched after this entry).
    • $\mathcal{L}_{angle}$: for triplets of samples, compute the relative spatial (angle-based) relation of their representations in the teacher and in the student, and take the Huber loss between the two as $\mathcal{L}_{angle}$.
    • $\mathcal{L}_{HT}$: the L2 norm of the difference between the teacher's and the student's intermediate-layer representations; the student's intermediate layer is first passed through a single-layer FitNet-style regressor so that its size matches the teacher's.
  • Experimental setup: datasets are CIFAR-10, CIFAR-100 and Tiny-ImageNet.
    • CIFAR-10, CIFAR-100: teachers – ResNet110, VGG-19, DenseNet121; student – ResNet20. Compare different baselines (OKD, FitNet, RKD, AvgMKD, DML) on the datasets, and compare the baselines (OKD, AvgMKD, DML) with 2, 3 and 5 teachers.
    • Tiny-ImageNet: teachers – ResNet110, ResNet56, ResNet32; student – ResNet20.
  • @article{LIU2020106, author={Yuang Liu and W. Zhang and Jijie Wang}, title = {Adaptive multi-teacher multi-level knowledge distillation}, journal = {Neurocomputing}, pages = {106-113}, year = {2020} }
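  • A sketch of the instance-level teacher weighting in $\mathcal{L}_{KD}$ above; the pooling/normalization details and the temperature T are assumptions:

    ```python
    import torch
    import torch.nn.functional as F

    def adaptive_weighted_target(student_feat, teacher_feats, teacher_logits, T=4.0):
        """Per-sample teacher weights from dot products between pooled student and teacher
        representations; the weighted sum of softened teacher outputs is the KD target."""
        scores = torch.stack([(student_feat * t_f).sum(dim=-1) for t_f in teacher_feats], dim=0)
        weights = torch.softmax(scores, dim=0)                          # normalize over teachers
        probs = torch.stack([F.softmax(t_l / T, dim=-1) for t_l in teacher_logits])
        return torch.einsum('tb,tbc->bc', weights, probs)               # weighted soft target

    def amtml_kd_loss(student_logits, weighted_target, labels, T=4.0):
        """KL toward the weighted target plus ground-truth cross-entropy."""
        kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                      weighted_target.detach(), reduction='batchmean') * T * T
        return kd + F.cross_entropy(student_logits, labels)
    ```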