Meituan's In-Store R&D Platform / Data Intelligence Platform group and Professor An-An Liu's team at Tianjin University carried out a joint research project on "dish knowledge graph construction based on multimodal information extraction." The work uses cross-modal retrieval to recognize ingredients from both images and text, broadening the scope of multimodal dish-ingredient recognition and improving recognition accuracy. It contributes a cross-modal, ingredient-level dataset whose ingredient annotations and inter-ingredient relations help deepen the understanding of Chinese cuisine. The paper describing this work, "Toward Chinese Food Understanding: A Cross-Modal Ingredient-Level Benchmark," has been accepted by IEEE Transactions on Multimedia, one of the leading journals in the multimedia field.
Faster R-CNN is a classic two-stage object detection framework built on convolutional neural networks (CNNs). In the first stage, a CNN extracts feature maps from the input image, and a Region Proposal Network (RPN) generates candidate object regions. In the second stage, the network parameters are updated end to end under two constraints applied to those candidates: bounding-box regression and region-level ingredient classification. By contrast, YOLO (You Only Look Once) is a single-stage detector known for its speed and efficiency: unlike Faster R-CNN, it processes the entire image in a single pass, simultaneously predicting class probabilities and bounding boxes for multiple objects.
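To make the two-stage flow concrete, the snippet below runs an off-the-shelf Faster R-CNN from torchvision. This is a generic illustration rather than the food-specific detector described here, and `weights="DEFAULT"` assumes torchvision 0.13 or newer:

```python
# Minimal inference sketch of the two-stage pipeline with torchvision's
# pretrained Faster R-CNN (COCO classes, not dish ingredients).
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# Stage 1 happens inside the model: the backbone CNN extracts feature maps
# and the RPN proposes candidate regions. Stage 2 refines each candidate's
# box and classifies its content.
image = torch.rand(3, 480, 640)          # stand-in for a dish photo, values in [0, 1]
with torch.no_grad():
    predictions = model([image])          # one dict per input image

boxes = predictions[0]["boxes"]           # (N, 4) refined bounding boxes
labels = predictions[0]["labels"]         # (N,) predicted class indices
scores = predictions[0]["scores"]         # (N,) confidence scores
print(boxes.shape, labels.shape, scores.shape)
```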
3.1.2 DINO [1]
DINO (DETR with Improved deNoising anchOr boxes) is an end-to-end Transformer detection framework that combines contrastive denoising training, a mixed query selection scheme for anchor initialization, and a look-forward-twice scheme for box prediction. Compared with Faster R-CNN, DINO is a larger-capacity and more effective object detection model.
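The denoising training is DINO's most distinctive ingredient, so here is a conceptual sketch of how contrastive denoising queries can be built from ground-truth boxes. This illustrates the idea only, not DINO's actual implementation; the box format and the noise scales `lambda_pos` / `lambda_neg` are assumptions for illustration:

```python
# Conceptual sketch of contrastive denoising queries: ground-truth boxes are
# jittered slightly to form positive queries (the model should reconstruct
# the original box) and more heavily to form negative queries (the model
# should reject them as "no object").
import torch

def make_denoising_queries(gt_boxes: torch.Tensor,
                           lambda_pos: float = 0.1,
                           lambda_neg: float = 0.4):
    """gt_boxes: (N, 4) in normalized (cx, cy, w, h) format."""
    wh = gt_boxes[:, 2:].repeat(1, 2)                        # noise scaled by box size
    # Positive queries: small perturbation within +/- lambda_pos.
    pos = gt_boxes + (torch.rand_like(gt_boxes) * 2 - 1) * lambda_pos * wh
    # Negative queries: perturbation magnitude in [lambda_pos, lambda_neg].
    sign = torch.randint(0, 2, gt_boxes.shape).float() * 2 - 1
    mag = lambda_pos + torch.rand_like(gt_boxes) * (lambda_neg - lambda_pos)
    neg = gt_boxes + sign * mag * wh
    return pos.clamp(0, 1), neg.clamp(0, 1)

pos_q, neg_q = make_denoising_queries(torch.tensor([[0.5, 0.5, 0.2, 0.3]]))
```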
In the end-to-end setting, we first project the food image and its ingredient composition into a common embedding space, and then use a contrastive loss to enforce cross-modal feature alignment. For the image encoder, inspired by the success of vision-language Transformers on various downstream tasks [49]-[51], we adopt a pre-trained CLIP ViT B/16 as the image feature extractor, and then apply a linear fully connected layer to project the raw image features into the common embedding space.
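A minimal sketch of this alignment step is shown below, assuming pre-extracted 512-dimensional CLIP ViT B/16 image features and ingredient features from the text branch; the embedding dimension and temperature are illustrative choices, not the paper's hyperparameters:

```python
# Sketch: linear projection into a shared space plus a symmetric InfoNCE
# contrastive loss over a batch of matched image/ingredient pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 256
img_proj = nn.Linear(512, embed_dim)   # project CLIP image features
ing_proj = nn.Linear(512, embed_dim)   # project ingredient features

def contrastive_loss(img_feat, ing_feat, temperature=0.07):
    img = F.normalize(img_proj(img_feat), dim=-1)
    ing = F.normalize(ing_proj(ing_feat), dim=-1)
    logits = img @ ing.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img.size(0))           # i-th image matches i-th ingredients
    # Symmetric loss: image-to-ingredient and ingredient-to-image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```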
In this section, we re-implement several image backbones (ResNet-50, ViT B/16, and CLIP ViT B/16) and ingredient backbones (hierarchical Transformer and hierarchical LSTM) for performance comparison. We also run the two-stage experimental setting to validate the effectiveness of combining ingredient object detection with cross-modal ingredient retrieval. The results are reported in Table 4.3, where APS denotes the adaptive pooling strategy. Finally, in Table 4.4, we re-implement two state-of-the-art cross-modal recipe retrieval methods (TFood [19] and VLPCook [56]) to compare our proposed CMIngre with Recipe 1M [32].
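For context, cross-modal retrieval benchmarks of this kind are commonly scored with median rank and Recall@K over a pool of paired embeddings; the sketch below shows one standard way to compute these metrics (inputs are random placeholders, not results from the tables above):

```python
# Sketch of median rank (medR) and Recall@K for cross-modal retrieval:
# for each image query, rank all ingredient embeddings by similarity and
# record where the ground-truth match lands.
import torch
import torch.nn.functional as F

def retrieval_metrics(img_emb, ing_emb, ks=(1, 5, 10)):
    sims = img_emb @ ing_emb.t()                      # (N, N) similarity scores
    ranks = sims.argsort(dim=1, descending=True)      # sorted candidates per query
    gt = torch.arange(sims.size(0)).unsqueeze(1)
    pos = (ranks == gt).nonzero()[:, 1] + 1           # 1-based rank of the true match
    med_r = pos.float().median().item()
    recalls = {f"R@{k}": (pos <= k).float().mean().item() for k in ks}
    return med_r, recalls

emb = F.normalize(torch.randn(100, 256), dim=-1)
print(retrieval_metrics(emb, emb))   # perfect retrieval: medR = 1.0
```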
[1] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum, "DINO: DETR with improved denoising anchor boxes for end-to-end object detection," arXiv preprint arXiv:2203.03605, 2022, doi:10.48550/arXiv.2203.03605.
[2] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin Transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022, doi:10.1109/ICCV48922.2021.00986.
[3] X. Dai, Y. Chen, B. Xiao, D. Chen, M. Liu, L. Yuan, and L. Zhang, "Dynamic head: Unifying object detection heads with attentions," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7373–7382, doi:10.1109/CVPR46437.2021.00729.
[4] A.-A. Liu, H. Tian, N. Xu, W. Nie, Y. Zhang, and M. Kankanhalli, "Toward region-aware attention learning for scene graph generation," IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 12, pp. 7655–7666, 2021, doi:10.1109/TNNLS.2021.3086066.
[5] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh, "Graph R-CNN for scene graph generation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 670–685, doi:10.1007/978-3-030-01246-5_41.
[6] C. Liu, Z. Mao, T. Zhang, H. Xie, B. Wang, and Y. Zhang, "Graph structured network for image-text matching," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10921–10930, doi:10.1109/CVPR42600.2020.01093.
[7] H. Diao, Y. Zhang, L. Ma, and H. Lu, "Similarity reasoning and filtration for image-text matching," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, 2021, pp. 1218–1226, doi:10.1609/aaai.v35i2.16209.
[8] Y. Wang, Y. Su, W. Li, J. Xiao, X. Li, and A.-A. Liu, "Dual-path rare content enhancement network for image and text matching," IEEE Transactions on Circuits and Systems for Video Technology, 2023, doi:10.1109/TCSVT.2023.3254530.
[9] L. Bossard, M. Guillaumin, and L. Van Gool, "Food-101 – Mining discriminative components with random forests," in Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI. Springer, 2014, pp. 446–461, doi:10.1007/978-3-319-10599-4_29.
[10] J. Chen and C.-W. Ngo, "Deep-based ingredient recognition for cooking recipe retrieval," in Proceedings of the 24th ACM International Conference on Multimedia, 2016, pp. 32–41, doi:10.1145/2964284.2964315.
[11] W. Min, L. Liu, Z. Wang, Z. Luo, X. Wei, X. Wei, and S. Jiang, "ISIA Food-500: A dataset for large-scale food recognition via stacked global-local attention network," in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 393–401, doi:10.1145/3394171.3414031.
[12] W. Min, Z. Wang, Y. Liu, M. Luo, L. Kang, X. Wei, X. Wei, and S. Jiang, "Large scale visual food recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, doi:10.1109/TPAMI.2023.3237871.
[13] E. Aguilar, B. Remeseiro, M. Bolaños, and P. Radeva, "Grab, pay, and eat: Semantic food detection for smart restaurants," IEEE Transactions on Multimedia, vol. 20, no. 12, pp. 3266–3275, 2018, doi:10.1109/TMM.2018.2831627.
[14] R. Morales, J. Quispe, and E. Aguilar, "Exploring multi-food detection using deep learning-based algorithms," in 2023 IEEE 13th International Conference on Pattern Recognition Systems (ICPRS), 2023, pp. 1–7, doi:10.1109/ICPRS58416.2023.10179037.
[15] G. Ciocca, P. Napoletano, and R. Schettini, "Food recognition: A new dataset, experiments, and results," IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 3, pp. 588–598, 2016, doi:10.1109/JBHI.2016.2636441.
[16] A. Salvador, N. Hynes, Y. Aytar, J. Marin, F. Ofli, I. Weber, and A. Torralba, "Learning cross-modal embeddings for cooking recipes and food images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3020–3028, doi:10.1109/CVPR.2017.327.
[17] A. Salvador, E. Gundogdu, L. Bazzani, and M. Donoser, "Revamping cross-modal recipe retrieval with hierarchical transformers and self-supervised learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15475–15484, doi:10.1109/CVPR46437.2021.01522.
[18] M. Carvalho, R. Cadène, D. Picard, L. Soulier, N. Thome, and M. Cord, "Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings," in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 35–44, doi:10.1145/3209978.3210036.
[19] M. Shukor, G. Couairon, A. Grechka, and M. Cord, "Transformer decoders with multimodal regularization for cross-modal food retrieval," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022, pp. 4567–4578, doi:10.1109/CVPRW56347.2022.00503.
[20] H. Wang, D. Sahoo, C. Liu, K. Shu, P. Achananuparp, E.-P. Lim, and S. C. Hoi, "Cross-modal food retrieval: Learning a joint embedding of food images and recipes with semantic consistency and attention mechanism," IEEE Transactions on Multimedia, vol. 24, pp. 2515–2525, 2021, doi:10.1109/TMM.2021.3083109.
[21] M. Li, P.-Y. Huang, X. Chang, J. Hu, Y. Yang, and A. Hauptmann, "Video pivoting unsupervised multi-modal machine translation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3918–3932, 2023, doi:10.1109/TPAMI.2022.3181116.
[22] Chinese cuisine culture, last accessed on June 23, 2023.
[23] "Regulation of food composition data expression," https://www.chinanutri.cn/fgbz/fgbzhybz/201707/P020170721479798369359.pdf, last accessed on June 23, 2023.
[24] T. Joutou and K. Yanai, "A food image recognition system with multiple kernel learning," in 2009 16th IEEE International Conference on Image Processing (ICIP). IEEE, 2009, pp. 285–288, doi:10.1109/ICIP.2009.5413400.
[25] Y. Kawano and K. Yanai, "Food image recognition with deep convolutional features," in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, 2014, pp. 589–593, doi:10.1145/2638728.2641339.
[26] K. Yanai and Y. Kawano, "Food image recognition using deep convolutional network with pre-training and fine-tuning," in 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2015, pp. 1–6, doi:10.1109/ICMEW.2015.7169816.
[27] M. T. Turan and E. Erzin, "Domain adaptation for food intake classification with teacher/student learning," IEEE Transactions on Multimedia, vol. 23, pp. 4220–4231, 2020, doi:10.1109/TMM.2020.3038315.
[28] H. Liang, G. Wen, Y. Hu, M. Luo, P. Yang, and Y. Xu, "MVANet: Multi-task guided multi-view attention network for Chinese food recognition," IEEE Transactions on Multimedia, vol. 23, pp. 3551–3561, 2020, doi:10.1109/TMM.2020.3028478.
[29] J. He, L. Lin, H. A. Eicher-Miller, and F. Zhu, "Long-tailed food classification," Nutrients, vol. 15, no. 12, 2023, doi:10.3390/nu15122751.
[30] K. Aizawa, Y. Maruyama, H. Li, and C. Morikawa, "Food balance estimation by using personal dietary tendencies in a multimedia food log," IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 2176–2185, 2013, doi:10.1109/TMM.2013.2271474.
[31] J.-J. Chen, C.-W. Ngo, F.-L. Feng, and T.-S. Chua, "Deep understanding of cooking procedure for cross-modal recipe retrieval," in Proceedings of the 26th ACM International Conference on Multimedia, 2018, pp. 1020–1028, doi:10.1145/3240508.3240627.
[32] Y.-C. Lien, H. Zamani, and W. B. Croft, "Recipe retrieval with visual query of ingredients," in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 1565–1568, doi:10.1145/3397271.3401244.
[33] W. Min, B.-K. Bao, S. Mei, Y. Zhu, Y. Rui, and S. Jiang, "You are what you eat: Exploring rich recipe information for cross-region food analysis," IEEE Transactions on Multimedia, vol. 20, no. 4, pp. 950–964, 2017, doi:10.1109/TMM.2017.2759499.
[34] G. Ciocca, P. Napoletano, and R. Schettini, "Learning CNN-based features for retrieval of food images," in New Trends in Image Analysis and Processing – ICIAP 2017 International Workshops, Catania, Italy, September 11-15, 2017, Revised Selected Papers. Springer, 2017, pp. 426–434, doi:10.1007/978-3-319-70742-6_41.
[35] X. Chen, Y. Zhu, H. Zhou, L. Diao, and D. Wang, "ChineseFoodNet: A large-scale image dataset for Chinese food recognition," arXiv preprint arXiv:1705.02743, 2017, doi:10.48550/arXiv.1705.02743.
[36] S. Hou, Y. Feng, and Z. Wang, "VegFru: A domain-specific dataset for fine-grained visual categorization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 541–549, doi:10.1109/ICCV.2017.66.
[37] J. Qiu, F. P.-W. Lo, Y. Sun, S. Wang, and B. Lo, "Mining discriminative food regions for accurate food recognition," arXiv preprint arXiv:2207.03692, 2022, doi:10.48550/arXiv.2207.03692.
[38] J. Wang, X. Ding, and B. Guo, "High precision food detection method based on deep object detection network," in 2021 IEEE 5th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), vol. 5. IEEE, 2021, pp. 646–650, doi:10.1109/ITNEC52019.2021.9587189.
[39] S. Akti, M. Qaraqe, and H. K. Ekenel, "A mobile food recognition system for dietary assessment," in International Conference on Image Analysis and Processing. Springer, 2022, pp. 71–81, doi:10.1007/978-3-031-13321-3_7.
[40] Y. Matsuda, H. Hoashi, and K. Yanai, "Recognition of multiple-food images by detecting candidate regions," in 2012 IEEE International Conference on Multimedia and Expo. IEEE, 2012, pp. 25–30, doi:10.1109/ICME.2012.157.
[41] Y. Kawano and K. Yanai, "FoodCam-256: A large-scale real-time mobile food recognition system employing high-dimensional features and compression of classifier weights," in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 761–762, doi:10.1145/2647868.2654869.
[42] B. Muñoz, I. Chirino, and E. Aguilar, "Can deep learning models recognize Chilean diet?" IEEE Latin America Transactions, vol. 20, no. 9, pp. 2131–2138, 2022, doi:10.1109/TLA.2022.9878168.
[43] Y. Kawano and K. Yanai, "Automatic expansion of a food image dataset leveraging existing categories with domain adaptation," in Computer Vision – ECCV 2014 Workshops, 2015, pp. 3–17, doi:10.1007/978-3-319-16199-0_1.
[44] J. Chen, L. Pang, and C.-W. Ngo, "Cross-modal recipe retrieval: How to cook this dish?" in MultiMedia Modeling: 23rd International Conference, MMM 2017, Reykjavik, Iceland, January 4-6, 2017, Proceedings, Part I. Springer, 2017, pp. 588–600, doi:10.1007/978-3-319-51811-4_48.
[45] X. Li, J. Feng, Y. Meng, Q. Han, F. Wu, and J. Li, "A unified MRC framework for named entity recognition," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 5849–5859, doi:10.18653/v1/2020.acl-main.519.
[46] Y. Kawano and K. Yanai, "Automatic expansion of a food image dataset leveraging existing categories with domain adaptation," in Computer Vision – ECCV 2014 Workshops: Zurich, Switzerland, September 6-7 and 12, 2014, Proceedings, Part III. Springer, 2015, pp. 3–17, doi:10.1007/978-3-319-16199-0_1.
[47] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, vol. 28, 2015, doi:10.1109/TPAMI.2016.2577031.
[48] G. Jocher, "YOLOv5 by Ultralytics," 2020, doi:10.5281/zenodo.3908559. [Online]. Available: https://github.com/ultralytics/yolov5
[49] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763, doi:10.48550/arXiv.2103.00020.
[50] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., "Grounded language-image pre-training," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10965–10975, doi:10.48550/arXiv.2112.03857.
[51] H. Zhang, P. Zhang, X. Hu, Y.-C. Chen, L. Li, X. Dai, L. Wang, L. Yuan, J.-N. Hwang, and J. Gao, "GLIPv2: Unifying localization and vision-language understanding," Advances in Neural Information Processing Systems, vol. 35, pp. 36067–36080, 2022, doi:10.48550/arXiv.2206.05836.
[52] K.-H. Lee, X. Chen, G. Hua, H. Hu, and X. He, "Stacked cross attention for image-text matching," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 201–216, doi:10.1007/978-3-030-01225-0_13.
[53] J. Chen, H. Hu, H. Wu, Y. Jiang, and C. Wang, "Learning the best pooling strategy for visual semantic embedding," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, doi:10.1109/CVPR46437.2021.01553.
[54] J. Yang, J. Lu, D. Batra, and D. Parikh, "A faster PyTorch implementation of Faster R-CNN," https://github.com/jwyang/faster-rcnn.pytorch, 2017.
[55] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, "Bottom-up and top-down attention for image captioning and visual question answering," in CVPR, 2018, doi:10.1109/CVPR.2018.00636.
[56] M. Shukor, N. Thome, and M. Cord, "Vision and structured-language pretraining for cross-modal food retrieval," Available at SSRN 4511116, 2023, doi:10.48550/arXiv.2212.04267.
[57] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan et al., "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755, doi:10.1007/978-3-319-10602-1_48.
[58] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020, doi:10.48550/arXiv.2010.11929.
[59] Z. Zheng, L. Zheng, M. Garrett, Y. Yang, M. Xu, and Y.-D. Shen, "Dual-path convolutional image-text embeddings with instance loss," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 16, no. 2, 2020, doi:10.1145/3383184.