Please use this identifier to cite or link to this item: https://doi.org/10.1145/3343031.3350870
DC Field: Value
dc.title: Learning Using Privileged Information for Food Recognition
dc.contributor.author: Lei Meng
dc.contributor.author: Long Chen
dc.contributor.author: Xun Yang
dc.contributor.author: Dacheng Tao
dc.contributor.author: Hanwang Zhang
dc.contributor.author: Chunyan Miao
dc.date.accessioned: 2020-05-05T03:44:24Z
dc.date.available: 2020-05-05T03:44:24Z
dc.date.issued: 2019-10-21
dc.identifier.citation: Lei Meng, Long Chen, Xun Yang, Dacheng Tao, Hanwang Zhang, Chunyan Miao (2019-10-21). Learning Using Privileged Information for Food Recognition. ACM MM 2019: 557-565. ScholarBank@NUS Repository. https://doi.org/10.1145/3343031.3350870
dc.identifier.isbn: 9781450368896
dc.identifier.uri: https://scholarbank.nus.edu.sg/handle/10635/167714
dc.description.abstract: Food recognition for user-uploaded images is crucial in visual diet tracking, an emerging application linking the multimedia and healthcare domains. However, it is challenging due to the varied visual appearance of food images, caused by the different conditions under which photos are taken, such as angles, distances, lighting, food containers, and background scenes. To alleviate this semantic gap, this paper presents a cross-modal alignment and transfer network (ATNet), motivated by the paradigm of learning using privileged information (LUPI). It additionally utilizes the ingredients in food images as an “intelligent teacher” during training to facilitate cross-modal information passing. Specifically, ATNet first uses a pair of synchronized autoencoders to build the base image and ingredient channels for information flow. Subsequently, information passing is enabled through a two-stage cross-modal interaction. The first stage adopts a two-step method, called partial heterogeneous transfer, to 1) alleviate the intrinsic heterogeneity between images and ingredients and 2) align them in a shared space so that the information they carry about food classes can interact. In the second stage, ATNet learns to map the visual embeddings of images to the ingredient channel, recognizing food from the view of the “teacher”. This leads to a refined recognition via multi-view fusion. Experiments on two real-world datasets show that ATNet can be incorporated with any state-of-the-art CNN model to consistently improve its performance. © 2019 Association for Computing Machinery.
dc.subject: Cross-modal fusion
dc.subject: Food recognition
dc.subject: Heterogeneous feature alignment
dc.subject: Learning using privileged information
dc.type: Conference Paper
dc.contributor.department: DEPARTMENT OF COMPUTER SCIENCE
dc.description.doi: 10.1145/3343031.3350870
dc.description.sourcetitle: ACM MM 2019
dc.description.page: 557-565
dc.grant.id: R-252-300-002-490
dc.grant.fundingagency: Infocomm Media Development Authority
dc.grant.fundingagency: National Research Foundation
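
To make the architecture described in the abstract above more concrete, the following is a minimal, hypothetical PyTorch sketch of the two-channel design: a pair of autoencoders for the image and ingredient modalities, latent alignment terms, and a mapping from the image embedding into the ingredient (“teacher”) channel whose prediction is fused with the image-channel prediction. All module names, dimensions, loss terms, and the simple additive fusion are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only; dimensions, losses, and fusion rule are assumptions,
# not the reference ATNet implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAE(nn.Module):
    """A simple autoencoder for one modality (image features or ingredient vectors)."""
    def __init__(self, in_dim: int, latent_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

class ATNetSketch(nn.Module):
    """Image and ingredient channels, a cross-modal mapping, and two classification heads."""
    def __init__(self, img_dim: int, ingr_dim: int, num_classes: int, latent_dim: int = 256):
        super().__init__()
        self.img_ae = ChannelAE(img_dim, latent_dim)     # image channel
        self.ingr_ae = ChannelAE(ingr_dim, latent_dim)   # ingredient ("teacher") channel
        self.img_to_ingr = nn.Linear(latent_dim, latent_dim)  # map image embedding into teacher channel
        self.cls_img = nn.Linear(latent_dim, num_classes)
        self.cls_ingr = nn.Linear(latent_dim, num_classes)

    def forward(self, img_feat, ingr_vec=None):
        z_img, img_rec = self.img_ae(img_feat)
        z_cross = self.img_to_ingr(z_img)                      # image seen from the teacher's view
        logits = self.cls_img(z_img) + self.cls_ingr(z_cross)  # multi-view fusion (simple sum here)
        if ingr_vec is None:                                   # test time: ingredients unavailable
            return logits
        z_ingr, ingr_rec = self.ingr_ae(ingr_vec)
        # Training-only auxiliary losses: reconstruction, latent alignment, cross-channel mapping.
        aux = (F.mse_loss(img_rec, img_feat) + F.mse_loss(ingr_rec, ingr_vec)
               + F.mse_loss(z_img, z_ingr) + F.mse_loss(z_cross, z_ingr))
        return logits, aux
```

At test time only image features (e.g., from a CNN backbone) are passed in; the ingredient vector (e.g., a multi-hot encoding) is supplied only during training, matching the LUPI setting in which the privileged modality is unavailable at inference.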
Appears in Collections: Staff Publications; Elements

Files in This Item:
File: 3343031.3350870.pdf | Size: 2.17 MB | Format: Adobe PDF | Access Settings: Open
