Please use this identifier to cite or link to this item: https://doi.org/10.1145/3240508.3240627
dc.title: Deep Understanding of Cooking Procedure for Cross-modal Recipe Retrieval
dc.contributor.author: Jing-Jing Chen
dc.contributor.author: Chong-Wah Ngo
dc.contributor.author: Fu-Li Feng
dc.contributor.author: Tat-Seng Chua
dc.date.accessioned: 2020-04-28T02:07:05Z
dc.date.available: 2020-04-28T02:07:05Z
dc.date.issued: 2018-10-26
dc.identifier.citation: Jing-Jing Chen, Chong-Wah Ngo, Fu-Li Feng, Tat-Seng Chua (2018-10-26). Deep Understanding of Cooking Procedure for Cross-modal Recipe Retrieval. ACM Multimedia Conference 2018: 1020-1028. ScholarBank@NUS Repository. https://doi.org/10.1145/3240508.3240627
dc.identifier.isbn: 9781450356657
dc.identifier.uri: https://scholarbank.nus.edu.sg/handle/10635/167278
dc.description.abstract: Finding the right recipe that describes the cooking procedure for a dish from just one picture is an inherently difficult problem. Food preparation is a complex process involving raw ingredients, utensils, and cutting and cooking operations. This process gives clues to the multimedia presentation of a dish (e.g., taste, colour, shape). However, the description of the process is implicit, conveying only the cause of the dish's presentation rather than the visual effect that can be vividly observed in a picture. Therefore, unlike other cross-modal retrieval problems in the literature, recipe search requires understanding a textually described procedure to predict its possible consequence on visual appearance. In this paper, we approach this problem from the perspective of attention modeling. Specifically, we model the attention of words and sentences in a recipe and align them with its image feature such that both text and visual features share high similarity in a multi-dimensional space. On a large food dataset, Recipe1M, we empirically demonstrate that understanding the cooking procedure leads to improvement by a large margin over existing methods, which mostly consider only ingredient information. Furthermore, with attention modeling, we show that language-specific named-entity extraction becomes optional. The result sheds light on the feasibility of performing cross-lingual cross-modal recipe retrieval with off-the-shelf machine translation engines. © 2018 Association for Computing Machinery. (An illustrative sketch of the attention-and-alignment idea described here follows the metadata fields below.)
dc.publisher: Association for Computing Machinery, Inc.
dc.subject: Cross-modal learning
dc.subject: Hierarchical attention
dc.subject: Recipe retrieval
dc.type: Conference Paper
dc.contributor.department: DEPARTMENT OF COMPUTER SCIENCE
dc.description.doi: 10.1145/3240508.3240627
dc.description.sourcetitle: ACM Multimedia Conference 2018
dc.description.page: 1020-1028
dc.grant.id: R-252-300-002-490
dc.grant.fundingagency: Infocomm Media Development Authority
dc.grant.fundingagency: National Research Foundation
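
The attention-and-alignment idea in the abstract can be made concrete with a short sketch: word vectors are pooled into sentence vectors, sentence vectors into a recipe embedding, and the recipe embedding is trained to lie close to the matching image embedding in a shared space. The PyTorch code below is a minimal illustration under stated assumptions, not the authors' implementation; all module names, layer sizes, the learned-context-vector attention, and the bidirectional hinge loss are illustrative choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    # Soft attention that pools a sequence of vectors into one vector.
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.context = nn.Parameter(torch.randn(dim))  # learned query vector

    def forward(self, x):                              # x: (batch, seq, dim)
        scores = torch.tanh(self.proj(x)) @ self.context   # (batch, seq)
        weights = F.softmax(scores, dim=1).unsqueeze(-1)   # (batch, seq, 1)
        return (weights * x).sum(dim=1)                    # (batch, dim)

class HierarchicalRecipeEncoder(nn.Module):
    # Two attention levels: words -> sentence vectors -> recipe embedding.
    def __init__(self, dim=300, embed_dim=1024):
        super().__init__()
        self.word_attn = AttentionPool(dim)
        self.sent_attn = AttentionPool(dim)
        self.fc = nn.Linear(dim, embed_dim)

    def forward(self, words):           # words: (batch, n_sent, n_word, dim)
        b, s, w, d = words.shape
        sents = self.word_attn(words.reshape(b * s, w, d)).reshape(b, s, d)
        recipe = self.sent_attn(sents)                 # (batch, dim)
        return F.normalize(self.fc(recipe), dim=-1)    # unit-length embedding

def ranking_loss(img, txt, margin=0.3):
    # Bidirectional hinge loss: a matched image/recipe pair must out-score
    # every mismatched pair in the batch by at least `margin`.
    sim = img @ txt.t()                 # cosine similarities (unit vectors)
    pos = sim.diag().view(-1, 1)
    cost_i = (margin + sim - pos).clamp(min=0)      # wrong recipes for an image
    cost_t = (margin + sim - pos.t()).clamp(min=0)  # wrong images for a recipe
    cost_i.fill_diagonal_(0)
    cost_t.fill_diagonal_(0)
    return cost_i.mean() + cost_t.mean()

# Toy usage: 2 recipes, 4 sentences of 10 words each, 300-d word vectors.
enc = HierarchicalRecipeEncoder()
txt = enc(torch.randn(2, 4, 10, 300))
img = F.normalize(torch.randn(2, 1024), dim=-1)     # stand-in image features
loss = ranking_loss(img, txt)

Here the hierarchy mirrors a recipe's structure: word-level attention decides which words matter within each instruction sentence, and sentence-level attention decides which sentences matter for the dish's appearance. The ranking loss pulls matched image/recipe pairs together in the shared space, so retrieval reduces to nearest-neighbour search over the embeddings.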
Appears in Collections: Staff Publications; Elements

Files in This Item:
File: Deep Understanding of Cooking Procedure for Cross-modal Recipe Retrieval.pdf (5.6 MB, Adobe PDF)
Access Settings: Open
Version: None
