Please use this identifier to cite or link to this item:
https://doi.org/10.1145/3240508.3240627
DC Field | Value
---|---
dc.title | Deep Understanding of Cooking Procedure for Cross-modal Recipe Retrieval
dc.contributor.author | Jing-Jing Chen
dc.contributor.author | Chong-Wah Ngo
dc.contributor.author | Fu-Li Feng
dc.contributor.author | Tat-Seng Chua
dc.date.accessioned | 2020-04-28T02:07:05Z
dc.date.available | 2020-04-28T02:07:05Z
dc.date.issued | 2018-10-26
dc.identifier.citation | Jing-Jing Chen, Chong-Wah Ngo, Fu-Li Feng, Tat-Seng Chua (2018-10-26). Deep Understanding of Cooking Procedure for Cross-modal Recipe Retrieval. ACM Multimedia Conference 2018: 1020-1028. ScholarBank@NUS Repository. https://doi.org/10.1145/3240508.3240627
dc.identifier.isbn | 9781450356657
dc.identifier.uri | https://scholarbank.nus.edu.sg/handle/10635/167278
dc.description.abstract | Finding the right recipe that describes the cooking procedure for a dish from just one picture is an inherently difficult problem. Food preparation is a complex process involving raw ingredients, utensils, and cutting and cooking operations. This process gives clues to the multimedia presentation of a dish (e.g., taste, colour, shape). However, the description of the process is implicit, conveying only the cause of the dish's presentation rather than the visual effect that can be vividly observed in a picture. Therefore, unlike other cross-modal retrieval problems in the literature, recipe search requires understanding a textually described procedure in order to predict its likely consequence on visual appearance. In this paper, we approach this problem from the perspective of attention modeling. Specifically, we model the attention of words and sentences in a recipe and align them with its image feature such that both text and visual features share high similarity in a multi-dimensional space. On a large food dataset, Recipe1M, we empirically demonstrate that understanding the cooking procedure leads to improvement by a large margin over existing methods, which mostly consider only ingredient information. Furthermore, with attention modeling, we show that language-specific named-entity extraction becomes optional. This result points to the feasibility of performing cross-lingual cross-modal recipe retrieval with off-the-shelf machine translation engines. © 2018 Association for Computing Machinery.
dc.publisher | Association for Computing Machinery, Inc
dc.subject | Cross-modal learning
dc.subject | Hierarchical attention
dc.subject | Recipe retrieval
dc.type | Conference Paper
dc.contributor.department | DEPARTMENT OF COMPUTER SCIENCE
dc.description.doi | 10.1145/3240508.3240627
dc.description.sourcetitle | ACM Multimedia Conference 2018
dc.description.page | 1020-1028
dc.grant.id | R-252-300-002-490
dc.grant.fundingagency | Infocomm Media Development Authority
dc.grant.fundingagency | National Research Foundation
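
The abstract above describes hierarchical attention over the words and sentences of a recipe, aligned with an image feature so that matched recipe-image pairs score high similarity in a shared embedding space. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation: all class names, dimensions, and the bidirectional triplet ranking loss (a common objective for cross-modal retrieval, assumed here) are illustrative choices.

```python
# Hypothetical sketch of hierarchical attention for cross-modal recipe retrieval.
# Dimensions and hyperparameters are illustrative assumptions, not paper values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Additive attention pooling: collapses a sequence of vectors into one."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.context = nn.Parameter(torch.randn(dim))  # learned context vector

    def forward(self, seq):                              # seq: (batch, steps, dim)
        scores = torch.tanh(self.proj(seq)) @ self.context   # (batch, steps)
        weights = F.softmax(scores, dim=1)
        return (weights.unsqueeze(-1) * seq).sum(dim=1)      # (batch, dim)

class HierarchicalRecipeEncoder(nn.Module):
    """Words -> sentence vectors (word attention) -> recipe vector (sentence attention)."""
    def __init__(self, vocab_size, word_dim=300, hidden=512, embed_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.word_rnn = nn.GRU(word_dim, hidden, batch_first=True, bidirectional=True)
        self.word_attn = Attention(2 * hidden)
        self.sent_rnn = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.sent_attn = Attention(2 * hidden)
        self.fc = nn.Linear(2 * hidden, embed_dim)

    def forward(self, tokens):                           # tokens: (batch, sents, words)
        b, s, w = tokens.shape
        x = self.embed(tokens.view(b * s, w))            # embed every word
        x, _ = self.word_rnn(x)
        sent_vecs = self.word_attn(x).view(b, s, -1)     # one vector per sentence
        h, _ = self.sent_rnn(sent_vecs)
        recipe = self.sent_attn(h)                       # one vector per recipe
        return F.normalize(self.fc(recipe), dim=-1)

class ImageEncoder(nn.Module):
    """Projects a precomputed CNN image feature (e.g. 2048-d) into the shared space."""
    def __init__(self, feat_dim=2048, embed_dim=1024):
        super().__init__()
        self.fc = nn.Linear(feat_dim, embed_dim)

    def forward(self, feats):
        return F.normalize(self.fc(feats), dim=-1)

def triplet_loss(recipe_emb, image_emb, margin=0.3):
    """Bidirectional ranking loss: matched pairs beat mismatches by a margin."""
    sims = recipe_emb @ image_emb.t()                    # cosine sims (inputs normalized)
    pos = sims.diag().unsqueeze(1)
    cost_r2i = F.relu(margin + sims - pos).fill_diagonal_(0)
    cost_i2r = F.relu(margin + sims - pos.t()).fill_diagonal_(0)
    return cost_r2i.mean() + cost_i2r.mean()

# Toy usage: 2 recipes, each 3 sentences of 5 word ids; 2048-d image features.
recipes = torch.randint(0, 1000, (2, 3, 5))
images = torch.randn(2, 2048)
loss = triplet_loss(HierarchicalRecipeEncoder(vocab_size=1000)(recipes),
                    ImageEncoder()(images))
```

Under this reading, the learned attention weights indicate which words and sentences of the procedure drive the alignment, which is consistent with the abstract's claim that language-specific named-entity extraction becomes optional.
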
Appears in Collections: Staff Publications Elements
Files in This Item:
File | Description | Size | Format | Access Settings | Version
---|---|---|---|---|---
Deep Understanding of Cooking Procedure for Cross-modal Recipe Retrieval.pdf | | 5.6 MB | Adobe PDF | OPEN | None