{"id":61277,"date":"2021-12-06T15:29:50","date_gmt":"2021-12-06T06:29:50","guid":{"rendered":"https:\/\/smilegate.ai\/?p=61277"},"modified":"2021-12-06T16:34:02","modified_gmt":"2021-12-06T07:34:02","slug":"microsoft-_nuwa-visual-synthesis-pre-training-for-neural-visual-world-creation","status":"publish","type":"post","link":"https:\/\/smilegate.ai\/en\/2021\/12\/06\/microsoft-_nuwa-visual-synthesis-pre-training-for-neural-visual-world-creation\/","title":{"rendered":"Microsoft _NUWA : Visual Synthesis Pre-training for Neural visUal World creAtion"},"content":{"rendered":"

[Convergence Research Team, Jihyun Song]

Microsoft has announced NUWA, a multimodal pretrained model that leverages existing visual data (images and video) to generate and manipulate new visual content.

The figure below summarizes the results of the study across eight downstream visual synthesis tasks.

\"\"<\/figure>\n\n\n\n

To handle text, images, and video simultaneously across different scenarios, the authors designed a 3D transformer encoder-decoder framework. It not only treats video as 3D data, but also adapts to text and images as 1D and 2D data, respectively. To reduce computational complexity, the authors also propose a 3D Nearby Attention (3DNA) mechanism.
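To see why restricting attention to a neighborhood helps, here is an illustrative calculation (the sizes are chosen for this example, not taken from the paper): a video discretized into a 10×16×16 grid yields 2,560 tokens, so full self-attention scores about 2,560² ≈ 6.6 million query-key pairs per layer, whereas nearby attention with a 3×3×3 window scores at most 2,560 × 27 ≈ 69 thousand.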

Compared against the latest state-of-the-art models, NUWA showed strong performance not only on text-to-image generation but also on text-to-video generation and video prediction.

Moreover, NUWA showed remarkably strong zero-shot performance on text-guided image manipulation and, more surprisingly, on text-guided video manipulation.

Related works

• Visual Auto-Regressive Models: For visual tokenization, NUWA uses VQ-GAN instead of VQ-VAE, which the authors' experiments show leads to better generation quality.

• Visual Sparse Self-Attention: The study verifies that, for visual generation, local-wise sparse attention outperforms axial-wise sparse attention; the sketch after this list contrasts the two mask patterns.
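As a small illustration of that difference, the following sketch builds both attention masks over a flattened 2D token grid: axial-wise attention lets each query attend to every token in its own row and column, while local-wise attention restricts it to a small neighborhood window. The grid size and window radius are arbitrary choices for the example, not the paper's settings.

```python
# Illustrative comparison of axial-wise vs. local-wise sparse attention masks
# over a flattened H x W token grid (NumPy; sizes are arbitrary).
import numpy as np

def axial_mask(H, W):
    """mask[i, j] is True when flattened positions i and j share a row or a column."""
    ys, xs = np.divmod(np.arange(H * W), W)
    return (ys[:, None] == ys[None, :]) | (xs[:, None] == xs[None, :])

def local_mask(H, W, radius=1):
    """mask[i, j] is True when i and j lie within a (2*radius+1)^2 local window."""
    ys, xs = np.divmod(np.arange(H * W), W)
    return (np.abs(ys[:, None] - ys[None, :]) <= radius) & \
           (np.abs(xs[:, None] - xs[None, :]) <= radius)

H, W = 8, 8
print(axial_mask(H, W).sum(axis=1)[0])  # 15: the corner query sees its row and column
print(local_mask(H, W).sum(axis=1)[0])  # 4: the corner query sees only its 2x2 corner neighborhood
```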

Method

• 3D Data Representation: To cover text, images, video, and their sketches in a single model, the study treats them all as tokens and defines a unified 3D representation (see the figure and the code sketch below).

\"\"<\/figure>\n\n\n\n

• 3D Nearby Self-Attention: A 3D nearby attention (3DNA) module is defined on top of the 3D data representation above, supporting both self-attention and cross-attention (a minimal sketch follows).
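The sketch below shows the nearby-attention idea for a single head in plain NumPy: each query position attends only to key/value positions inside a local 3D window around it. It assumes the query and key/value grids have the same size (the paper also handles condition grids of a different size, which this sketch omits) and it leaves out the learned query/key/value projections a real layer would have; the window size and feature dimension are illustrative.

```python
# Single-head 3D nearby attention over an (H, W, T, d) token grid (sketch).
import numpy as np

def nearby_attention_3d(x, context=None, window=(3, 3, 3)):
    """x: queries of shape (H, W, T, d). context: optional key/value grid of the
    same shape (cross-attention); defaults to x (self-attention)."""
    kv = x if context is None else context
    H, W, T, d = x.shape
    rh, rw, rt = (s // 2 for s in window)
    out = np.zeros_like(x)
    for h in range(H):
        for w in range(W):
            for t in range(T):
                # Gather the local window, clipped at the grid boundaries.
                keys = kv[max(0, h - rh):h + rh + 1,
                          max(0, w - rw):w + rw + 1,
                          max(0, t - rt):t + rt + 1].reshape(-1, d)
                scores = keys @ x[h, w, t] / np.sqrt(d)
                weights = np.exp(scores - scores.max())
                weights /= weights.sum()
                out[h, w, t] = weights @ keys  # values == keys in this sketch
    return out

x = np.random.randn(4, 4, 5, 8)
print(nearby_attention_3d(x).shape)  # (4, 4, 5, 8)
```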

• 3D Encoder-Decoder: A 3D encoder-decoder built on top of 3DNA is introduced; the encoder reads the conditioning input (for example, text) and the decoder generates the target visual tokens autoregressively.
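To make the overall flow concrete, here is a toy text-conditioned visual-token model that uses standard PyTorch transformer layers in place of 3DNA, so it is a sketch of the encoder-decoder flow rather than the paper's architecture: the encoder reads text tokens and the decoder predicts the next discrete visual token (e.g. a VQ-GAN index), which a separately trained VQ-GAN decoder would map back to pixels. All names and sizes are illustrative.

```python
# Toy text-to-visual-token encoder-decoder (standard transformer layers, not 3DNA).
import torch
import torch.nn as nn

TEXT_VOCAB, VISUAL_VOCAB, D = 1000, 8192, 64

class ToyNUWA(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_emb = nn.Embedding(TEXT_VOCAB, D)
        self.vis_emb = nn.Embedding(VISUAL_VOCAB, D)
        self.transformer = nn.Transformer(
            d_model=D, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        self.head = nn.Linear(D, VISUAL_VOCAB)

    def forward(self, text_ids, vis_ids):
        # Encode the text condition once; the decoder cross-attends to it.
        memory = self.transformer.encoder(self.text_emb(text_ids))
        # Causal mask so each visual token only sees previously generated ones.
        L = vis_ids.size(1)
        mask = self.transformer.generate_square_subsequent_mask(L)
        h = self.transformer.decoder(self.vis_emb(vis_ids), memory, tgt_mask=mask)
        return self.head(h)  # next-token logits over the visual codebook

model = ToyNUWA()
text = torch.randint(TEXT_VOCAB, (1, 12))
vis = torch.randint(VISUAL_VOCAB, (1, 16 * 16))  # a flattened 16x16 token grid
print(model(text, vis).shape)  # torch.Size([1, 256, 8192])
```

At inference time one would sample visual tokens one at a time from the logits, append each to the visual sequence, and repeat until the full token grid is filled.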

Qualitative comparison with state-of-the-art models for the Text-to-Image (T2I) task on the MSCOCO dataset

(In the figure below, the red boxes mark NUWA's outputs for each input.)

\"\"<\/figure>\n\n\n\n

Quantitative comparison with state-of-the-art models for the Text-to-Video (T2V) task on the Kinetics dataset

\"\"<\/figure>\n\n\n\n

NUWA outperforms other state-of-the-art models, and this work is a first step toward building an AI platform that enables visual world creation and supports content creators.

Led by Microsoft's NUWA, this research deserves praise for its many advances and contributions: a framework that considers text, images, and video at the same time, a nearby-sparse attention mechanism that accounts for locality along both the spatial and temporal axes, and comprehensive experiments on the eight synthesis tasks shown above.

References: https://github.com/microsoft/NUWA
https://arxiv.org/abs/2111.12417
