{"id":61023,"date":"2021-10-24T14:24:59","date_gmt":"2021-10-24T05:24:59","guid":{"rendered":"https:\/\/smilegate.ai\/?p=61023"},"modified":"2021-10-24T14:26:29","modified_gmt":"2021-10-24T05:26:29","slug":"deep-learning-optimized-learning","status":"publish","type":"post","link":"https:\/\/smilegate.ai\/cn\/2021\/10\/24\/deep-learning-optimized-learning\/","title":{"rendered":"Deep learning? Optimized learning!"},"content":{"rendered":"
\"\"<\/figure><\/div>\n\n\n\n

[Advanced Research Team, Kim Seong-hyeon]<\/p>\n\n\n\n

As the pre-trained language model (PLM) strategy has proven highly successful in natural language processing, building ever-larger PLMs on ever more data has become a trend.
Recently, NVIDIA unveiled a model with 530B parameters, roughly three times the size of GPT-3 (175B).
The model builds on the existing Megatron-LM and Turing-NLG models and is named “Megatron-Turing NLG” (MT-NLG).
It was reportedly trained on 560 DGX A100 80GB servers tied together into a single cluster. A model on this scale is practically impossible to even experiment with unless you are NVIDIA!<\/p>\n\n\n\n

\"Chart<\/figure><\/div>\n\n\n\n

It consists of 105 transformer layers in total and is reported to achieve the best performance on zero-, one-, and few-shot learning tasks.<\/p>\n\n\n\n

Training a model this large takes more than just a lot of money, a lot of data, and a lot of GPUs.
There are two reasons. (1) GPU memory is limited, and it is nowhere near enough to hold and train all the parameters of a model this size on a single device. (2) Unless the training algorithm, the data processing pipeline, and the software-hardware stack are all optimized together, training can take an unrealistically long time.<\/p>\n\n\n\n
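To make point (1) concrete, here is a rough back-of-the-envelope sketch in Python. The figure of 16 bytes per parameter is an assumption (a commonly cited cost of mixed-precision Adam training: fp16 weights and gradients plus fp32 master weights, momentum, and variance), not a number from the MT-NLG announcement, and activation memory is ignored entirely.

```python
import math

PARAMS = 530e9          # MT-NLG parameter count (from the post)
GPU_MEM_BYTES = 80e9    # one A100 80GB, counted in decimal gigabytes (assumption)

# Assumed bytes per parameter; illustrative, not from the MT-NLG announcement.
FP16_WEIGHTS_ONLY = 2
MIXED_PRECISION_ADAM = 16  # 2 (fp16 weights) + 2 (fp16 grads) + 12 (fp32 weights, momentum, variance)


def min_gpus(params, bytes_per_param, gpu_mem_bytes=GPU_MEM_BYTES):
    '''Smallest number of GPUs whose combined memory holds the given model state.'''
    return math.ceil(params * bytes_per_param / gpu_mem_bytes)


weights_tb = PARAMS * FP16_WEIGHTS_ONLY / 1e12
train_tb = PARAMS * MIXED_PRECISION_ADAM / 1e12
print(f'fp16 weights alone: {weights_tb:.2f} TB, at least {min_gpus(PARAMS, FP16_WEIGHTS_ONLY)} GPUs')
print(f'training state: {train_tb:.2f} TB, at least {min_gpus(PARAMS, MIXED_PRECISION_ADAM)} GPUs')
```

Under these assumptions the training state alone needs more than a hundred 80GB GPUs before a single activation is stored, which is why a model like this has to be split across many devices.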

MT-NLG was reportedly made possible by a collaboration between Microsoft and NVIDIA that achieved unprecedented training efficiency \ud83d\ude42
In other words, efficient training requires understanding the system architecture of both the hardware and the software.
More details are available here (link<\/a>).<\/p>\n\n\n\n


\n\n\n\n

On the topic of training optimization, I would also like to share another interesting paper<\/a>.
As the title (How to Train BERT with an Academic Budget) suggests, it is about how to optimize the training of large-scale models such as BERT so that they can be trained cheaply.<\/p>\n\n\n\n

The paper starts by fixing a constrained training setup: (1) training must finish within 24 hours, and (2) it runs on 8 NVIDIA Titan-V GPUs (12GB each).
For reference, 24 hours on 8 Titan-V GPUs is said to be roughly equivalent to one day on 4 RTX 3090 GPUs, or 2.4 days on a single 40GB A100 GPU \ud83d\ude42
The training data is 16GB of text taken from English Wikipedia and the Toronto BookCorpus.<\/p>\n\n\n\n
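The hardware equivalence above is easier to compare as GPU-hours. The following Python sketch uses only the figures quoted in this post; the implied per-GPU speedups are derived by simple division and assume all three setups do the same total work, which is a simplification.

```python
# (number of GPUs, wall-clock days) for each setup quoted above
setups = {
    'Titan-V': (8, 1.0),
    'RTX 3090': (4, 1.0),
    'A100 40GB': (1, 2.4),
}

# GPU-hours = GPUs x days x 24
gpu_hours = {name: n * days * 24 for name, (n, days) in setups.items()}

# Implied per-GPU throughput relative to one Titan-V, assuming equal total work
baseline = gpu_hours['Titan-V']
relative_speed = {name: baseline / hours for name, hours in gpu_hours.items()}

for name in setups:
    print(f'{name}: {gpu_hours[name]:.1f} GPU-hours, about {relative_speed[name]:.2f}x a Titan-V per GPU')
```

Read this way, the budget is 192 Titan-V GPU-hours, and the quoted equivalences imply that one RTX 3090 does the work of about two Titan-Vs and one A100 the work of a little over three.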

Training uses a BERT-style transformer encoder with the masked language modeling (MLM) objective.
Because the PLM is meant for sentence classification tasks, the sequence length is limited to 128 tokens, an approach also used in the original BERT paper (which trained the first 90% of steps with 128 tokens and the remaining 10% with 512 tokens).
Since next sentence prediction (NSP) is widely known to contribute little, it was removed and the model was trained on single sequences only. To cut even the time spent computing validation loss, which counts toward the training budget, only a 0.5% validation set was evaluated every 30 minutes.
The model size was set to match BERT-large, and data parallelism and mixed-precision training were applied through DeepSpeed.
The MLM prediction head was replaced with sparse token prediction, and APEX's fused LayerNorm was applied to further optimize training.<\/p>\n\n\n\n
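Here is a minimal pure-Python sketch of the idea behind sparse token prediction, under the assumption that it means computing vocabulary logits only at the masked positions rather than at every position. The toy sizes, the random weights, and the function names dense_head and sparse_head are made up for illustration; this is not the paper's implementation.

```python
import random

random.seed(0)
seq_len, hidden, vocab = 16, 8, 50   # toy sizes, not BERT-large

# Stand-ins for the encoder output and the MLM output projection
hidden_states = [[random.gauss(0, 1) for _ in range(hidden)] for _ in range(seq_len)]
head_weights = [[random.gauss(0, 1) for _ in range(vocab)] for _ in range(hidden)]

# Mask roughly 15% of the positions, as in standard MLM
num_masked = max(1, round(0.15 * seq_len))
masked_positions = sorted(random.sample(range(seq_len), k=num_masked))


def project(state, weights):
    '''Map one hidden vector to a list of vocabulary logits.'''
    return [sum(state[i] * weights[i][j] for i in range(len(state)))
            for j in range(len(weights[0]))]


def dense_head(states, weights, positions):
    '''Project every position to the vocabulary, then keep the masked ones.'''
    all_logits = [project(s, weights) for s in states]    # seq_len projections
    return [all_logits[p] for p in positions]


def sparse_head(states, weights, positions):
    '''Select the masked positions first, then project only those.'''
    return [project(states[p], weights) for p in positions]  # num_masked projections


dense = dense_head(hidden_states, head_weights, masked_positions)
sparse = sparse_head(hidden_states, head_weights, masked_positions)

print(f'projections computed: dense={seq_len}, sparse={num_masked}')
print(f'same logits at masked positions: {dense == sparse}')
```

Both heads return the same logits for the masked tokens, but the sparse version skips the output projection for every unmasked position, which is where the savings come from when the vocabulary is large.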

\"\"<\/figure><\/div>\n\n\n\n

In the end, the optimized BERT model trained about twice as fast as the original BERT at the same batch size (bsz), and with the batch size pushed as large as possible, training reportedly became feasible in 2.41 days \ud83d\ude42<\/p>\n\n\n\n

\"\"<\/figure><\/div>\n\n\n\n

Even with training capped at just 24 hours, the model reportedly performed on par with existing PLMs! \ud83d\ude42<\/p>\n\n\n\n

Training optimization techniques keep advancing!
Someday it may even become possible to train GPT-3 on a personal PC \ud83d\ude00<\/p>\n\n\n\n


<\/p>\n

<\/span><\/div>","protected":false},"excerpt":{"rendered":"

[Advanced Research Team, Kim Seong-hyeon] As the pre-trained language model (PLM) strategy has proven highly successful in natural language processing, building ever-larger PLMs on ever more data has become a trend. Recently, NVIDIA unveiled a model with 530B parameters, roughly three times the size of GPT-3 (175B). The model builds on the existing Megatron-LM and Turing-NLG models and is named “Megatron-Turing NLG” (MT-NLG). It was reportedly trained on 560 DGX A100 80GB servers…<\/p>\n

<\/span><\/div>","protected":false},"author":1,"featured_media":61024,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"rank_math_lock_modified_date":false,"footnotes":""},"categories":[532,19],"tags":[],"class_list":["post-61023","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-nlp","category-tech04","category-532","category-19","description-off"],"_links":{"self":[{"href":"https:\/\/smilegate.ai\/cn\/wp-json\/wp\/v2\/posts\/61023","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/smilegate.ai\/cn\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/smilegate.ai\/cn\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/smilegate.ai\/cn\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/smilegate.ai\/cn\/wp-json\/wp\/v2\/comments?post=61023"}],"version-history":[{"count":1,"href":"https:\/\/smilegate.ai\/cn\/wp-json\/wp\/v2\/posts\/61023\/revisions"}],"predecessor-version":[{"id":61027,"href":"https:\/\/smilegate.ai\/cn\/wp-json\/wp\/v2\/posts\/61023\/revisions\/61027"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/smilegate.ai\/cn\/wp-json\/wp\/v2\/media\/61024"}],"wp:attachment":[{"href":"https:\/\/smilegate.ai\/cn\/wp-json\/wp\/v2\/media?parent=61023"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/smilegate.ai\/cn\/wp-json\/wp\/v2\/categories?post=61023"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/smilegate.ai\/cn\/wp-json\/wp\/v2\/tags?post=61023"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}