GitHub Repository: huggingface/notebooks
Path: blob/main/transformers_doc/ko/tokenizer_summary.ipynb
# Transformers installation
! pip install transformers datasets evaluate accelerate
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

Summary of the tokenizers[[summary-of-the-tokenizers]]

On this page, we will have a closer look at tokenization.

#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/VFp38yj8h3A?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

As we saw in the preprocessing tutorial, tokenizing a text means splitting it into words or subwords, which are then converted to ids through a look-up table. Converting words or subwords to ids is straightforward, so this summary focuses on splitting a text into words or subwords (i.e. tokenizing a text). More specifically, we will look at the three main types of tokenizers used in 🤗 Transformers: Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, and show examples of which tokenizer type is used by which model.

On each model page, you can look at the documentation of the associated tokenizer to find out which tokenizer type was used by the pretrained model. For instance, if we look at BertTokenizer, we can see that the model uses WordPiece.

Introduction[[introduction]]

Splitting a text into smaller chunks is harder than it looks, and there are multiple ways of doing so. For instance, let's look at the sentence "Don't you love 🤗 Transformers? We sure do."

#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/nhJxYji1aho?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

A simple way of tokenizing this text is to split it by spaces, which would give:

["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]

This is a sensible first step, but if we look at the tokens "Transformers?" and "do.", we notice that the punctuation is attached to the words "Transformers" and "do", which is suboptimal. We should take the punctuation into account so that a model does not have to learn a different representation of a word for every punctuation symbol that could follow it. Otherwise, the number of representations the model has to learn would explode.

Taking punctuation into account, tokenizing our example text would give:

["Don", "'", "t", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
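The two naive strategies shown so far can be reproduced in a few lines of plain Python. This is only a minimal sketch, not what any 🤗 Transformers tokenizer does internally; the regular expression is an assumption chosen to mimic the splits listed above.

```python
import re

text = "Don't you love 🤗 Transformers? We sure do."

# Whitespace tokenization: punctuation stays attached to the words.
space_tokens = text.split()

# Whitespace + punctuation tokenization: each run of word characters becomes
# one token, and every other non-space character becomes its own token.
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

print(space_tokens)
print(punct_tokens)
```

The second variant separates "?" and "." from the words, but it also breaks "Don't" into three pieces.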

Better. However, the way the tokenization dealt with the word "Don't" is still not ideal. "Don't" stands for "do not", so it would be better tokenized as ["Do", "n't"]. This is where things start getting complicated, and it is part of the reason each model has its own tokenizer type. Depending on the rules we apply for tokenizing a text, a different tokenized output is generated for the same text. A pretrained model only performs properly if you feed it an input that was tokenized with the same rules that were used to tokenize its training data.

spaCy and Moses are two popular rule-based tokenizers. Applied to our example, spaCy and Moses would output something like:

["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]

As can be seen, space and punctuation tokenization as well as rule-based tokenization are used here. Both are examples of word tokenization, which is loosely defined as splitting sentences into words. While it is the most intuitive way to split texts into smaller chunks, this tokenization method can lead to problems for massive text corpora. In this case, space and punctuation tokenization usually generates a very big vocabulary (the set of all unique words and tokens used). For example, Transformer XL uses space and punctuation tokenization, resulting in a vocabulary size of 267,735!

Such a big vocabulary size forces the model to have an enormous embedding matrix as the input and output layer, which increases both memory and time complexity. In general, transformers models rarely have a vocabulary size greater than 50,000, especially if they are pretrained only on a single language. So if simple space and punctuation tokenization is unsatisfactory, why not simply tokenize on characters?
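To get a feel for the cost, note that an embedding matrix has one row per vocabulary entry, so its size is the vocabulary size times the embedding dimension. The sketch below assumes a hypothetical embedding dimension of 1,024 and 4 bytes per parameter (float32). Neither number is taken from a specific model.

```python
def embedding_parameters(vocab_size: int, hidden_size: int) -> int:
    """Number of entries in a (vocab_size x hidden_size) embedding matrix."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 1024      # hypothetical embedding dimension
BYTES_PER_PARAM = 4     # assuming float32 weights

for name, vocab_size in [("word-level, 267,735 entries", 267_735),
                         ("subword, 50,000 entries", 50_000)]:
    params = embedding_parameters(vocab_size, HIDDEN_SIZE)
    megabytes = params * BYTES_PER_PARAM / 1e6
    print(f"{name}: {params:,} parameters, about {megabytes:,.0f} MB")
```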

#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/ssLq_EK2jLE?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

While character tokenization is very simple and would greatly reduce memory and time complexity, it makes it much harder for the model to learn meaningful input representations.

For example, learning a meaningful context-independent representation for the letter "t" is much harder than learning a context-independent representation for the word "today". Character tokenization is therefore often accompanied by a loss of performance. To get the best of both worlds, transformers models use a hybrid between word-level and character-level tokenization called subword tokenization.

Subword tokenization[[subword-tokenization]]

#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/zHvTiHr506c?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. For instance, "annoyingly" might be considered a rare word and could be decomposed into "annoying" and "ly". Both "annoying" and "ly" appear more frequently as stand-alone subwords, while the meaning of "annoyingly" is kept by the composite meaning of "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords.

Subword tokenization allows the model to have a reasonable vocabulary size while being able to learn meaningful context-independent representations. In addition, subword tokenization enables the model to process words it has never seen before, by decomposing them into known subwords.

For instance, the BertTokenizer tokenizes "I have a new GPU!" as follows:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
tokenizer.tokenize("I have a new GPU!")
["i", "have", "a", "new", "gp", "##u", "!"]

Because we are using an uncased model, the sentence was lowercased first. We can see that the words ["i", "have", "a", "new"] are present in the tokenizer's vocabulary, but the word "gpu" is not. Consequently, the tokenizer splits "gpu" into known subwords: ["gp", "##u"]. "##" means that the rest of the token should be attached to the previous one, without a space (for decoding or reversal of the tokenization).
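The role of the "##" prefix can be illustrated with a small pure-Python sketch. It assumes a greedy longest-match-first lookup, which is how WordPiece-style inference is commonly implemented. The tiny vocabulary and the helper names `wordpiece_tokenize` and `detokenize` are made up for illustration and are not part of the 🤗 Transformers API.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no known subword matches at this position
        tokens.append(piece)
        start = end
    return tokens


def detokenize(tokens):
    """Reverse the split: "##" pieces are glued to the previous token."""
    text = ""
    for tok in tokens:
        if tok.startswith("##"):
            text += tok[2:]
        elif text:
            text += " " + tok
        else:
            text = tok
    return text


toy_vocab = {"i", "have", "a", "new", "gp", "##u", "!"}  # hypothetical vocabulary
words = ["i", "have", "a", "new", "gpu", "!"]
tokens = [piece for w in words for piece in wordpiece_tokenize(w, toy_vocab)]
print(tokens)
print(detokenize(tokens))
```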

As another example, XLNetTokenizer tokenizes our earlier example text as follows:

from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet/xlnet-base-cased")
tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?", "▁We", "▁sure", "▁do", "."]

We will get back to the meaning of the "▁" symbols when we look at SentencePiece. As one can see, the rare word "Transformers" has been split into the more frequent subwords "Transform" and "ers".

Let's now look at how the different subword tokenization algorithms work. Note that all of these tokenization algorithms rely on some form of training, which is usually done on the corpus the corresponding model will be trained on.

Byte-Pair Encoding (BPE)[[bytepair-encoding-bpe]]

Byte-Pair Encoding (BPE) was introduced in Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015). BPE relies on a pre-tokenizer that splits the training data into words. Pretokenization can be as simple as space tokenization, e.g. GPT-2 and RoBERTa. More advanced pre-tokenization includes rule-based tokenization, e.g. XLM, FlauBERT (which uses Moses for most languages), or GPT (which uses spaCy and ftfy), to count the frequency of each word in the training corpus.

After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set of unique words and learns merge rules that form a new symbol from two symbols of the current vocabulary. It does so until the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to define before training the tokenizer.

As an example, let's assume that after pre-tokenization, the following set of words including their frequency has been determined:

("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)

Consequently, the base vocabulary is ["b", "g", "h", "n", "p", "s", "u"]. Splitting all words into symbols of the base vocabulary, we obtain:

("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)

BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently. In the example above, "h" followed by "u" is present 10 + 5 = 15 times (10 times in the 10 occurrences of "hug", 5 times in the 5 occurrences of "hugs").

However, the most frequent symbol pair is "u" followed by "g", occurring 10 + 5 + 5 = 20 times in total. Thus, the first merge rule the tokenizer learns is to group all "u" symbols followed by a "g" symbol together. Next, "ug" is added to the vocabulary. The set of words then becomes:

("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)

BPE then identifies the next most common symbol pair. It is "u" followed by "n", which occurs 16 times, so "u" and "n" are merged into "un", which is added to the vocabulary. The next most frequent symbol pair is "h" followed by "ug", occurring 15 times. Again the pair is merged, and "hug" is added to the vocabulary.

At this stage, the vocabulary is ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"] and our set of unique words is represented as:

("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)
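The three merges in this walkthrough can be reproduced with a compact training loop. This is a minimal sketch of the counting-and-merging procedure on the toy word frequencies from the text. The function name `train_bpe` is hypothetical, and real implementations add pre-tokenization, tie-breaking rules, and many optimizations.

```python
from collections import Counter


def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` BPE merge rules from a {word: frequency} table."""
    splits = {word: list(word) for word in word_freqs}          # start from characters
    vocab = sorted({ch for word in word_freqs for ch in word})  # base vocabulary
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for word, freq in word_freqs.items():
            symbols = splits[word]
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab.append(best[0] + best[1])
        # Apply the new merge rule to every word.
        for word, symbols in splits.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            splits[word] = merged
    return vocab, merges, splits


word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
vocab, merges, splits = train_bpe(word_freqs, num_merges=3)
print(merges)
print(vocab)
print(splits)
```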

Assuming that the Byte-Pair Encoding training stops at this point, the learned merge rules are then applied to new words (as long as those new words do not include symbols that were not in the base vocabulary). For instance, the word "bug" would be tokenized as ["b", "ug"], but "mug" would be tokenized as ["<unk>", "ug"] since the symbol "m" is not in the base vocabulary. In general, single letters such as "m" are not replaced by the "<unk>" symbol because the training data usually includes at least one occurrence of each letter, but it is likely to happen for very special characters such as emojis.
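Applying learned merges to unseen words can be sketched as follows. The merge list and base vocabulary are copied from the toy example in the text, and the helper `bpe_tokenize` is a hypothetical name used only for this illustration.

```python
BASE_VOCAB = {"b", "g", "h", "n", "p", "s", "u"}
MERGES = [("u", "g"), ("u", "n"), ("h", "ug")]  # in the order they were learned


def bpe_tokenize(word, merges=MERGES, base_vocab=BASE_VOCAB, unk="<unk>"):
    """Split a word into characters, then replay the merge rules in order."""
    # Characters missing from the base vocabulary become the unknown symbol.
    symbols = [ch if ch in base_vocab else unk for ch in word]
    for a, b in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


for word in ["bug", "mug", "hugs"]:
    print(word, bpe_tokenize(word))
```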

As mentioned earlier, the vocabulary size, i.e. the base vocabulary size plus the number of merges, is a hyperparameter to choose. For instance, GPT has a vocabulary size of 40,478 since it has 478 base characters and its training was stopped after 40,000 merges.

Byte-level BPE[[bytelevel-bpe]]

A base vocabulary that includes all possible base characters can be quite large, for example if all unicode characters are considered base characters. To have a better base vocabulary, GPT-2 uses bytes as the base vocabulary, which forces the base vocabulary to be of size 256 while ensuring that every base character is included in the vocabulary. With some additional rules to deal with punctuation, the GPT-2 tokenizer can tokenize every text without the need for the <unk> symbol. GPT-2 has a vocabulary size of 50,257, which corresponds to the 256 byte-level base tokens, a special end-of-text token, and the symbols learned with 50,000 merges.
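The reason bytes make a complete base vocabulary can be checked directly: any string encodes to a sequence of UTF-8 bytes, and a byte can take only 256 values. The snippet below is a small illustration using Python's built-in encoding. It does not reproduce GPT-2's actual byte-to-symbol mapping, and the variable names are only for this example.

```python
text = "Don't you love 🤗 Transformers?"

# Any text becomes a sequence of integers in range(256).
byte_ids = list(text.encode("utf-8"))
print(byte_ids)

# A character outside ASCII, such as an emoji, simply spans several bytes,
# so it never needs an unknown-token fallback.
emoji_bytes = list("🤗".encode("utf-8"))
print(emoji_bytes)

# Vocabulary sizes quoted in the text: base symbols (+ special tokens) + merges.
gpt_vocab_size = 478 + 40_000
gpt2_vocab_size = 256 + 1 + 50_000
print(gpt_vocab_size, gpt2_vocab_size)
```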

WordPiece[[wordpiece]]

WordPiece is the subword tokenization algorithm used for BERT, DistilBERT, and Electra. The algorithm was outlined in Japanese and Korean Voice Search (Schuster et al., 2012) and is very similar to BPE. WordPiece first initializes the vocabulary to include every character present in the training data and then progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.

So what does this mean exactly? Referring to the previous example, maximizing the likelihood of the training data is equivalent to finding the symbol pair whose probability, divided by the product of the probabilities of its first symbol and its second symbol, is the greatest among all symbol pairs. For example, "u" followed by "g" would only have been merged if the probability of "ug" divided by the probabilities of "u" and "g" had been greater than for any other symbol pair. Intuitively, WordPiece is slightly different from BPE in that it evaluates what it loses by merging two symbols to ensure it is _worth it_.
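The difference between the two selection criteria can be seen on the toy corpus from the BPE section. The sketch below uses raw counts in place of probabilities, which is an assumption that only rescales every score by the same constant, and scores each pair as count(pair) / (count(first) * count(second)).

```python
from collections import Counter

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

symbol_counts = Counter()
pair_counts = Counter()
for word, freq in word_freqs.items():
    symbols = list(word)
    for s in symbols:
        symbol_counts[s] += freq
    for a, b in zip(symbols, symbols[1:]):
        pair_counts[(a, b)] += freq

# WordPiece-style score: pair frequency normalized by the frequency of its parts.
scores = {
    pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
    for pair, count in pair_counts.items()
}

bpe_choice = max(pair_counts, key=pair_counts.get)   # most frequent pair
wordpiece_choice = max(scores, key=scores.get)       # highest normalized score

print("BPE would merge:", bpe_choice)
print("WordPiece-style score would merge:", wordpiece_choice)
```

On this corpus the frequency criterion picks ("u", "g"), while the normalized score picks ("g", "s"): "u" is so common on its own that every pair containing it is penalized.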

Unigram[[unigram]]

Unigram is a subword tokenization algorithm introduced in Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Kudo, 2018). In contrast to BPE or WordPiece, Unigram initializes its base vocabulary to a large number of symbols and progressively trims down each symbol to obtain a smaller vocabulary. The base vocabulary could, for instance, correspond to all pre-tokenized words and the most common substrings. Unigram is not used directly for any of the models in transformers, but it is used in conjunction with SentencePiece.

At each training step, the Unigram algorithm defines a loss (often the negative log-likelihood) over the training data given the current vocabulary and a unigram language model. Then, for each symbol in the vocabulary, the algorithm computes how much the overall loss would increase if the symbol were removed from the vocabulary. Unigram then removes p percent (with p usually being 10% or 20%) of the symbols whose loss increase is the lowest, i.e. those symbols that least affect the overall loss over the training data. This process is repeated until the vocabulary has reached the desired size. The Unigram algorithm always keeps the base characters so that any word can be tokenized. Because Unigram is not based on merge rules (in contrast to BPE and WordPiece), the algorithm has several ways of tokenizing new text after training.

For instance, if a trained Unigram tokenizer exhibits the vocabulary:

["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"],

"hugs" could be tokenized in three ways: ["hug", "s"], ["h", "ug", "s"], or ["h", "u", "g", "s"].

So which one should be chosen? On top of saving the vocabulary, Unigram saves the probability of each token in the training corpus, so that the probability of each possible tokenization can be computed after training. In practice the algorithm simply picks the most likely tokenization, but it also offers the possibility to sample a tokenization according to its probability. These probabilities are defined by the loss the tokenizer is trained on.

Assuming that the training data consists of the words $x_{1}, \dots, x_{N}$ and that the set of all possible tokenizations for a word $x_{i}$ is defined as $S(x_{i})$, the overall loss is defined as:

$$\mathcal{L} = -\sum_{i=1}^{N} \log \left ( \sum_{x \in S(x_{i})} p(x) \right )$$
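The pieces of this definition can be made concrete with a short sketch. It enumerates S(x) for a word, scores each tokenization as the product of its token probabilities (the unigram assumption), picks the most likely one, and evaluates the loss. The token probabilities below are hypothetical values chosen only for illustration, not the result of any training.

```python
import math

# Hypothetical token probabilities over the example vocabulary (they sum to 1).
token_probs = {"b": 0.05, "g": 0.10, "h": 0.10, "n": 0.10, "p": 0.10,
               "s": 0.05, "u": 0.20, "ug": 0.10, "un": 0.10, "hug": 0.10}


def segmentations(word, vocab):
    """Return S(word): every way to split `word` into tokens from `vocab`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            results += [[piece] + rest for rest in segmentations(word[end:], vocab)]
    return results


def tokenization_prob(tokens, probs):
    """Unigram model: p(x) is the product of the probabilities of its tokens."""
    return math.prod(probs[t] for t in tokens)


def unigram_loss(words, probs):
    """L = -sum_i log(sum over x in S(x_i) of p(x)); every word must be tokenizable."""
    return -sum(
        math.log(sum(tokenization_prob(x, probs) for x in segmentations(w, probs)))
        for w in words
    )


candidates = segmentations("hugs", token_probs)
best = max(candidates, key=lambda x: tokenization_prob(x, token_probs))
loss = unigram_loss(["hug", "pug", "hugs"], token_probs)

for x in candidates:
    print(x, tokenization_prob(x, token_probs))
print("most likely:", best)
print("loss:", loss)
```

With these made-up probabilities, the tokenization with the fewest tokens wins because each extra token multiplies in another factor below 1.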

SentencePiece[[sentencepiece]]

All tokenization algorithms described so far have the same problem: they assume that the input text uses spaces to separate words. However, not all languages use spaces to separate words. One possible solution is to use language-specific pre-tokenizers; for example, XLM uses specific Chinese, Japanese, and Thai pre-tokenizers. To solve this problem more generally, SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo et al., 2018) treats the input as a raw input stream, thus including the space in the set of characters to use. It then uses the BPE or Unigram algorithm to construct the appropriate vocabulary.

The XLNetTokenizer uses SentencePiece, which is why the "▁" character appeared in the tokens of the earlier example. Decoding with SentencePiece is very easy, since all tokens can simply be concatenated and "▁" replaced by a space.
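The decoding rule described here fits in one line of Python. The sketch below works on the token list from the XLNet example. The helper names are hypothetical, and prepending "▁" to the text in `mark_spaces` is an assumption made to mirror the leading "▁Don" in that output.

```python
tokens = ["▁Don", "'", "t", "▁you", "▁love", "▁", "🤗", "▁", "Transform", "ers", "?",
          "▁We", "▁sure", "▁do", "."]


def decode(pieces):
    """Concatenate all pieces, then turn every "▁" back into a space."""
    return "".join(pieces).replace("▁", " ").lstrip()


def mark_spaces(text):
    """Treat whitespace as an ordinary symbol by rewriting it as "▁"."""
    return "▁" + text.replace(" ", "▁")


text = decode(tokens)
print(text)
print(mark_spaces(text) == "".join(tokens))
```

Because the space is kept as a regular symbol, no language-specific rule is needed to decide where the spaces go when reversing the tokenization.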

All transformers models in the library that use SentencePiece use it in combination with Unigram. Examples of models using SentencePiece are ALBERT, XLNet, Marian, and T5.