Embedding with fasttext
Programming/NLP 2022. 9. 24. 16:14
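I had mostly been using word2vec for word embeddings,
but I heard fasttext was good, so I gave it a try.
Word embedding is a way of turning unstructured text into numbers;
text has to go through a word embedding step before it can be used for machine learning.
word2vec has trouble embedding rare words
and handling out-of-vocabulary (OOV) words.
fasttext embeds character n-grams rather than whole words,
and each word is represented as the sum of its embedded n-grams, which is said to make it fast and accurate.
Because an unseen word can still be built from its n-grams, the OOV problem is said to go away as well.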
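The n-gram idea above can be sketched in a few lines. This is only an illustration, not fasttext's own code: the `<` and `>` boundary markers follow the scheme described for fastText, while the default lengths (`minn=3`, `maxn=6`), the helper names, and the hash-based "embedding" are assumptions made up for this example. Real n-gram vectors are learned during training, and this sketch averages them where the text above speaks of a sum.

```python
import hashlib


def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' markers."""
    token = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams


def toy_ngram_vector(gram, dim=4):
    """Hypothetical stand-in for a learned n-gram embedding (deterministic hash)."""
    digest = hashlib.md5(gram.encode("utf-8")).digest()
    return [(b - 127.5) / 127.5 for b in digest[:dim]]


def toy_word_vector(word, dim=4):
    """Build a word vector by averaging the vectors of its n-grams."""
    grams = char_ngrams(word)
    if not grams:
        return [0.0] * dim
    total = [0.0] * dim
    for gram in grams:
        for k, value in enumerate(toy_ngram_vector(gram, dim)):
            total[k] += value
    return [t / len(grams) for t in total]


print(char_ngrams("개발자", 3, 3))   # ['<개발', '개발자', '발자>']
print(toy_word_vector("개발자나"))    # a vector even for a word never seen before
```

Because the vector is assembled from sub-word pieces, any string produces some vector, which is why an OOV word does not fail the way it does with a plain word-level lookup table.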
Installing fasttext
On Linux (Ubuntu), I installed it as follows.
root# git clone https://github.com/facebookresearch/fastText.git
Cloning into 'fastText'...
remote: Enumerating objects: 3930, done.
remote: Counting objects: 100% (944/944), done.
remote: Compressing objects: 100% (140/140), done.
remote: Total 3930 (delta 854), reused 804 (delta 804), pack-reused 2986
Receiving objects: 100% (3930/3930), 8.24 MiB | 9.50 MiB/s, done.
Resolving deltas: 100% (2505/2505), done.
Checking connectivity... done.
root# cd fastText
root# pip install .
Processing /data/nlp/fastText
Requirement already satisfied: pybind11>=2.2 in /root/anaconda3/lib/python3.6/site-packages (from fasttext==0.9.2) (2.10.0)
Requirement already satisfied: setuptools>=0.7.0 in /root/anaconda3/lib/python3.6/site-packages (from fasttext==0.9.2) (49.2.0.post20200714)
Requirement already satisfied: numpy in /root/anaconda3/lib/python3.6/site-packages (from fasttext==0.9.2) (1.18.5)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp36-cp36m-linux_x86_64.whl size=2761745 sha256=3b0c637596e035160ce1935e30d7d889b63db9d3985840f724682048beb69114
  Stored in directory: /tmp/pip-ephem-wheel-cache-lgobgcyd/wheels/b8/14/ec/33b3b096b9bfd23857e13594d51163f21f9238e6d3f5020081
Successfully built fasttext
Installing collected packages: fasttext
Successfully installed fasttext-0.9.2
Next, I ran the setup Python script (setup.py).
There were a few minor warnings along the way, but the installation succeeded.
root# python setup.py install
running install
running bdist_egg
running egg_info
creating python/fasttext_module/fasttext.egg-info
writing python/fasttext_module/fasttext.egg-info/PKG-INFO
writing dependency_links to python/fasttext_module/fasttext.egg-info/dependency_links.txt
writing requirements to python/fasttext_module/fasttext.egg-info/requires.txt
...
gcc -pthread -B /root/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/root/anaconda3/lib/python3.6/site-packages/pybind11/include -I/root/anaconda3/lib/python3.6/site-packages/pybind11/include -Isrc -I/root/anaconda3/include/python3.6m -c src/meter.cc -o build/temp.linux-x86_64-3.6/src/meter.o -DVERSION_INFO="0.9.2" -std=c++11 -fvisibility=hidden
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -B /root/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/root/anaconda3/lib/python3.6/site-packages/pybind11/include -I/root/anaconda3/lib/python3.6/site-packages/pybind11/include -Isrc -I/root/anaconda3/include/python3.6m -c src/quantmatrix.cc -o build/temp.linux-x86_64-3.6/src/quantmatrix.o -DVERSION_INFO="0.9.2" -std=c++11 -fvisibility=hidden
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -B /root/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/root/anaconda3/lib/python3.6/site-packages/pybind11/include -I/root/anaconda3/lib/python3.6/site-packages/pybind11/include -Isrc -I/root/anaconda3/include/python3.6m -c src/densematrix.cc -o build/temp.linux-x86_64-3.6/src/densematrix.o -DVERSION_INFO="0.9.2" -std=c++11 -fvisibility=hidden
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
src/densematrix.cc: In member function ‘void fasttext::DenseMatrix::uniform(fasttext::real, unsigned int, int32_t)’:
src/densematrix.cc:48:25: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (int i = 0; i < thread; i++) {
                     ^
src/densematrix.cc:51:42: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (int32_t i = 0; i < threads.size(); i++) {
                         ^
gcc -pthread -B /root/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/root/anaconda3/lib/python3.6/site-packages/pybind11/include -I/root/anaconda3/lib/python3.6/site-packages/pybind11/include -Isrc -I/root/anaconda3/include/python3.6m -c src/fasttext.cc -o build/temp.linux-x86_64-3.6/src/fasttext.o -DVERSION_INFO="0.9.2" -std=c++11 -fvisibility=hidden
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
src/fasttext.cc: In member function ‘void fasttext::FastText::getWordVector(fasttext::Vector&, const string&) const’:
src/fasttext.cc:114:35: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   for (int i = 0; i < ngrams.size(); i++) {
                     ^
src/fasttext.cc: In lambda function:
src/fasttext.cc:314:15: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   if (i1 == eosid && i2 == eosid) { // satisfy strict weak ordering
          ^
src/fasttext.cc:314:30: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   if (i1 == eosid && i2 == eosid) { // satisfy strict weak ordering
                         ^
src/fasttext.cc:317:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   return eosid == i1 || (eosid != i2 && norms[i1] > norms[i2]);
                ^
src/fasttext.cc:317:37: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
   return eosid == i1 || (eosid != i2 && norms[i1] > norms[i2]);
                                ^
...
Using /root/anaconda3/lib/python3.6/site-packages
Finished processing dependencies for fasttext==0.9.2
I downloaded a pretrained model and tried it out right away.
root# python
Python 3.6.9 |Anaconda custom (64-bit)| (default, Jul 30 2019, 19:07:31)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fasttext
>>> import fasttext.util
>>> fasttext.util.download_model('ko', if_exists='ignore')
Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ko.300.bin.gz
 (100.00%) [==================================================>]
'cc.ko.300.bin'
The model is large, so downloading and loading it takes a while.
>>> ft = fasttext.load_model('cc.ko.300.bin')  # load the downloaded model before querying it
>>> ft.get_word_vector('개발자')
array([-2.57197209e-02, -8.81483927e-02, -1.43006742e-01,  5.37867192e-03,
        7.99073949e-02,  3.03586433e-03, -4.72261757e-03,  7.85517991e-02,
        1.42725622e-02,  3.50356586e-02, -9.85031798e-02,  4.69851159e-02,
        8.93128514e-02,  4.01562750e-02, -4.79679815e-02, -7.69715309e-02,
        8.23637471e-02,  5.94926625e-02,  6.97196973e-03,  2.36541703e-02,
       -6.17493242e-02,  1.23428432e-02,  2.14319136e-02,  2.22918461e-04,
        1.00414388e-01,  3.23309362e-01,  1.09359816e-01, -5.75466938e-02,
        6.19178601e-02, -1.58619303e-02,  5.03552854e-02, -1.07371800e-01,
       -1.66793153e-01, -6.43139407e-02,  4.10437584e-03, -6.08866140e-02,
       -3.24534848e-02,  1.17764667e-01, -4.44733202e-02, -5.03154472e-02,
        4.62962985e-02,  3.56913060e-02, -5.98100722e-02,  1.04874842e-01,
       -2.96544703e-03, -4.80531305e-02,  3.01173478e-02, -3.38419876e-03,
       ...
        5.89302741e-03,  2.14233011e-01, -1.72969196e-02, -1.04244053e-01,
       -4.28575650e-02, -6.91804290e-02,  6.36401698e-02,  9.46133211e-03,
       -1.07174866e-01,  2.16644872e-02, -3.09087196e-03,  5.51189408e-02],
      dtype=float32)
The default embedding dimension is 300; if that is too large, it can be reduced as shown below.
I reduced it to 30 dimensions.
>>> fasttext.util.reduce_model(ft, 30)
<fasttext.FastText._FastText object at 0x7f7c0e6ac048>
>>> ft.get_word_vector('개발자')
array([-0.3013211 , -0.10236214,  0.02229641, -0.1734694 ,  0.2559858 ,
       -0.01913882,  0.12448859,  0.18616903,  0.0030365 ,  0.05462615,
       -0.18917276,  0.00628146,  0.07219624,  0.08972599, -0.14282568,
       -0.1585866 , -0.0122357 , -0.17144252, -0.01291392,  0.00192474,
        0.00464121,  0.27922162, -0.03845894, -0.09139539,  0.20179468,
        0.0161447 ,  0.11253469, -0.0628916 , -0.02416351, -0.1148093 ],
      dtype=float32)
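As far as I can tell, `reduce_model` shrinks the vectors with a PCA-style projection. The sketch below shows that general idea on random data with NumPy. It is not fasttext's implementation, and `pca_reduce` and `toy_vectors` are hypothetical names used only for this example.

```python
import numpy as np


def pca_reduce(matrix, target_dim):
    """Project the rows of `matrix` onto its top `target_dim` principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Covariance between the original dimensions (shape: dim x dim).
    cov = centered.T @ centered / (matrix.shape[0] - 1)
    # eigh returns eigenvalues in ascending order, so pick the largest ones.
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:target_dim]]
    return centered @ top


# Random stand-in for an embedding matrix: 1000 "words", 300 dimensions each.
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(1000, 300)).astype(np.float32)

reduced = pca_reduce(toy_vectors, 30)
print(reduced.shape)  # (1000, 30)
```

The reduced vectors keep the directions with the most variance, so the values for a given word change completely (as in the session above), while relative distances between words are preserved only approximately.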
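You can also print the words closest to a given word, as shown below.
Because the model is based on the distributional hypothesis, words with different meanings
can still end up with similar vectors
if they are grammatically close or frequently appear in similar contexts.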
>>> ft.get_nearest_neighbors('개발자')
[(0.866005003452301, '디바이스'),
 (0.8550711870193481, '마케터'),
 (0.851106584072113, '기획자'),
 (0.8360496759414673, '프로그래머'),
 (0.8343381285667419, '어플리케이션'),
 (0.8333640098571777, '개발자입니다'),
 (0.8330981731414795, '개발자들'),
 (0.8309714198112488, '보안전문가'),
 (0.8266977667808533, 'IT회사'),
 (0.824160099029541, '개발자나')]
>>> ft.get_nearest_neighbors('안드로이드 개발자')
[(0.8947457671165466, 'Wikimedia-copyrightwarning'),
 (0.8938066959381104, '일도양단하는'),
 (0.8927499651908875, '223.62.160.150'),
 (0.8902317881584167, '익스플로러버전을'),
 (0.8891198635101318, '281604'),
 (0.8890791535377502, 'FC5403'),
 (0.8879044055938721, '좀여.거주국가에서는'),
 (0.8853301405906677, 'Compulsory정말'),
 (0.8852595686912537, 'adidasred'),
 (0.885100245475769, '사다리도박여자배구중계')]
>>> ft.get_nearest_neighbors('CTO')
[(0.9098860025405884, 'CIO'),
 (0.9076651334762573, 'CFO'),
 (0.9022724032402039, 'COO'),
 (0.8753493428230286, 'CCO'),
 (0.8676007986068726, 'CSO'),
 (0.8645197153091431, 'CTO인'),
 (0.8573099970817566, '개발팀'),
 (0.8504807949066162, '총괄부사장'),
 (0.8462954163551331, 'CMO'),
 (0.8442021012306213, '최고기술경영자')]
>>> ft.get_nearest_neighbors('리더급')
[(0.8055601119995117, '학계인사'),
 (0.7947335243225098, '기업대표'),
 (0.7912939190864563, '역량있는'),
 (0.7893251180648804, '부사장급'),
 (0.7860082983970642, '인력인'),
 (0.7859938740730286, '상사맨'),
 (0.7817611694335938, '금융전문가'),
 (0.7783417701721191, '간부진'),
 (0.7745304703712463, '석사급')]
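The scores in the lists above rank words by how close their vectors are. A minimal sketch of that kind of lookup, assuming cosine similarity as the score, looks like this. The three-dimensional vectors are made up for illustration and are not taken from cc.ko.300.bin, and `nearest_neighbors` here is a hypothetical helper, not the fasttext method.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, vectors, k=3):
    """Return up to k (score, word) pairs, most similar first, excluding the query."""
    query_vec = vectors[query]
    scored = [
        (cosine_similarity(query_vec, vec), word)
        for word, vec in vectors.items()
        if word != query
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical toy vectors, chosen so that the job titles point the same way.
toy_vectors = {
    "CTO": [0.9, 0.1, 0.0],
    "CIO": [0.8, 0.2, 0.1],
    "CFO": [0.7, 0.3, 0.0],
    "apple": [0.0, 0.1, 0.9],
}

neighbors = nearest_neighbors("CTO", toy_vectors, k=2)
print([word for _, word in neighbors])  # ['CIO', 'CFO']
```

Words that occur in similar contexts get vectors pointing in similar directions, which is why 'CIO' and 'CFO' rank highest for 'CTO' in the real output too.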
References: https://fasttext.cc/docs/en/support.html
https://fasttext.cc/docs/en/crawl-vectors.html
https://inspiringpeople.github.io/data%20analysis/word_embedding/