  • Embedding with fasttext
    Programming/NLP 2022. 9. 24. 16:14

     

     

    I had mostly been using word2vec for word embeddings,

    but I heard fasttext was decent, so I gave it a try.

     

    Word embedding is a way of turning unstructured text into numbers;

    text has to go through a word embedding step before it can be used for machine learning.

     

    word2vec struggles to embed rare words

    and has trouble handling out-of-vocabulary (OOV) words.

    fasttext embeds character n-grams rather than whole words,

    so each word is represented as the sum of its n-gram embeddings. It is reportedly fast and performs well.
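
    To make the n-gram idea concrete, here is a minimal sketch of how a word can be split into character n-grams. It assumes the `<` and `>` word-boundary markers and the 3 to 6 length range described for fastText's subword model; `char_ngrams` is a hypothetical helper written for illustration, not a function from the fasttext package.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", 3, 3))   # ['<wh', 'whe', 'her', 'ere', 're>']
print(char_ngrams("개발자", 3, 3))   # ['<개발', '개발자', '발자>']
```

    A word's vector is then composed from the vectors of these pieces, which is why two words that share many n-grams tend to end up close together.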

     

    It reportedly avoids the OOV problem as well, since a vector for an unseen word can still be built from its n-grams.
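
    The OOV behaviour can be sketched with a toy model. Everything below is an assumption made for illustration: the table is random rather than trained, the dimension and bucket count are tiny, and the hash is a plain FNV-1a that may differ in detail from what fastText does internally. The point is only that any string, seen or unseen, maps to n-gram rows and therefore gets a vector.

```python
import random

DIM = 4          # toy dimension (the pretrained vectors in this post use 300)
BUCKETS = 1000   # toy number of hash buckets, chosen arbitrarily

# Untrained, random n-gram table standing in for learned n-gram embeddings.
_rng = random.Random(0)
NGRAM_TABLE = [[_rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(BUCKETS)]


def fnv1a(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def char_ngrams(word, n=3):
    """Character n-grams of the word wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


def word_vector(word):
    """Average the table rows selected by hashing each n-gram of `word`."""
    rows = [NGRAM_TABLE[fnv1a(g) % BUCKETS] for g in char_ngrams(word)]
    if not rows:                      # empty input has no n-grams
        return [0.0] * DIM
    return [sum(col) / len(rows) for col in zip(*rows)]


# A made-up word that no vocabulary would contain still gets a vector.
print(word_vector("개발자스러운"))
```

    With a trained table, the rows for n-grams such as `개발자` would already carry meaning, so the made-up word would land near related real words instead of at a random point.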

     

     

     

    Installing fasttext

     

    I installed it on Linux (Ubuntu) as follows.

    root# git clone https://github.com/facebookresearch/fastText.git
    Cloning into 'fastText'...
    remote: Enumerating objects: 3930, done.
    remote: Counting objects: 100% (944/944), done.
    remote: Compressing objects: 100% (140/140), done.
    remote: Total 3930 (delta 854), reused 804 (delta 804), pack-reused 2986
    Receiving objects: 100% (3930/3930), 8.24 MiB | 9.50 MiB/s, done.
    Resolving deltas: 100% (2505/2505), done.
    Checking connectivity... done.
    root# cd fastText
    
    root#  pip install .
    Processing /data/nlp/fastText
    Requirement already satisfied: pybind11>=2.2 in /root/anaconda3/lib/python3.6/site-packages (from fasttext==0.9.2) (2.10.0)
    Requirement already satisfied: setuptools>=0.7.0 in /root/anaconda3/lib/python3.6/site-packages (from fasttext==0.9.2) (49.2.0.post20200714)
    Requirement already satisfied: numpy in /root/anaconda3/lib/python3.6/site-packages (from fasttext==0.9.2) (1.18.5)
    Building wheels for collected packages: fasttext
      Building wheel for fasttext (setup.py) ... done
      Created wheel for fasttext: filename=fasttext-0.9.2-cp36-cp36m-linux_x86_64.whl size=2761745 sha256=3b0c637596e035160ce1935e30d7d889b63db9d3985840f724682048beb69114
      Stored in directory: /tmp/pip-ephem-wheel-cache-lgobgcyd/wheels/b8/14/ec/33b3b096b9bfd23857e13594d51163f21f9238e6d3f5020081
    Successfully built fasttext
    Installing collected packages: fasttext
    Successfully installed fasttext-0.9.2

     

     

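    Next, I ran the setup Python script.

    There were a few minor warnings along the way, but the installation completed.

    Note that the fastText README presents this step as an alternative to `pip install .`, so running only one of the two should be enough.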

     

    root# python setup.py install
    running install
    running bdist_egg
    running egg_info
    creating python/fasttext_module/fasttext.egg-info
    writing python/fasttext_module/fasttext.egg-info/PKG-INFO
    writing dependency_links to python/fasttext_module/fasttext.egg-info/dependency_links.txt
    writing requirements to python/fasttext_module/fasttext.egg-info/requires.txt
    
    
    ...
    
    gcc -pthread -B /root/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/root/anaconda3/lib/python3.6/site-packages/pybind11/include -I/root/anaconda3/lib/python3.6/site-packages/pybind11/include -Isrc -I/root/anaconda3/include/python3.6m -c src/meter.cc -o build/temp.linux-x86_64-3.6/src/meter.o -DVERSION_INFO="0.9.2" -std=c++11 -fvisibility=hidden
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
    gcc -pthread -B /root/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/root/anaconda3/lib/python3.6/site-packages/pybind11/include -I/root/anaconda3/lib/python3.6/site-packages/pybind11/include -Isrc -I/root/anaconda3/include/python3.6m -c src/quantmatrix.cc -o build/temp.linux-x86_64-3.6/src/quantmatrix.o -DVERSION_INFO="0.9.2" -std=c++11 -fvisibility=hidden
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
    gcc -pthread -B /root/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/root/anaconda3/lib/python3.6/site-packages/pybind11/include -I/root/anaconda3/lib/python3.6/site-packages/pybind11/include -Isrc -I/root/anaconda3/include/python3.6m -c src/densematrix.cc -o build/temp.linux-x86_64-3.6/src/densematrix.o -DVERSION_INFO="0.9.2" -std=c++11 -fvisibility=hidden
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
    src/densematrix.cc: In member function ‘void fasttext::DenseMatrix::uniform(fasttext::real, unsigned int, int32_t)’:
    src/densematrix.cc:48:25: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
         for (int i = 0; i < thread; i++) {
                             ^
    src/densematrix.cc:51:42: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
         for (int32_t i = 0; i < threads.size(); i++) {
                                              ^
    gcc -pthread -B /root/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/root/anaconda3/lib/python3.6/site-packages/pybind11/include -I/root/anaconda3/lib/python3.6/site-packages/pybind11/include -Isrc -I/root/anaconda3/include/python3.6m -c src/fasttext.cc -o build/temp.linux-x86_64-3.6/src/fasttext.o -DVERSION_INFO="0.9.2" -std=c++11 -fvisibility=hidden
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
    src/fasttext.cc: In member function ‘void fasttext::FastText::getWordVector(fasttext::Vector&, const string&) const’:
    src/fasttext.cc:114:35: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
       for (int i = 0; i < ngrams.size(); i++) {
                                       ^
    src/fasttext.cc: In lambda function:
    src/fasttext.cc:314:15: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
         if (i1 == eosid && i2 == eosid) { // satisfy strict weak ordering
                   ^
    src/fasttext.cc:314:30: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
         if (i1 == eosid && i2 == eosid) { // satisfy strict weak ordering
                                  ^
    src/fasttext.cc:317:21: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
         return eosid == i1 || (eosid != i2 && norms[i1] > norms[i2]);
                         ^
    src/fasttext.cc:317:37: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
         return eosid == i1 || (eosid != i2 && norms[i1] > norms[i2]);
                                         ^
    
    ...
    
    Using /root/anaconda3/lib/python3.6/site-packages
    Finished processing dependencies for fasttext==0.9.2
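
    Before moving on, it can be worth confirming that the module is importable from the Python environment you intend to use. The check below is a small sketch that relies only on the standard library; `is_installed` is a hypothetical helper name.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("fasttext importable:", is_installed("fasttext"))
```

    If this prints False after a successful install, the package most likely went into a different Python environment than the one running the script.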

     

    I downloaded a pretrained model and tried it right away.

     

     

    root# python
    Python 3.6.9 |Anaconda custom (64-bit)| (default, Jul 30 2019, 19:07:31)
    [GCC 7.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import fasttext
    >>> import fasttext.util
    >>> fasttext.util.download_model('ko', if_exists='ignore')
    Downloading https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ko.300.bin.gz
     (100.00%) [==================================================>]
    'cc.ko.300.bin'

     

    The model is large, so downloading and loading it takes a while.
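
    Because loading is slow, in a script it helps to fail early when the file is missing rather than after a long wait. The following is a minimal sketch, assuming the model was saved as `cc.ko.300.bin` in the working directory as in the download output above; `load_vectors` is a hypothetical wrapper around `fasttext.load_model`.

```python
import os


def load_vectors(path="cc.ko.300.bin"):
    """Load a fastText binary model, failing early if the file is absent."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"{path} not found; download the model first")
    # Imported lazily so the path check works even where fasttext is not installed.
    import fasttext
    return fasttext.load_model(path)


if __name__ == "__main__":
    try:
        ft = load_vectors()
        print("dimension:", ft.get_dimension())
    except FileNotFoundError as err:
        print(err)
```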

     

    >>> ft = fasttext.load_model('cc.ko.300.bin')
    >>> ft.get_word_vector('개발자')
    array([-2.57197209e-02, -8.81483927e-02, -1.43006742e-01,  5.37867192e-03,
            7.99073949e-02,  3.03586433e-03, -4.72261757e-03,  7.85517991e-02,
            1.42725622e-02,  3.50356586e-02, -9.85031798e-02,  4.69851159e-02,
            8.93128514e-02,  4.01562750e-02, -4.79679815e-02, -7.69715309e-02,
            8.23637471e-02,  5.94926625e-02,  6.97196973e-03,  2.36541703e-02,
           -6.17493242e-02,  1.23428432e-02,  2.14319136e-02,  2.22918461e-04,
            1.00414388e-01,  3.23309362e-01,  1.09359816e-01, -5.75466938e-02,
            6.19178601e-02, -1.58619303e-02,  5.03552854e-02, -1.07371800e-01,
           -1.66793153e-01, -6.43139407e-02,  4.10437584e-03, -6.08866140e-02,
           -3.24534848e-02,  1.17764667e-01, -4.44733202e-02, -5.03154472e-02,
            4.62962985e-02,  3.56913060e-02, -5.98100722e-02,  1.04874842e-01,
           -2.96544703e-03, -4.80531305e-02,  3.01173478e-02, -3.38419876e-03,
    ...
            5.89302741e-03,  2.14233011e-01, -1.72969196e-02, -1.04244053e-01,
           -4.28575650e-02, -6.91804290e-02,  6.36401698e-02,  9.46133211e-03,
           -1.07174866e-01,  2.16644872e-02, -3.09087196e-03,  5.51189408e-02],
          dtype=float32)

     

    The default embedding dimension is 300; if that is too large, it can be reduced as shown below.

    I reduced it to 30 dimensions.

     

    >>> fasttext.util.reduce_model(ft, 30)
    <fasttext.FastText._FastText object at 0x7f7c0e6ac048>
    >>> ft.get_word_vector('개발자')
    array([-0.3013211 , -0.10236214,  0.02229641, -0.1734694 ,  0.2559858 ,
           -0.01913882,  0.12448859,  0.18616903,  0.0030365 ,  0.05462615,
           -0.18917276,  0.00628146,  0.07219624,  0.08972599, -0.14282568,
           -0.1585866 , -0.0122357 , -0.17144252, -0.01291392,  0.00192474,
            0.00464121,  0.27922162, -0.03845894, -0.09139539,  0.20179468,
            0.0161447 ,  0.11253469, -0.0628916 , -0.02416351, -0.1148093 ],
          dtype=float32)
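
    A rough back-of-the-envelope calculation shows why a smaller dimension can matter. The sketch assumes float32 values (4 bytes each, matching the `dtype=float32` in the outputs above) and a round, hypothetical row count of 2,000,000 vectors; the real model's size depends on its actual vocabulary and n-gram bucket counts, which are not stated in this post.

```python
BYTES_PER_FLOAT32 = 4


def matrix_bytes(rows, dim):
    """Size in bytes of a dense float32 matrix with `rows` vectors of length `dim`."""
    return rows * dim * BYTES_PER_FLOAT32


ROWS = 2_000_000  # hypothetical number of stored vectors, for illustration only

for dim in (300, 30):
    size_gb = matrix_bytes(ROWS, dim) / 1e9
    print(f"dim={dim}: {size_gb:.2f} GB")
```

    Under these assumptions the matrix shrinks by a factor of ten. The docstring of `fasttext.util.reduce_model` describes the reduction as PCA-based, so some information is lost along the way; the 30-dimensional values above are new coordinates rather than a subset of the original 300.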

     

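    You can also print the words closest to a given word, as shown below.

    The queries are '개발자' (developer), '안드로이드 개발자' (Android developer), 'CTO', and '리더급' (leader-level).

    Because the model is based on the Distributional Hypothesis, words that mean different things

    can still end up with similar vectors if they are grammatically close

    or frequently appear in similar contexts.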

     

    >>> ft.get_nearest_neighbors('개발자')
    [(0.866005003452301, '디바이스'), (0.8550711870193481, '마케터'), (0.851106584072113, '기획자'),
    (0.8360496759414673, '프로그래머'), (0.8343381285667419, '어플리케이션'), 
    (0.8333640098571777, '개발자입니다'), (0.8330981731414795, '개발자들'), 
    (0.8309714198112488, '보안전문가'), (0.8266977667808533, 'IT회사'), 
    (0.824160099029541, '개발자나')]
    
    >>> ft.get_nearest_neighbors('안드로이드 개발자')
    [(0.8947457671165466, 'Wikimedia-copyrightwarning'), (0.8938066959381104, '일도양단하는'), 
    (0.8927499651908875, '223.62.160.150'), (0.8902317881584167, '익스플로러버전을'), 
    (0.8891198635101318, '281604'), (0.8890791535377502, 'FC5403'), 
    (0.8879044055938721, '좀여.거주국가에서는'), (0.8853301405906677, 'Compulsory정말'), 
    (0.8852595686912537, 'adidasred'), (0.885100245475769, '사다리도박여자배구중계')]
    
    >>> ft.get_nearest_neighbors('CTO')
    [(0.9098860025405884, 'CIO'), (0.9076651334762573, 'CFO'), (0.9022724032402039, 'COO'), 
    (0.8753493428230286, 'CCO'), (0.8676007986068726, 'CSO'), (0.8645197153091431, 'CTO인'), 
    (0.8573099970817566, '개발팀'), (0.8504807949066162, '총괄부사장'), (0.8462954163551331, 'CMO'),
    (0.8442021012306213, '최고기술경영자')]
    
    >>> ft.get_nearest_neighbors('리더급')
    [(0.8055601119995117, '학계인사'), (0.7947335243225098, '기업대표'),
    (0.7912939190864563, '역량있는'), (0.7893251180648804, '부사장급'), 
    (0.7860082983970642, '인력인'), (0.7859938740730286, '상사맨'), 
    (0.7817611694335938, '금융전문가'), (0.7785653471946716, '실무책임자'), 
    (0.7783417701721191, '간부진'), (0.7745304703712463, '석사급')]
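
    The scores in these lists behave like cosine similarities between word vectors. Here is a minimal sketch of that ranking, using a tiny hand-made vector table; the words and numbers in `TOY_VECTORS` are invented for illustration and `nearest_neighbors` is a hypothetical helper, not fastText's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented 3-dimensional vectors, for illustration only.
TOY_VECTORS = {
    "developer":  [0.9, 0.1, 0.0],
    "programmer": [0.8, 0.2, 0.1],
    "marketer":   [0.5, 0.5, 0.2],
    "banana":     [0.0, 0.1, 0.9],
}


def nearest_neighbors(query, vectors, k=2):
    """Return the k (score, word) pairs most similar to `query`, best first."""
    q = vectors[query]
    scored = [(cosine(q, vec), word) for word, vec in vectors.items() if word != query]
    scored.sort(reverse=True)
    return scored[:k]


for score, word in nearest_neighbors("developer", TOY_VECTORS):
    print(f"{score:.3f} {word}")
```

    This may also help explain the noisy neighbors for '안드로이드 개발자': the query contains a space, so it is presumably treated as a single unseen token and scored purely from its n-grams. That is a guess, not something checked here.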

     

     

     

     

    References: https://fasttext.cc/docs/en/support.html

    https://fasttext.cc/docs/en/crawl-vectors.html

    https://inspiringpeople.github.io/data%20analysis/word_embedding/
