Building an auto-tagging feature - 10: Running it

Running it

All the pieces are in place now, so let's wire everything together.

First, collect word-frequency data from the Wikipedia text data and dump it.

$ python -m gen_tags --verbose --dicdir /usr/local/lib/mecab/dic/mecab-ipadic-neologd --document-format text --dump ./wikipedia.pickles /var/wikipedia/ja/wiki_*
23:04:45 processing    0/2962(  0.00%): /var/wikipedia/ja/wiki_aaa
23:04:50 processing    1/2962(  0.03%): /var/wikipedia/ja/wiki_aab
23:04:58 processing    2/2962(  0.07%): /var/wikipedia/ja/wiki_aac
23:05:01 processing    3/2962(  0.10%): /var/wikipedia/ja/wiki_aad
23:05:03 processing    4/2962(  0.14%): /var/wikipedia/ja/wiki_aae
:
:
23:54:32 processing 2957/2962( 99.83%): /var/wikipedia/ja/wiki_ejt
23:54:33 processing 2958/2962( 99.86%): /var/wikipedia/ja/wiki_eju
23:54:33 processing 2959/2962( 99.90%): /var/wikipedia/ja/wiki_ejv
23:54:35 processing 2960/2962( 99.93%): /var/wikipedia/ja/wiki_ejw
23:54:35 processing 2961/2962( 99.97%): /var/wikipedia/ja/wiki_ejx

It finished in about 50 minutes.

Next, use this data to find keywords for the posts on this blog.

$ python -m gen_tags --dicdir /usr/local/lib/mecab/dic/mecab-ipadic-neologd --restore ./wikipedia.pickles --show-keywords ../hugo/content/**/*.md
Traceback (most recent call last):
  File "/Users/yamada/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/yamada/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/yamada/go/src/github.com/kyoh86/gen-tags/gen_tags/__main__.py", line 74, in <module>
    main()
  File "/Users/yamada/go/src/github.com/kyoh86/gen-tags/gen_tags/__main__.py", line 69, in main
    restore=args.restore,
  File "/Users/yamada/go/src/github.com/kyoh86/gen-tags/gen_tags/core.py", line 56, in run
    for path, words in ranker.weighted_keywords().items():
  File "/Users/yamada/go/src/github.com/kyoh86/gen-tags/gen_tags/keyword_ranker.py", line 92, in weighted_keywords
    (self.stored_doc_count + doc_count) / (len(freq) + self.stored_word_freq[word]))
KeyError: Word(word='kyoh', feature_id='45', weight=0)

That raised an error. The lookup into stored_word_freq fails for a word (kyoh) that does not exist on the Wikipedia side. That is only to be expected, so I replaced [word] with .get(word, 0).
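A minimal sketch of that fix, assuming the failing expression is an IDF-style ratio passed to a logarithm. Only stored_word_freq, stored_doc_count, and the fields of Word come from the traceback; the function name, its parameters, and the use of math.log are assumptions made for illustration.

```python
import math
from collections import namedtuple

# Mirrors the tuple shown in the KeyError above.
Word = namedtuple("Word", ["word", "feature_id", "weight"])


def inverse_document_frequency(word, local_doc_freq, local_doc_count,
                               stored_word_freq, stored_doc_count):
    """IDF over the stored (Wikipedia) corpus plus the local documents.

    Hypothetical helper: local_doc_freq is the number of local documents
    containing the word, and is at least 1 because the word was found
    in a local document.
    """
    total_docs = stored_doc_count + local_doc_count
    # .get(word, 0) treats a word missing from the stored data as
    # "seen in zero stored documents" instead of raising KeyError.
    docs_with_word = local_doc_freq + stored_word_freq.get(word, 0)
    return math.log(total_docs / docs_with_word)


stored = {Word("go", "41", 0): 9}
unknown = Word("kyoh", "45", 0)
print(inverse_document_frequency(unknown, 1, 10, stored, 90))
```

A word that only appears in the blog ends up with a high IDF, which matches the intuition that it is distinctive for this blog.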

$ python -m gen_tags --dicdir /usr/local/lib/mecab/dic/mecab-ipadic-neologd --restore ./wikipedia.pickles --show-keywords ../hugo/content/**/*.md
../hugo/content/posts/2015-12-18_17-55-22.md, gopkg, 45,    0.91
../hugo/content/posts/2015-12-18_17-55-22.md, foo, 42,    0.30
../hugo/content/posts/2015-12-18_17-55-22.md, bravo, 41,    0.30
../hugo/content/posts/2015-12-18_17-55-22.md, bar, 45,    0.25
../hugo/content/posts/2015-12-18_17-55-22.md, goimport, 45,    0.23
../hugo/content/posts/2015-12-18_17-55-22.md, xXXXX, 38,    0.23
../hugo/content/posts/2015-12-18_17-55-22.md, hoge, 41,    0.22
../hugo/content/posts/2015-12-18_17-55-22.md, ./..., 4,    0.21
../hugo/content/posts/2015-12-18_17-55-22.md, github, 41,    0.20
../hugo/content/posts/2015-12-18_17-55-22.md, yo., 41,    0.17
../hugo/content/posts/2016-02-01_11-05-36.md, string, 45,    0.65
../hugo/content/posts/2016-02-01_11-05-36.md, FuncMap, 38,    0.45
../hugo/content/posts/2016-02-01_11-05-36.md, Ellipsis, 45,    0.41
../hugo/content/posts/2016-02-01_11-05-36.md, params, 45,    0.41
../hugo/content/posts/2016-02-01_11-05-36.md, 放り込め, 31,    0.41
../hugo/content/posts/2016-02-01_11-05-36.md, func, 45,    0.35
../hugo/content/posts/2016-02-01_11-05-36.md, len, 42,    0.30
../hugo/content/posts/2016-02-01_11-05-36.md, template, 45,    0.28
../hugo/content/posts/2016-02-01_11-05-36.md, 鬼門, 38,    0.26
../hugo/content/posts/2016-02-01_11-05-36.md, string, 38,    0.25
: (snip)
../hugo/content/posts/2016-10-06_08-48-48.md, interface, 41,    0.51
../hugo/content/posts/2016-10-06_08-48-48.md, Impl, 38,    0.37
../hugo/content/posts/2016-10-06_08-48-48.md, スタブ, 38,    0.27
../hugo/content/posts/2016-10-06_08-48-48.md, impl, 45,    0.25
../hugo/content/posts/2016-10-06_08-48-48.md, ctrl, 38,    0.24
../hugo/content/posts/2016-10-06_08-48-48.md, キーバインド, 41,    0.19
../hugo/content/posts/2016-10-06_08-48-48.md, スニペット, 41,    0.18
../hugo/content/posts/2016-10-06_08-48-48.md, gif, 41,    0.18
../hugo/content/posts/2016-10-06_08-48-48.md, Go, 41,    0.11
../hugo/content/posts/2016-10-06_08-48-48.md, go, 45,    0.11
../hugo/content/posts/2016-12-12_00-52-12.md, dotfiles, 45,    0.36
../hugo/content/posts/2016-12-12_00-52-12.md, kyoh, 45,    0.24
../hugo/content/posts/2016-12-12_00-52-12.md, fzf, 45,    0.19
../hugo/content/posts/2016-12-12_00-52-12.md, github, 41,    0.19
../hugo/content/posts/2016-12-12_00-52-12.md, fzf, 38,    0.17
../hugo/content/posts/2016-12-12_00-52-12.md, blob, 41,    0.16
../hugo/content/posts/2016-12-12_00-52-12.md, install, 45,    0.13
../hugo/content/posts/2016-12-12_00-52-12.md, launchctl, 45,    0.13
../hugo/content/posts/2016-12-12_00-52-12.md, launchctl, 38,    0.13
../hugo/content/posts/2016-12-12_00-52-12.md, peco, 42,    0.11
: (snip)
../hugo/content/posts/2016-02-08_14-34-13.md, brew, 45,    0.64
../hugo/content/posts/2016-02-08_14-34-13.md, outdated, 38,    0.58
../hugo/content/posts/2016-02-08_14-34-13.md, upgrade, 41,    0.42
../hugo/content/posts/2016-02-08_14-34-13.md, xargs, 38,    0.26
../hugo/content/posts/2016-02-08_14-34-13.md, Elasticsearch, 41,    0.24
../hugo/content/posts/2016-02-08_14-34-13.md, fzf, 45,    0.24
../hugo/content/posts/2016-02-08_14-34-13.md, まァ, 2,    0.24
../hugo/content/posts/2016-02-08_14-34-13.md, peco, 42,    0.24
../hugo/content/posts/2016-02-08_14-34-13.md, homebrew, 41,    0.23
../hugo/content/posts/2016-02-08_14-34-13.md, Mongo, 45,    0.21
../hugo/content/posts/2016-02-09_01-21-41.md, git, 41,    0.22
../hugo/content/posts/2016-02-09_01-21-41.md, linter, 38,    0.20
../hugo/content/posts/2016-02-09_01-21-41.md, lint, 38,    0.17
../hugo/content/posts/2016-02-09_01-21-41.md, branch, 45,    0.15
../hugo/content/posts/2016-02-09_01-21-41.md, Request, 45,    0.15
../hugo/content/posts/2016-02-09_01-21-41.md, Pull, 38,    0.14
../hugo/content/posts/2016-02-09_01-21-41.md, checkout, 38,    0.14
../hugo/content/posts/2016-02-09_01-21-41.md, プルリクエスト, 38,    0.12
../hugo/content/posts/2016-02-09_01-21-41.md, ですが, 26,    0.10
../hugo/content/posts/2016-02-09_01-21-41.md, new, 38,    0.09
../hugo/content/posts/2016-05-09_22-33-33.md, Qiita, 41,    4.36
../hugo/content/posts/2016-05-09_22-33-33.md, なんで, 34,    0.86
../hugo/content/posts/2016-05-09_22-33-33.md, 愛, 38,    0.05
../hugo/content/posts/2016-05-09_22-33-33.md, 参考, 36,    0.03
../hugo/content/posts/2016-05-09_22-33-33.md, 生まれる, 31,    0.01
../hugo/content/posts/2016-05-09_22-33-33.md, 生まれ, 31,    0.00
../hugo/content/posts/2016-05-09_22-33-33.md, -, 4,    0.00
../hugo/content/posts/2016-05-09_22-33-33.md, の, 63,    0.00
../hugo/content/posts/2016-05-09_22-33-33.md, か, 22,    0.00
../hugo/content/posts/2016-05-09_22-33-33.md, が, 13,    0.00
: (rest omitted)

It did produce output, but the accuracy looks pretty poor.

Identifying the problems

  • Keywords that are not nouns do not seem worth using as tags
  • Inline code gets in the way

The first should be manageable by filtering on part of speech at the final keyword-selection step. What to do about inline code is a real headache...
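As a rough sketch of both ideas, assuming feature_id is the dictionary's part-of-speech ID. The set of noun IDs below is an assumption based on the IDs seen in the output above and should be checked against the dictionary's pos-id.def. Stripping single-backtick spans is just one possible way to handle inline code, not a settled decision. All names other than the Word fields are hypothetical.

```python
import re
from collections import namedtuple

# Mirrors the tuple shown in the earlier KeyError.
Word = namedtuple("Word", ["word", "feature_id", "weight"])

# Assumption: these IDs denote noun categories in the dictionary in use.
NOUN_FEATURE_IDS = frozenset({"36", "38", "41", "42", "45"})

# Matches a single-backtick code span on one line.
# Fenced code blocks are not handled here.
INLINE_CODE = re.compile(r"`[^`\n]+`")


def strip_inline_code(markdown_text):
    """Replace inline code spans with a space before tokenizing."""
    return INLINE_CODE.sub(" ", markdown_text)


def keep_nouns(ranked_words, noun_ids=NOUN_FEATURE_IDS):
    """Keep only the ranked words whose feature_id is in noun_ids."""
    return [w for w in ranked_words if w.feature_id in noun_ids]


ranked = [Word("brew", "45", 0.64), Word("まァ", "2", 0.24), Word("ですが", "26", 0.10)]
print(keep_nouns(ranked))
print(strip_inline_code("run `go get ./...` first").split())
```

Filtering after ranking keeps the frequency statistics untouched, so the stored Wikipedia data would not need to be regenerated.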