python - Document classification in spark mllib -


I want to classify documents as belonging to sports, entertainment, or politics. I have created a bag of words whose output looks something like this:

(1, 'saurashtra') (1, 'saumyajit') (1, 'satyendra')

I want to implement the Naive Bayes algorithm for classification using Spark MLlib. My question is: how do I convert this output into something Naive Bayes can use as input for classification, such as an RDD? Or is there a trick to convert the HTML files directly into something that MLlib Naive Bayes can use?

For text classification, you need to:

  • build a word dictionary
  • convert each document into a vector using the dictionary
  • label the document vectors:

    doc_vec1 -> label1

    doc_vec2 -> label2

    ...

This sample is pretty straightforward.
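
The three steps above can be sketched roughly as follows. This is only an illustration, not the sample the answer refers to: the toy corpus, the label numbering, and the helper names (`build_dictionary`, `to_vector`, `train_with_mllib`) are all made up for the example. The first part is plain Python; the last function assumes pyspark is installed and that you pass in a live SparkContext `sc`.

```python
from collections import Counter

# Hypothetical label numbering; MLlib's LabeledPoint expects a numeric label.
LABELS = {"sports": 0.0, "entertainment": 1.0, "politics": 2.0}

# Hypothetical toy corpus: (label name, tokens) pairs.
docs = [
    ("sports", ["cricket", "match", "saurashtra", "cricket"]),
    ("politics", ["election", "vote", "minister"]),
    ("entertainment", ["film", "actor", "film"]),
]


def build_dictionary(token_lists):
    """Step 1: map every distinct word to a fixed column index."""
    vocab = sorted({word for tokens in token_lists for word in tokens})
    return {word: index for index, word in enumerate(vocab)}


def to_vector(tokens, dictionary):
    """Step 2: turn a token list into a dense term-count vector.

    Words that are not in the dictionary are skipped.
    """
    vector = [0.0] * len(dictionary)
    for word, count in Counter(tokens).items():
        if word in dictionary:
            vector[dictionary[word]] = float(count)
    return vector


dictionary = build_dictionary(tokens for _, tokens in docs)

# Step 3: pair each document vector with its numeric label.
labeled = [(LABELS[name], to_vector(tokens, dictionary)) for name, tokens in docs]


def train_with_mllib(sc, labeled_vectors):
    """Train MLlib Naive Bayes on (label, vector) pairs.

    Assumes pyspark is available and `sc` is an active SparkContext.
    """
    from pyspark.mllib.classification import NaiveBayes
    from pyspark.mllib.regression import LabeledPoint

    points = sc.parallelize(
        [LabeledPoint(label, vector) for label, vector in labeled_vectors]
    )
    return NaiveBayes.train(points, 1.0)  # 1.0 is the smoothing parameter
```

If you already have (count, word) pairs like the ones in the question, you can fill the vector from them directly instead of recounting tokens. MLlib also ships `pyspark.mllib.feature.HashingTF`, which hashes words to column indices and so avoids building an explicit dictionary; whether that fits depends on how much you need to map columns back to words.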

