python - Document classification in Spark MLlib

I want to classify documents as belonging to sports, entertainment, or politics. I have created a bag of words whose output looks something like this:

(1, 'saurashtra') (1, 'saumyajit') (1, 'satyendra')

I want to implement the Naive Bayes algorithm for classification using Spark MLlib. My question is: how do I convert this output into an RDD that Naive Bayes can use as input for classification? Or is there a trick to convert the HTML files directly into something MLlib's Naive Bayes can use?

For text classification, you need to:

  • build a word dictionary
  • convert each document into a vector using the dictionary
  • label the document vectors:

    doc_vec1 -> label1

    doc_vec2 -> label2
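
The three steps above can be sketched roughly as follows. This is only a minimal illustration, not the asker's actual pipeline: the toy corpus, the `LABELS` mapping, and the helper names (`build_dictionary`, `to_sparse_counts`, `train_naive_bayes`) are all made up for the example, and the input is assumed to be already tokenized. The Spark part is kept inside a function so the dictionary and vector steps can be tried without a cluster; it assumes the RDD-based `pyspark.mllib` API and an existing `SparkContext`.

```python
from collections import Counter

# Hypothetical label encoding; MLlib Naive Bayes expects numeric class labels.
LABELS = {"sports": 0.0, "entertainment": 1.0, "politics": 2.0}

# Hypothetical toy corpus of (label name, tokens); real tokens would come
# from the text extracted out of the HTML files.
docs = [
    ("sports", ["cricket", "match", "saurashtra", "match"]),
    ("politics", ["election", "vote", "minister"]),
    ("entertainment", ["film", "actor", "film"]),
]


def build_dictionary(token_lists):
    """Step 1: map every distinct word to a fixed column index."""
    vocab = sorted({w for tokens in token_lists for w in tokens})
    return {w: i for i, w in enumerate(vocab)}


def to_sparse_counts(tokens, dictionary):
    """Step 2: turn one document into {column index: word count}."""
    counts = Counter(w for w in tokens if w in dictionary)
    return {dictionary[w]: float(c) for w, c in counts.items()}


dictionary = build_dictionary(tokens for _, tokens in docs)

# Step 3: pair each document vector with its numeric label.
labeled = [
    (LABELS[name], to_sparse_counts(tokens, dictionary))
    for name, tokens in docs
]


def train_naive_bayes(sc, labeled, num_features):
    """Train MLlib Naive Bayes; `sc` is assumed to be a live SparkContext."""
    from pyspark.mllib.classification import NaiveBayes
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    points = sc.parallelize([
        LabeledPoint(label, Vectors.sparse(num_features, counts))
        for label, counts in labeled
    ])
    return NaiveBayes.train(points, 1.0)  # 1.0 is the smoothing parameter
```

With a running context, `train_naive_bayes(sc, labeled, len(dictionary))` would return a model whose `predict` takes a vector built the same way. If keeping an explicit dictionary is impractical, `pyspark.mllib.feature.HashingTF` is one alternative that hashes words to column indices instead.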


This sample is pretty straightforward.

