python - Document classification in Spark MLlib


I want to classify documents according to whether they belong to sports, entertainment, or politics. I have created a bag of words whose output looks something like this:

(1, 'saurashtra') (1, 'saumyajit') (1, 'satyendra')

I want to implement the Naive Bayes algorithm for classification using Spark MLlib. My question is how to convert this output into something Naive Bayes can use as input for classification, such as an RDD, or whether there is a trick to convert the HTML files directly into something MLlib's Naive Bayes can use.

For text classification, you need to:

  • build a word dictionary
  • convert each document into a vector using the dictionary
  • label the document vectors:

    doc_vec1 -> label1

    doc_vec2 -> label2

    ...

This sample is pretty straightforward.

