twitter - Remove non-English words from a corpus using the R tm package

- May 15, 2011

I am trying to run an n-gram analysis on tweets extracted from Twitter.

In this case, I want to remove the non-English words in the corpus while using the packages below.

My code is below:

# Install and activate packages
install.packages(c("twitteR", "RCurl", "RJSONIO", "stringr"))
library(twitteR)
library(RCurl)
library(RJSONIO)
library(stringr)
library("RWeka")
library("tm")

# Declare Twitter API credentials
api_key <- "xxxx" # from dev.twitter.com
api_secret <- "xxxx" # from dev.twitter.com
token <- "xxxx" # from dev.twitter.com
token_secret <- "xxxx" # from dev.twitter.com

# Create Twitter connection
setup_twitter_oauth(api_key, api_secret, token, token_secret)

# Run Twitter search. Format is searchTwitter("Search Terms", n=100, lang="en", geocode="lat,lng"); it also accepts since and until.
tweets <- searchTwitter("'chinese' OR 'chinese goverment' OR 'china goverment' OR 'china economic' OR 'chinese people'", n=10000, lang="en")

# Transform the tweets list into a data frame
tweets.df <- twListToDF(tweets)

data <- as.data.frame(tweets.df[,1])
colnames(data)
colnames(data)[1] <- "text"

data <- Corpus(DataframeSource(data))

docs <- data

docs <- tm_map(docs, removePunctuation)   # *Removing punctuation*
docs <- tm_map(docs, removeNumbers)   # *Removing numbers*
docs <- tm_map(docs, tolower)   # *Converting to lowercase*
docs <- tm_map(docs, removeWords, stopwords("english"))   # *Removing "stopwords"*
docs <- tm_map(docs, stemDocument)   # *Removing common word endings* (e.g., "ing", "es")
docs <- tm_map(docs, stripWhitespace)   # *Stripping whitespace*
docs <- tm_map(docs, PlainTextDocument)

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
options(header=TRUE, stringAsFactors=FALSE, fileEncoding="latin1")
tdm <- TermDocumentMatrix(data, control = list(tokenize = BigramTokenizer))

When I run the last step,

tdm <- TermDocumentMatrix(data, control = list(tokenize = BigramTokenizer))

R gives me this error:

> tdm <- TermDocumentMatrix(data, control = list(tokenize = BigramTokenizer))
Error in .tolower(txt) :
  invalid input 'chinese mentality í ½í¸‚' in 'utf8towcs'

Does anyone know how I should modify my code to remove the non-English words?

Thanks.
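
The error message suggests that `tolower` is failing on bytes it cannot interpret as valid text (here, most likely an emoji in the tweet). One possible approach, sketched below under that assumption, is to drop every character that cannot be represented in ASCII before lowercasing. The helper name `strip_non_ascii` and the sample tweets are hypothetical and not part of the original code. Note that this removes non-ASCII characters rather than whole non-English words.

```r
# Hypothetical helper (not from the original post): drop every character that
# cannot be represented in ASCII, such as emoji or accented letters.
strip_non_ascii <- function(x) {
  # sub = "" asks iconv to delete bytes it cannot convert instead of returning NA
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}

# Made-up sample input: one tweet with an emoji, one plain ASCII tweet
tweets_text <- c("Chinese mentality \U0001F602", "plain ascii tweet")

clean_text <- strip_non_ascii(tweets_text)

# tolower no longer sees any multibyte input
lower_text <- tolower(clean_text)
print(lower_text)
```

In a tm pipeline, a helper like this could probably be applied with `tm_map(docs, content_transformer(strip_non_ascii))` before the lowercasing step, assuming a tm version that provides `content_transformer`. It may also be worth checking that the final call receives the cleaned `docs` object: the posted code passes `data`, so none of the `tm_map` steps affect the term-document matrix.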
