twitter - Remove non-English words from a corpus using the R tm package
I am trying to run an n-gram analysis on tweets extracted from Twitter.
In this case, I want to remove non-English words from the corpus while using the packages below.
My code is below:
# Install and activate packages
install.packages(c("twitteR", "RCurl", "RJSONIO", "stringr"))
library(twitteR)
library(RCurl)
library(RJSONIO)
library(stringr)
library("RWeka")
library("tm")

# Declare Twitter API credentials
api_key <- "xxxx" # from dev.twitter.com
api_secret <- "xxxx" # from dev.twitter.com
token <- "xxxx" # from dev.twitter.com
token_secret <- "xxxx" # from dev.twitter.com

# Create Twitter connection
setup_twitter_oauth(api_key, api_secret, token, token_secret)

# Run Twitter search. Format is searchTwitter("search terms", n=100, lang="en", geocode="lat,lng"); it also accepts since and until.
tweets <- searchTwitter("'chinese' OR 'chinese goverment' OR 'china goverment' OR 'china economic' OR 'chinese people'", n=10000, lang="en")

# Transform the tweets list into a data frame
tweets.df <- twListToDF(tweets)
data <- as.data.frame(tweets.df[,1])
colnames(data)
colnames(data)[1] <- "text"
data <- Corpus(DataframeSource(data))
docs <- data
docs <- tm_map(docs, removePunctuation) # removing punctuation
docs <- tm_map(docs, removeNumbers) # removing numbers
docs <- tm_map(docs, tolower) # converting to lowercase
docs <- tm_map(docs, removeWords, stopwords("english")) # removing "stopwords"
docs <- tm_map(docs, stemDocument) # removing common word endings (e.g., "ing", "es")
docs <- tm_map(docs, stripWhitespace) # stripping whitespace
docs <- tm_map(docs, PlainTextDocument)

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
options(header=TRUE, stringAsFactors=FALSE, fileEncoding="latin1")
tdm <- TermDocumentMatrix(data, control = list(tokenize = BigramTokenizer))
When I run the last step,
tdm <- TermDocumentMatrix(data, control = list(tokenize = BigramTokenizer))
R gives me this error:
> tdm <- TermDocumentMatrix(data, control = list(tokenize = BigramTokenizer))
Error in .tolower(txt) : invalid input 'chinese mentality í ½í¸‚' in 'utf8towcs'
Does anyone know how I should modify the code to remove non-English words?
Thanks.
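For readers hitting the same error: the unreadable bytes after 'chinese mentality' look like an emoji that could not be decoded. One commonly used workaround is to strip everything that has no ASCII equivalent from the tweet text before the corpus is built. The sketch below uses only base R; `strip_non_ascii` and the sample tweets are hypothetical names and data made up for illustration, not part of tm or twitteR, and this is one possible approach rather than a confirmed fix for the code above.

```r
# Hypothetical helper (not part of tm or twitteR): drop every character
# that cannot be represented in ASCII, e.g. emoji or accented letters.
strip_non_ascii <- function(x) {
  # sub = "" makes iconv delete non-convertible bytes instead of returning NA
  cleaned <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  # collapse the gaps left behind and trim leading/trailing spaces
  trimws(gsub("\\s+", " ", cleaned))
}

# Made-up sample tweets, for illustration only
sample_tweets <- c("chinese mentality \U0001F602",
                   "china economic growth",
                   "caf\u00e9 in beijing")

print(strip_non_ascii(sample_tweets))
```

In the code from the question, such a helper would presumably be applied to the text column before the call to Corpus(), for example on tweets.df[,1]. Note that this removes non-ASCII characters, not non-English words as such; filtering whole words by language would need a dictionary or a language-detection step. Note also that the failing call passes data rather than the cleaned docs to TermDocumentMatrix, so the tm_map steps are not reflected in the matrix.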