Document term matrix in R

April 15, 2014

I have the following code:

    rm(list=ls(all=TRUE)) #clear data
    setwd("~/ucsb/14 win 15/issy/text.fwt") #set working directory
    files <- list.files(); head(files) #load & check working directory

    fw1 <- scan(what="c", sep="\n", file="fw_chp01.fwt")

    library(tm)

    corpus2 <- Corpus(VectorSource(c(fw1)))
    skipWords <- (function(x) removeWords(x, stopwords("english")))

    #remove punc, numbers, stopwords, etc
    funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
    corpus2.proc <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)

    corpus2a.dtm <- DocumentTermMatrix(corpus2.proc, control = list(wordLengths = c(1,110))) #create document term matrix

I'm trying to use some of the operations detailed in the tm reference manual (http://cran.r-project.org/web/packages/tm/tm.pdf) with little success. For example, when I try to use findFreqTerms, I get the following error:

    Error: inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")) is not TRUE

Can anyone clue me in as to why this isn't working and how I can fix it?
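The error message is the text of a failed class assertion, so one way to narrow things down is to run the same check by hand on whatever object is being passed to findFreqTerms. The following is only a minimal sketch: the two-document corpus and the helper name `is_tdm` are made up for illustration and are not part of the code above.

```r
library(tm)

# Hypothetical two-document corpus standing in for the real text.
docs <- c("riverrun past eve and adams", "howth castle and environs")
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Illustrative helper: the same condition quoted in the error message.
is_tdm <- function(x) inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix"))

class(corpus)    # a corpus class, not a matrix class
is_tdm(corpus)   # FALSE: passing this object would fail the assertion
is_tdm(dtm)      # TRUE: this object is acceptable to findFreqTerms()

# Terms whose total count across all documents is at least 2.
freq_terms <- findFreqTerms(dtm, lowfreq = 2)
freq_terms
```

If the check returns FALSE for the object in question, calling class() on it usually shows what it actually is (a corpus, the value returned by inspect(), and so on).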

Edited for @lawyer:

head(fw1) produces the first six lines of the text (Episode 1 of Finnegans Wake by James Joyce):

    [1] "003.01    riverrun, past Eve and Adam's, from swerve of shore to bend"
    [2] "003.02  of bay, brings us by a commodius vicus of recirculation back to"
    [3] "003.03  Howth Castle and Environs."
    [4] "003.04    Sir Tristram, violer d'amores, fr'over the short sea, had passen-"
    [5] "003.05  core rearrived from North Armorica on this side the scraggy"
    [6] "003.06  isthmus of Europe Minor to wielderfight his penisolate war: nor"

inspect(corpus2) outputs each line of the text in the following format (this is the final line of the text):

    [[960]]
    <<PlainTextDocument (metadata: 7)>>
    029.36  borough. #this part differs by line of course

inspect(corpus2a.dtm) returns a table of all the types (there are 4163 in total) in the text, in the following format:

    Docs  youths yoxen yu yurap yutah zee zephiroth zine zingzang zmorde zoom
      1        0     0  0     0     0   0         0    0        0      0    0
      2        0     0  0     0     0   0         0    0        0      0    0

Here is a simplified form of what you provided and did, and tm does its job. It may be that one or more of your cleaning steps has caused a problem.

    > library(tm)
    > fw1 <- c("riverrun, past Eve and Adam's, from swerve of shore to bend
    +                                  of bay, brings us by a commodius vicus of recirculation back to
    +                                  Howth Castle and Environs.
    +                                  Sir Tristram, violer d'amores, fr'over the short sea, had passen-
    +                                  core rearrived from North Armorica on this side the scraggy
    +                                  isthmus of Europe Minor to wielderfight his penisolate war: nor")
    >
    > corpus <- Corpus(VectorSource(c(fw1)))
    > inspect(corpus)
    <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

    [[1]]
    <<PlainTextDocument (metadata: 7)>>
    riverrun, past Eve and Adam's, from swerve of shore to bend
                                     of bay, brings us by a commodius vicus of recirculation back to
                                     Howth Castle and Environs.
                                     Sir Tristram, violer d'amores, fr'over the short sea, had passen-
                                     core rearrived from North Armorica on this side the scraggy
                                     isthmus of Europe Minor to wielderfight his penisolate war: nor

    > dtm <- DocumentTermMatrix(corpus)
    > findFreqTerms(dtm)
     [1] "adam's,"       "and"           "armorica"      "back"          "bay,"          "bend"
     [7] "brings"        "castle"        "commodius"     "core"          "d'amores,"     "environs."
    [13] "europe"        "eve"           "fr'over"       "from"          "had"           "his"
    [19] "howth"         "isthmus"       "minor"         "nor"           "north"         "passen-"
    [25] "past"          "penisolate"    "rearrived"     "recirculation" "riverrun,"     "scraggy"
    [31] "sea,"          "shore"         "short"         "side"          "sir"           "swerve"
    [37] "the"           "this"          "tristram,"     "vicus"         "violer"        "war:"
    [43] "wielderfight"
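To find out which cleaning step (if any) is responsible, one option is to apply the transformations one at a time with tm_map and rebuild the matrix after each step, rather than passing them all through tm_reduce at once. This is a sketch under assumptions: the two sample strings and the step names in the list are invented for illustration, and the stopword step is left out to keep it short.

```r
library(tm)

# Hypothetical sample lines, loosely modelled on the text in the question.
docs <- c("Riverrun, past Eve and Adam's, 003.01",
          "Howth Castle and Environs. 003.03")
corpus <- VCorpus(VectorSource(docs))

# Cleaning steps applied one by one, in the order used in the question.
steps <- list(
  lower = content_transformer(tolower),
  punct = removePunctuation,
  nums  = removeNumbers,
  space = stripWhitespace
)

for (name in names(steps)) {
  corpus <- tm_map(corpus, steps[[name]])
  dtm <- DocumentTermMatrix(corpus)
  # Report the corpus class and the vocabulary size after each step.
  cat(name, ":", class(corpus)[1], "/", nTerms(dtm), "terms\n")
}

# Cleaned text of the first document.
first_doc <- content(corpus[[1]])
first_doc
```

If the corpus class changes or DocumentTermMatrix fails after a particular step, that step is the one to look at more closely.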

As another point, I find it useful at the start to load a few other packages that complement tm.

    library(SnowballC); library(RWeka); library(rJava); library(RWekajars)

For what it's worth, compared to your somewhat complicated cleaning steps, I usually trudge along like this (replace comments$comment with your text vector):

    library(tm)
    library(stringr)  # load before the first str_replace_all() call

    comments$comment <- tolower(comments$comment)
    comments$comment <- removeNumbers(comments$comment)
    comments$comment <- stripWhitespace(comments$comment)
    comments$comment <- str_replace_all(comments$comment, "  ", " ")  # replace double spaces internally with a single space

    # better to remove punctuation with str_ because the tm function doesn't insert a space
    comments$comment <- str_replace_all(comments$comment, pattern = "[[:punct:]]", " ")

    comments$comment <- removeWords(comments$comment, stopwords(kind = "english"))
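The difference between deleting punctuation and replacing it with a space can be seen on a single string. The sketch below uses base R's gsub as a stand-in for both approaches, so it runs without tm or stringr; the sample string is made up from fragments of the text above.

```r
# Hypothetical sample string with punctuation between word fragments.
x <- "fr'over the short sea,had passen-"

# Deleting punctuation outright fuses the neighbouring fragments.
deleted <- gsub("[[:punct:]]", "", x)

# Replacing punctuation with a space keeps the fragments apart;
# repeated spaces are then squeezed and the ends trimmed.
spaced <- gsub("[[:punct:]]", " ", x)
spaced <- trimws(gsub("[[:space:]]+", " ", spaced))

deleted  # "frover the short seahad passen"
spaced   # "fr over the short sea had passen"
```

Replacing punctuation with spaces can itself produce runs of spaces, so it may be worth collapsing whitespace after the punctuation step rather than before it.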
