Document term matrix in R
I have the following code:
    rm(list = ls(all = TRUE))  # clear data
    setwd("~/ucsb/14 win 15/issy/text.fwt")  # set working directory
    files <- list.files(); head(files)  # load & check working directory
    fw1 <- scan(what = "c", sep = "\n", file = "fw_chp01.fwt")

    library(tm)
    corpus2 <- Corpus(VectorSource(c(fw1)))
    skipWords <- function(x) removeWords(x, stopwords("english"))
    # remove punctuation, numbers, stopwords, etc.
    funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
    corpus2.proc <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
    corpus2a.dtm <- DocumentTermMatrix(corpus2.proc, control = list(wordLengths = c(1, 110)))  # create document term matrix
I've been trying to use some of the operations detailed in the tm reference manual (http://cran.r-project.org/web/packages/tm/tm.pdf) with little success. For example, when I try to use findFreqTerms, I get the following error:

    Error: inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")) is not TRUE

Can anyone clue me in as to why this isn't working and how I can fix it?
Edited for @lawyer:
head(fw1) produces the first six lines of the text (Episode 1 of Finnegans Wake by James Joyce):
    [1] "003.01 riverrun, past Eve and Adam's, from swerve of shore to bend"
    [2] "003.02 of bay, brings us by a commodius vicus of recirculation back to"
    [3] "003.03 Howth Castle and Environs."
    [4] "003.04 Sir Tristram, violer d'amores, fr'over the short sea, had passen-"
    [5] "003.05 core rearrived from North Armorica on this side the scraggy"
    [6] "003.06 isthmus of Europe Minor to wielderfight his penisolate war: nor"
inspect(corpus2) outputs each line of the text in the following format (this is the final line of the text):

    [[960]]
    <<PlainTextDocument (metadata: 7)>>
    029.36 borough. #this part differs by line of course
inspect(corpus2a.dtm) returns a table of all the types (there are 4163 in total) in the text in the following format:

    Docs youths yoxen yu yurap yutah zee zephiroth zine zingzang zmorde zoom
       1      0     0  0     0     0   0         0    0        0      0    0
       2      0     0  0     0     0   0         0    0        0      0    0
Here is a simplified form of what you provided and did, and tm does its job. It may be that one or more of your cleaning steps caused the problem.
    > library(tm)
    > fw1 <- c("riverrun, past Eve and Adam's, from swerve of shore to bend
    + of bay, brings us by a commodius vicus of recirculation back to
    + Howth Castle and Environs.
    + Sir Tristram, violer d'amores, fr'over the short sea, had passen-
    + core rearrived from North Armorica on this side the scraggy
    + isthmus of Europe Minor to wielderfight his penisolate war: nor")
    >
    > corpus <- Corpus(VectorSource(c(fw1)))
    > inspect(corpus)
    <<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

    [[1]]
    <<PlainTextDocument (metadata: 7)>>
    riverrun, past Eve and Adam's, from swerve of shore to bend
    of bay, brings us by a commodius vicus of recirculation back to
    Howth Castle and Environs.
    Sir Tristram, violer d'amores, fr'over the short sea, had passen-
    core rearrived from North Armorica on this side the scraggy
    isthmus of Europe Minor to wielderfight his penisolate war: nor

    > dtm <- DocumentTermMatrix(corpus)
    > findFreqTerms(dtm)
     [1] "adam's,"       "and"           "armorica"      "back"          "bay,"          "bend"
     [7] "brings"        "castle"        "commodius"     "core"          "d'amores,"     "environs."
    [13] "europe"        "eve"           "fr'over"       "from"          "had"           "his"
    [19] "howth"         "isthmus"       "minor"         "nor"           "north"         "passen-"
    [25] "past"          "penisolate"    "rearrived"     "recirculation" "riverrun,"     "scraggy"
    [31] "sea,"          "shore"         "short"         "side"          "sir"           "swerve"
    [37] "the"           "this"          "tristram,"     "vicus"         "violer"        "war:"
    [43] "wielderfight"
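If a cleaning step is the culprit, one way to narrow it down is to apply the steps one at a time and rebuild the matrix after each. The following is only a rough sketch under assumptions: the three sample lines stand in for the scanned chapter, the step names are made up for illustration, and the order of the steps is just one possible choice. The inherits() call mirrors the condition quoted in the error message.

```r
library(tm)

# Sample lines standing in for the scanned chapter (assumption: any character vector works)
fw1 <- c("riverrun, past Eve and Adam's, from swerve of shore to bend",
         "of bay, brings us by a commodius vicus of recirculation back to",
         "Howth Castle and Environs.")

corpus <- VCorpus(VectorSource(fw1))

# The cleaning steps from the question, applied one at a time instead of via tm_reduce.
# The list names are illustrative labels only.
steps <- list(
  lower      = content_transformer(tolower),
  punct      = removePunctuation,
  numbers    = removeNumbers,
  stopwords  = function(x) removeWords(x, stopwords("english")),
  whitespace = stripWhitespace
)

for (name in names(steps)) {
  corpus <- tm_map(corpus, steps[[name]])
  dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))
  # Same condition as in the error message: it must hold for findFreqTerms() to accept dtm
  ok <- inherits(dtm, c("DocumentTermMatrix", "TermDocumentMatrix"))
  cat(name, ":", ok, "\n")
}

# List every term that occurs at least once in the cleaned corpus
findFreqTerms(dtm, lowfreq = 1)
```

If every step reports TRUE here but the error persists in your session, it is worth checking class() on the exact object you pass to findFreqTerms.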
As another point, I find it useful at the start to load a few other complementary packages to tm.

    library(SnowballC); library(RWeka); library(rJava); library(RWekajars)
For what it's worth, compared to your complicated cleaning steps, I trudge along like this (replace comments$comment with your text vector):
    library(stringr)  # needed for str_replace_all, so load it before the first call
    comments$comment <- tolower(comments$comment)
    comments$comment <- removeNumbers(comments$comment)
    comments$comment <- stripWhitespace(comments$comment)
    comments$comment <- str_replace_all(comments$comment, "  ", " ")  # replace double spaces internally with a single space
    # better to remove punctuation with str_ because the tm function doesn't insert a space
    comments$comment <- str_replace_all(comments$comment, pattern = "[[:punct:]]", " ")
    comments$comment <- removeWords(comments$comment, stopwords(kind = "english"))
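To tie this back to the original question, a cleaned vector like that can then go into a corpus and a document term matrix. This is a minimal sketch, not a definitive recipe: the comments data frame below is a hypothetical stand-in with two sample lines, and the cleaning is shortened to a few of the steps above.

```r
library(tm)
library(stringr)

# Hypothetical stand-in for the data frame used above
comments <- data.frame(
  comment = c("riverrun, past Eve and Adam's, from swerve of shore to bend",
              "Sir Tristram, violer d'amores, fr'over the short sea"),
  stringsAsFactors = FALSE
)

# Shortened cleaning on the plain character vector
comments$comment <- tolower(comments$comment)
comments$comment <- str_replace_all(comments$comment, "[[:punct:]]", " ")
comments$comment <- removeWords(comments$comment, stopwords(kind = "english"))
comments$comment <- stripWhitespace(comments$comment)

# One document per element of the vector
corpus <- VCorpus(VectorSource(comments$comment))
dtm <- DocumentTermMatrix(corpus)

# Same kind of check that findFreqTerms() performs on its argument
stopifnot(inherits(dtm, "DocumentTermMatrix"))

freq_terms <- findFreqTerms(dtm, lowfreq = 1)
print(freq_terms)
```

Cleaning the character vector first means the corpus only ever sees plain text, so there are fewer places for the document classes to get lost along the way.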