r - Inclusion/Exclusion of rows in a DataFrame, based on specific criteria -


i have large set of data contains pathology test data number of individuals. present scaled down data set describing types of cases.

library(plyr) library(tidyr) library(dplyr) library(lubridate)  options(stringsasfactors = false) dat <- structure(list(persid = c("am1", "am2", "am2", "am3", "am3", "am4", "am4", "am4", "am4", "am4", "am4"), sex = c("m", "f","f", "m", "m", "f", "f", "f", "f", "f", "f"), datetested = c("21/10/2015", "9/07/2010", "24/09/2010", "23/10/2013", "25/10/2013", "28/04/2010", "23/06/2010", "21/07/2010", "20/10/2010", "4/03/2011", "2/12/2011"), res = c("nr", "r", "r", "nr", "r", "r", "r", "r", "r", "r", "r"), status = c("yes", "no", "no", "yes", "yes", "no", "no", "no", "no", "no", "no"), dateorder = c(1l, 1l, 2l, 1l, 2l, 1l, 2l, 3l, 4l, 5l, 6l)), .names = c("persid", "sex", "datetested", "res", "status", "dateorder"), class = "data.frame", row.names = c(na, -11l)) 

the data describes 3 types of person (1)those single result (2) 2 results, , (3) many results.

my goal come script include rows individuals according set of criteria. technically method count rows individuals if subsequent results within specified reinfection period (30 days).

i have converted data list , passed number of functions start processing data.

dat$datetested <- dmy(dat$datetested) datlist <- dlply(.data=dat, .variables=c('persid')) 

what have done far is:

select rows there single result per person

fnsingletests <- function(y){     y <- y[length(y$dateorder)==1,] }  singletests <- ldply(datlist, fnsingletests, .id = null) 

convert data frame list , pass function determines if (a) there 2 rows per person within 30-day reinfection period, select first one, , (b) if there more 2 rows per person, , last record , first record within 30 days, keep first one.

fnmultitests <- function(y){     y <- y[length(y$dateorder) > 1,] }  multitests <- llply(datlist, fnmultitests)  fnmultitestssplit <- function(y){      test <- difftime(y$datetested[length(y$datetested)], y$datetested[1], units='days')       if (nrow(y) <=2){          if (test < 31){             y <- y[y$dateorder == 1, ]             y <- y[!is.na(y$perdid), ]         } else {             y <- y[y$dateorder %in% 1:2, ]             y <- y[!is.na(y$persid), ]         }      } else  {         if (test < 31){             y <- y[y$dateorder == 1, ]             y <- y[!is.na(y$persid), ]         } else {             break()         }      } }  finaltests <-  ldply(multitests, failwith(null, fnmultitestssplit, quiet = true), .id = null) 

i can combine data frames rbind:

allfinaltests <- rbind(singletests, finaltests)

where stuck cases there more 2 rows per person, , within sequential rows there may cases of period of time greater 30-day reinfection period.

can suggest how extend code include cases there more 2 persid , include results there subsequent cases occur outside 30 day reinfection period.

specifically, start oldest case , if next case within 30 days exclude second cases, or if second case more 30 days since previous case, include both cases. should cases same persid

in example final output looking is:

persid  sex datetested  res status  dateorder am1 m   21/10/2015  nr  yes 1 am2 f   9/07/2010   r   no  1 am2 f   24/09/2010  r   no  2 am3 m   23/10/2013  nr  yes 1 am4 f   28/04/2010  r   no  1 am4 f   23/06/2010  r   no  2 am4 f   20/10/2010  r   no  4 am4 f   4/03/2011   r   no  5 am4 f   2/12/2011   r   no  6 

in base r, approach follows:

# convert 'datetested' column date-format dat$datetested <- as.date(dat$datetested, format = "%d/%m/%y") # calculate difference in days previous observation in group dat$tdiff <- unlist(tapply(dat$datetested, index = dat$persid,                            fun = function(x) c(0, `units<-`(diff(x), "days")))) # filter observations have either timedifference of 0 or more  dat[(dat[,"tdiff"]==0 | dat[,"tdiff"] > 30),] 

which gives:

   persid sex datetested res status dateorder tdiff 1     am1   m 2015-10-21  nr    yes         1     0 2     am2   f 2010-07-09   r     no         1     0 3     am2   f 2010-09-24   r     no         2    77 4     am3   m 2013-10-23  nr    yes         1     0 6     am4   f 2010-04-28   r     no         1     0 7     am4   f 2010-06-23   r     no         2    56 9     am4   f 2010-10-20   r     no         4    91 10    am4   f 2011-03-04   r     no         5   135 11    am4   f 2011-12-02   r     no         6   273 

using data.table package:

library(data.table) # convert 'data.frame' 'data.table' # , convert 'datetested' column date-format setdt(dat)[, datetested := as.date(datetested, format = "%d/%m/%y")] # calculate difference in days previous observation in group dat[, tdiff := c(0, `units<-`(diff(datetested), "days")), persid] # filter observations have either timedifference of 0 or more 30 days dat[(tdiff==0 | tdiff > 30)] 

which give same result. can chain follows:

setdt(dat)[, datetested := as.date(datetested, format = "%d/%m/%y")            ][, tdiff := c(0, `units<-`(diff(datetested), "days")), = persid              ][(tdiff==0 | tdiff > 30)] 

and using dplyr:

library(dplyr) dat %>%    mutate(datetested = as.date(datetested, format = "%d/%m/%y")) %>%   group_by(persid) %>%   mutate(tdiff = c(0, `units<-`(diff(datetested), "days"))) %>%   filter(tdiff == 0 | tdiff > 30) 

which give same result.


Comments

Popular posts from this blog

c++ - llvm function pass ReplaceInstWithInst malloc -

java.lang.NoClassDefFoundError When Creating New Android Project -

Decoding a Python 2 `tempfile` with python-future -