r - Inclusion/Exclusion of rows in a DataFrame, based on specific criteria -
i have large set of data contains pathology test data number of individuals. present scaled down data set describing types of cases.
library(plyr) library(tidyr) library(dplyr) library(lubridate) options(stringsasfactors = false) dat <- structure(list(persid = c("am1", "am2", "am2", "am3", "am3", "am4", "am4", "am4", "am4", "am4", "am4"), sex = c("m", "f","f", "m", "m", "f", "f", "f", "f", "f", "f"), datetested = c("21/10/2015", "9/07/2010", "24/09/2010", "23/10/2013", "25/10/2013", "28/04/2010", "23/06/2010", "21/07/2010", "20/10/2010", "4/03/2011", "2/12/2011"), res = c("nr", "r", "r", "nr", "r", "r", "r", "r", "r", "r", "r"), status = c("yes", "no", "no", "yes", "yes", "no", "no", "no", "no", "no", "no"), dateorder = c(1l, 1l, 2l, 1l, 2l, 1l, 2l, 3l, 4l, 5l, 6l)), .names = c("persid", "sex", "datetested", "res", "status", "dateorder"), class = "data.frame", row.names = c(na, -11l))
the data describes 3 types of person (1)those single result (2) 2 results, , (3) many results.
my goal come script include rows individuals according set of criteria. technically method count rows individuals if subsequent results within specified reinfection period (30 days).
i have converted data list , passed number of functions start processing data.
dat$datetested <- dmy(dat$datetested) datlist <- dlply(.data=dat, .variables=c('persid'))
what have done far is:
select rows there single result per person
fnsingletests <- function(y){ y <- y[length(y$dateorder)==1,] } singletests <- ldply(datlist, fnsingletests, .id = null)
convert data frame list , pass function determines if (a) there 2 rows per person within 30-day reinfection period, select first one, , (b) if there more 2 rows per person, , last record , first record within 30 days, keep first one.
fnmultitests <- function(y){ y <- y[length(y$dateorder) > 1,] } multitests <- llply(datlist, fnmultitests) fnmultitestssplit <- function(y){ test <- difftime(y$datetested[length(y$datetested)], y$datetested[1], units='days') if (nrow(y) <=2){ if (test < 31){ y <- y[y$dateorder == 1, ] y <- y[!is.na(y$perdid), ] } else { y <- y[y$dateorder %in% 1:2, ] y <- y[!is.na(y$persid), ] } } else { if (test < 31){ y <- y[y$dateorder == 1, ] y <- y[!is.na(y$persid), ] } else { break() } } } finaltests <- ldply(multitests, failwith(null, fnmultitestssplit, quiet = true), .id = null)
i can combine data frames rbind:
allfinaltests <- rbind(singletests, finaltests)
where stuck cases there more 2 rows per person, , within sequential rows there may cases of period of time greater 30-day reinfection period.
can suggest how extend code include cases there more 2 persid
, include results there subsequent cases occur outside 30 day reinfection period.
specifically, start oldest case , if next case within 30 days exclude second cases, or if second case more 30 days since previous case, include both cases. should cases same persid
in example final output looking is:
persid sex datetested res status dateorder am1 m 21/10/2015 nr yes 1 am2 f 9/07/2010 r no 1 am2 f 24/09/2010 r no 2 am3 m 23/10/2013 nr yes 1 am4 f 28/04/2010 r no 1 am4 f 23/06/2010 r no 2 am4 f 20/10/2010 r no 4 am4 f 4/03/2011 r no 5 am4 f 2/12/2011 r no 6
in base r, approach follows:
# convert 'datetested' column date-format dat$datetested <- as.date(dat$datetested, format = "%d/%m/%y") # calculate difference in days previous observation in group dat$tdiff <- unlist(tapply(dat$datetested, index = dat$persid, fun = function(x) c(0, `units<-`(diff(x), "days")))) # filter observations have either timedifference of 0 or more dat[(dat[,"tdiff"]==0 | dat[,"tdiff"] > 30),]
which gives:
persid sex datetested res status dateorder tdiff 1 am1 m 2015-10-21 nr yes 1 0 2 am2 f 2010-07-09 r no 1 0 3 am2 f 2010-09-24 r no 2 77 4 am3 m 2013-10-23 nr yes 1 0 6 am4 f 2010-04-28 r no 1 0 7 am4 f 2010-06-23 r no 2 56 9 am4 f 2010-10-20 r no 4 91 10 am4 f 2011-03-04 r no 5 135 11 am4 f 2011-12-02 r no 6 273
using data.table package:
library(data.table) # convert 'data.frame' 'data.table' # , convert 'datetested' column date-format setdt(dat)[, datetested := as.date(datetested, format = "%d/%m/%y")] # calculate difference in days previous observation in group dat[, tdiff := c(0, `units<-`(diff(datetested), "days")), persid] # filter observations have either timedifference of 0 or more 30 days dat[(tdiff==0 | tdiff > 30)]
which give same result. can chain follows:
setdt(dat)[, datetested := as.date(datetested, format = "%d/%m/%y") ][, tdiff := c(0, `units<-`(diff(datetested), "days")), = persid ][(tdiff==0 | tdiff > 30)]
and using dplyr:
library(dplyr) dat %>% mutate(datetested = as.date(datetested, format = "%d/%m/%y")) %>% group_by(persid) %>% mutate(tdiff = c(0, `units<-`(diff(datetested), "days"))) %>% filter(tdiff == 0 | tdiff > 30)
which give same result.
Comments
Post a Comment