我正在尝试找到一种更有效的方法来计算每个案例创建时间开放的案例数.案件在其创建日期/时间戳与其审查日期/时间戳之间是"开放的".您可以复制粘贴下面的代码来查看一个简单的功能示例:
# Create a bunch of date/time stamps for our example two_thousand <- as.POSIXct("2000-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); two_thousand_one <- as.POSIXct("2001-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); two_thousand_two <- as.POSIXct("2002-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); two_thousand_three <- as.POSIXct("2003-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); two_thousand_four <- as.POSIXct("2004-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); two_thousand_five <- as.POSIXct("2005-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); two_thousand_six <- as.POSIXct("2006-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); two_thousand_seven <- as.POSIXct("2007-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); two_thousand_eight <- as.POSIXct("2008-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); two_thousand_nine <- as.POSIXct("2009-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); two_thousand_ten <- as.POSIXct("2010-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); two_thousand_eleven <- as.POSIXct("2011-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); mid_two_thousand <- as.POSIXct("2000-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); mid_two_thousand_one <- as.POSIXct("2001-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); mid_two_thousand_mid_two <- as.POSIXct("2002-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); mid_two_thousand_three <- as.POSIXct("2003-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); mid_two_thousand_four <- as.POSIXct("2004-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); mid_two_thousand_five <- as.POSIXct("2005-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); mid_two_thousand_six <- as.POSIXct("2006-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); mid_two_thousand_seven <- as.POSIXct("2007-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); mid_two_thousand_eight <- as.POSIXct("2008-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); mid_two_thousand_nine <- as.POSIXct("2009-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); mid_two_thousand_ten <- as.POSIXct("2010-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); mid_two_thousand_eleven <- as.POSIXct("2011-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01"); # Create a table that has pairs of created & censored date/time stamps for cases, indicating the range during which each case is "open" comparison_table <- data.table(id = 1:10, created = c(two_thousand, two_thousand_two, two_thousand_four, two_thousand_six, two_thousand_eight, two_thousand_ten, two_thousand, two_thousand_six, two_thousand_three, two_thousand_one), censored = c(two_thousand_one, two_thousand_three, two_thousand_five, two_thousand_seven, two_thousand_nine, two_thousand_eleven, two_thousand_five, two_thousand_ten, two_thousand_eight, two_thousand_four)); # Create a table that has the creation date/time stamps at which we want to count all the open cases check_table <- data.table(id = 1:12, creation = c(mid_two_thousand, mid_two_thousand_one, mid_two_thousand_mid_two, mid_two_thousand_three, mid_two_thousand_four, mid_two_thousand_five, mid_two_thousand_six, mid_two_thousand_seven, mid_two_thousand_eight, mid_two_thousand_nine, mid_two_thousand_ten, mid_two_thousand_eleven)); # I use the DPLYR library as the group_by() + summarize() functions make this operation simple library(dplyr); # Group by id to set parameter for summarize() function check_table_grouped <- group_by(check_table, id); # For each id in the table, sum the number of times that its creation date/time stamp is greater than the creation date/time and less than the censor date/time of all cases in the comparison table # EDIT: Also added timing to compare with method below system.time(check_table_summary <- summarize(check_table_grouped, other_open_values_at_creation_count = sum((comparison_table$created < creation & comparison_table$censored > creation)))); # Result is as desired check_table_summary; # EDIT: Added @David-arenburg's solution with timing library(data.table); setDT(check_table)[, creation2 := creation]; setkey(comparison_table, created, censored); system.time(foverlaps_table <- foverlaps(check_table, comparison_table, by.x = c("creation", "creation2"))[, sum(!is.na(id)), by = i.id]); # Same results as above foverlaps_table;
这种方法适用于本例中的小数据集.然而,即使我使用向量化操作,计算时间也呈指数增长,因为操作计数为:(3*nrow比较)*(nrow sum(nrow)计算).在nrow = 10,000时,时间约为14s,nrow = 100,000,时间> 20分钟.我的实际nrow是~1,000,000.
有更有效的方法来进行此计算吗?我目前正在研究多核选项,但即使这些选项也只能线性减少执行时间.非常感谢您的帮助.谢谢!
编辑:添加@David-arenburg的data.table::foverlaps
解决方案,它也可以工作,并且对于nrow <1000更快.但是,它比summarize
大量行的解决方案慢.在10,000行,它是两倍长.在50,000行,我放弃了等待10倍.有趣的是,该foverlaps
解决方案似乎没有触发自动垃圾收集,因此经常坐在最大RAM(我的系统上64GB),而summarize
解决方案周期性地触发自动垃圾收集,因此永远不会超过~40GB的RAM.我不确定这是否与速度差异有关.
最终编辑:我以一种方式重新编写了问题,使受访者更容易生成具有合适的创建/审查日期时间的大型表.我还简化并更清楚地解释了问题,清楚地表明查找表非常大(违反data.table::foverlaps
假设).我甚至建立了时序比较,使大型案例测试变得非常简单!详细信息:在大型数据集中提交每个案例时计算未结案例的有效方法
再次感谢大家的帮助!:)
又一个foverlaps
解决方案.假设comparison_table
不是太大
library(data.table); setkey(comparison_table, created, censored); times <- sort(unique(c(comparison_table$created, comparison_table$censored))) dt <- data.table(creation=times+1)[, creation2 := creation]; setkey(dt, creation, creation2) x <- foverlaps(comparison_table, dt, by.x = c("created", "censored"))[,.N,creation]$N check_table$newcol <- x[findInterval(check_table$creation, times)]