A more efficient method for counting open cases as of each case's creation time

Author: pan2502851807 | 2023-09-10 15:31

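I am trying to find a more efficient way to count the number of cases that are open as of the creation time of each case. A case is "open" between its creation date/time stamp and its censor date/time stamp. You can copy and paste the code below to see a simple working example: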

# data.table() is called below, so load the package first
library(data.table);

# Create a bunch of date/time stamps for our example
two_thousand                <- as.POSIXct("2000-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_one            <- as.POSIXct("2001-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_two            <- as.POSIXct("2002-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_three          <- as.POSIXct("2003-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_four           <- as.POSIXct("2004-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_five           <- as.POSIXct("2005-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_six            <- as.POSIXct("2006-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_seven          <- as.POSIXct("2007-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_eight          <- as.POSIXct("2008-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_nine           <- as.POSIXct("2009-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_ten            <- as.POSIXct("2010-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
two_thousand_eleven         <- as.POSIXct("2011-01-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");

mid_two_thousand            <- as.POSIXct("2000-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_one        <- as.POSIXct("2001-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_mid_two    <- as.POSIXct("2002-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_three      <- as.POSIXct("2003-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_four       <- as.POSIXct("2004-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_five       <- as.POSIXct("2005-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_six        <- as.POSIXct("2006-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_seven      <- as.POSIXct("2007-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_eight      <- as.POSIXct("2008-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_nine       <- as.POSIXct("2009-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_ten        <- as.POSIXct("2010-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");
mid_two_thousand_eleven     <- as.POSIXct("2011-06-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC", origin="1970-01-01");

# Create a table that has pairs of created & censored date/time stamps for cases, indicating the range during which each case is "open"
comparison_table    <- data.table(id        = 1:10,
                                  created   = c(two_thousand, two_thousand_two, two_thousand_four, two_thousand_six, two_thousand_eight, two_thousand_ten, two_thousand, two_thousand_six, two_thousand_three, two_thousand_one),
                                  censored  = c(two_thousand_one, two_thousand_three, two_thousand_five, two_thousand_seven, two_thousand_nine, two_thousand_eleven, two_thousand_five, two_thousand_ten, two_thousand_eight, two_thousand_four));

# Create a table that has the creation date/time stamps at which we want to count all the open cases
check_table         <- data.table(id        = 1:12,
                                  creation  = c(mid_two_thousand, mid_two_thousand_one, mid_two_thousand_mid_two, mid_two_thousand_three, mid_two_thousand_four, mid_two_thousand_five, mid_two_thousand_six, mid_two_thousand_seven, mid_two_thousand_eight, mid_two_thousand_nine, mid_two_thousand_ten, mid_two_thousand_eleven)); 

# I use the DPLYR library as the group_by() + summarize() functions make this operation simple
library(dplyr);

# Group by id to set parameter for summarize() function 
check_table_grouped <- group_by(check_table, id);

# For each id in the table, sum the number of times that its creation date/time stamp is greater than the creation date/time and less than the censor date/time of all cases in the comparison table
# EDIT: Also added timing to compare with method below
system.time(check_table_summary <- summarize(check_table_grouped, other_open_values_at_creation_count = sum((comparison_table$created < creation & comparison_table$censored > creation))));

# Result is as desired
check_table_summary;              

# EDIT: Added @David-arenburg's solution with timing
library(data.table);
setDT(check_table)[, creation2 := creation];
setkey(comparison_table, created, censored);
system.time(foverlaps_table <- foverlaps(check_table, comparison_table, by.x = c("creation", "creation2"))[, sum(!is.na(id)), by = i.id]);

# Same results as above
foverlaps_table;

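This method works fine for small datasets such as the one in this example. However, even though I am using vectorized operations, the computation time grows roughly quadratically with the row count, because the operation count is (3 * nrow comparisons) * (nrow sum(nrow) calculations). At nrow = 10,000 the time is about 14 s; at nrow = 100,000 it is more than 20 minutes. My actual nrow is ~1,000,000.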
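To make the cost described above concrete, here is a minimal base-R sketch of the same brute-force count on synthetic data. The helper names `make_cases` and `count_open_brute` and the random date ranges are assumptions for illustration only; they do not come from the question.

```r
# Hypothetical helper (not from the question): n random cases with created < censored
make_cases <- function(n, seed = 1) {
  set.seed(seed)
  origin   <- as.POSIXct("2000-01-01 00:00:00", tz = "UTC")
  created  <- origin + sample.int(10 * 365 * 86400, n, replace = TRUE)
  censored <- created + sample.int(365 * 86400, n, replace = TRUE)
  data.frame(id = seq_len(n), created = created, censored = censored)
}

# Brute force: every check time is compared against every case,
# so the work is length(at) * nrow(cases) pairs of comparisons.
count_open_brute <- function(cases, at) {
  vapply(seq_along(at), function(i) {
    sum(cases$created < at[i] & cases$censored > at[i])
  }, integer(1))
}

cases  <- make_cases(2000)
counts <- count_open_brute(cases, cases$created)

# Number of (check time, case) pairs examined: 2000 * 2000
pairs <- length(cases$created) * nrow(cases)
```

Because the pair count is the product of the two row counts, growing both tables tenfold multiplies the work by about one hundred, which is in line with the timings reported above.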

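Is there a more efficient way to do this calculation? I am currently looking into multi-core options, but even those only reduce execution time linearly. Any help is much appreciated. Thanks!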

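EDIT: Added @David-arenburg's data.table::foverlaps solution, which also works and is faster for nrow < 1,000. However, it is slower than the summarize solution for larger numbers of rows. At 10,000 rows it took twice as long. At 50,000 rows I gave up after waiting 10 times as long. Interestingly, the foverlaps solution does not seem to trigger automatic garbage collection, so it regularly sits at maximum RAM (64 GB on my system), whereas the summarize solution periodically triggers automatic garbage collection and so never exceeds ~40 GB of RAM. I am not sure whether this is related to the speed difference.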

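FINAL EDIT: I have rewritten the question in a way that makes it much easier for respondents to generate large tables with suitable created/censored date/time stamps. I have also simplified the problem and explained it more clearly, making it plain that the lookup table is very large (which violates the data.table::foverlaps assumptions). I even built in a timing comparison to make testing large cases very simple! Details: Efficient method for counting open cases at time of each case's submission in large data set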

Thanks again to everyone for your help! :)

1> Khashaa:

Another foverlaps solution, assuming comparison_table is not too large:

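library(data.table);
setkey(comparison_table, created, censored);
times <- sort(unique(c(comparison_table$created, comparison_table$censored)))
# Probe one second after every event time (assumes whole-second time stamps)
dt <- data.table(creation=times+1)[, creation2 := creation];
setkey(dt, creation, creation2)
# Open cases at each probe; keyby keeps the counts sorted by probe time
counts <- foverlaps(comparison_table, dt, by.x = c("created", "censored"))[, .N, keyby = creation]
# Probes that overlap no case are absent from counts, so fill them with 0
x <- counts$N[match(as.numeric(times + 1), as.numeric(counts$creation))]
x[is.na(x)] <- 0L
# Prepend 0 for check times that precede every event (findInterval returns 0)
check_table$newcol <- c(0L, x)[findInterval(check_table$creation, times) + 1L]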
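
The same counts can also be obtained without an overlap join. The sketch below is not Khashaa's method but a swapped-in alternative in base R: sort the created and censored stamps separately, then use findInterval to count how many cases have started and how many have already ended at each check time. It assumes every case has created strictly before censored. The names `count_open`, `year_start` and `mid_year` are hypothetical helpers, and the data mirrors the example tables from the question.

```r
# Sketch: count open cases by sorting, not by comparing every pair.
# Assumes created < censored for every case.
count_open <- function(created, censored, at) {
  created  <- sort(as.numeric(created))
  censored <- sort(as.numeric(censored))
  at       <- as.numeric(at)
  started  <- findInterval(at, created, left.open = TRUE)  # cases with created < at
  finished <- findInterval(at, censored)                   # cases with censored <= at
  started - finished
}

# Hypothetical helpers that rebuild the question's example time stamps
year_start <- function(y) as.POSIXct(paste0(y, "-01-01"), tz = "UTC")
mid_year   <- function(y) as.POSIXct(paste0(y, "-06-01"), tz = "UTC")

created  <- year_start(c(2000, 2002, 2004, 2006, 2008, 2010, 2000, 2006, 2003, 2001))
censored <- year_start(c(2001, 2003, 2005, 2007, 2009, 2011, 2005, 2010, 2008, 2004))
at       <- mid_year(2000:2011)

open_counts <- count_open(created, censored, at)

# Cross-check against the pairwise definition used in the question
brute <- vapply(seq_along(at), function(i) {
  sum(created < at[i] & censored > at[i])
}, integer(1))
stopifnot(all(open_counts == brute))
print(open_counts)
```

Sorting dominates the cost here, so the work grows on the order of n log n rather than with the product of the two row counts. Whether this beats foverlaps on a given machine would need to be measured.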
