We are developing an embedded Linux system that uses Live555 WIS-Streamer to stream video over the network via RTSP.
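On one particular system, we are seeing WIS-Streamer get stuck in the TASK_UNINTERRUPTIBLE state. From the command line, ps shows the state of the process as DW, and the child processes of the WIS process are all listed in the Z (zombie) state.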
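Once we are in this state, it looks like there is nothing we can do except reboot (undesirable). However, we would really like to find the root cause. I suspect that it is hanging on a blocking call or something similar while sending the media stream. Is there anything we can do, either in code or from the command line, to try to narrow down what is being blocked on?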
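As an example, I have tried looking at the output of netstat (netstat -alp) to see whether there are any hanging sockets attached to the PIDs of the blocked/zombie threads, but to no avail.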
Update with more info:
It is not thrashing the CPU; top lists the blocked and zombie threads as 0% mem / 0% CPU / VSZ 0.
Other things I have tried on the system:
/proc/<pid>/status for the main and child threads (546 is the parent, which is blocked):

    $> cat /proc/546/status
    Name: wis-streamer
    State: D (disk sleep)
    Tgid: 546
    Pid: 546
    PPid: 1
    TracerPid: 0
    Uid: 0 0 0 0
    Gid: 0 0 0 0
    FDSize: 0
    Groups:
    Threads: 1
    SigQ: 17/353
    SigPnd: 0000000000000000
    ShdPnd: 0000000000004102
    SigBlk: 0000000000000000
    SigIgn: 0000000000001004
    SigCgt: 0000000180006a02
    CapInh: 0000000000000000
    CapPrm: ffffffffffffffff
    CapEff: ffffffffffffffff
    CapBnd: ffffffffffffffff
    Cpus_allowed: 1
    Cpus_allowed_list: 0
    voluntary_ctxt_switches: 997329
    nonvoluntary_ctxt_switches: 2428751
Children:

    Name: wis-streamer
    State: Z (zombie)
    Tgid: 581
    Pid: 581
    PPid: 546
    TracerPid: 0
    Uid: 0 0 0 0
    Gid: 0 0 0 0
    FDSize: 0
    Groups:
    Threads: 1
    SigQ: 17/353
    SigPnd: 0000000000000000
    ShdPnd: 0000000000000102
    SigBlk: 0000000000000000
    SigIgn: 0000000000001004
    SigCgt: 0000000180006a02
    CapInh: 0000000000000000
    CapPrm: ffffffffffffffff
    CapEff: ffffffffffffffff
    CapBnd: ffffffffffffffff
    Cpus_allowed: 1
    Cpus_allowed_list: 0
    voluntary_ctxt_switches: 856676
    nonvoluntary_ctxt_switches: 15626

    Name: wis-streamer
    State: Z (zombie)
    Tgid: 582
    Pid: 582
    PPid: 546
    TracerPid: 0
    Uid: 0 0 0 0
    Gid: 0 0 0 0
    FDSize: 0
    Groups:
    Threads: 1
    SigQ: 17/353
    SigPnd: 0000000000000000
    ShdPnd: 0000000000000102
    SigBlk: 0000000000000000
    SigIgn: 0000000000001004
    SigCgt: 0000000180006a02
    CapInh: 0000000000000000
    CapPrm: ffffffffffffffff
    CapEff: ffffffffffffffff
    CapBnd: ffffffffffffffff
    Cpus_allowed: 1
    Cpus_allowed_list: 0
    voluntary_ctxt_switches: 856441
    nonvoluntary_ctxt_switches: 15694

    Name: wis-streamer
    State: Z (zombie)
    Tgid: 583
    Pid: 583
    PPid: 546
    TracerPid: 0
    Uid: 0 0 0 0
    Gid: 0 0 0 0
    FDSize: 0
    Groups:
    Threads: 1
    SigQ: 17/353
    SigPnd: 0000000000000000
    ShdPnd: 0000000000000102
    SigBlk: 0000000000000000
    SigIgn: 0000000000001004
    SigCgt: 0000000180006a02
    CapInh: 0000000000000000
    CapPrm: ffffffffffffffff
    CapEff: ffffffffffffffff
    CapBnd: ffffffffffffffff
    Cpus_allowed: 1
    Cpus_allowed_list: 0
    voluntary_ctxt_switches: 856422
    nonvoluntary_ctxt_switches: 15837

    Name: wis-streamer
    State: Z (zombie)
    Tgid: 584
    Pid: 584
    PPid: 546
    TracerPid: 0
    Uid: 0 0 0 0
    Gid: 0 0 0 0
    FDSize: 0
    Groups:
    Threads: 1
    SigQ: 17/353
    SigPnd: 0000000000000000
    ShdPnd: 0000000000000102
    SigBlk: 0000000000000000
    SigIgn: 0000000000001004
    SigCgt: 0000000180006a02
    CapInh: 0000000000000000
    CapPrm: ffffffffffffffff
    CapEff: ffffffffffffffff
    CapBnd: ffffffffffffffff
    Cpus_allowed: 1
    Cpus_allowed_list: 0
    voluntary_ctxt_switches: 856339
    nonvoluntary_ctxt_switches: 15500
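The ShdPnd values in these dumps are hexadecimal bitmasks in which bit n-1 stands for signal number n. As a rough sketch (decode_sigmask is a hypothetical helper name, not something from the SDK or from WIS-Streamer), a mask can be turned into a list of signal numbers in plain POSIX shell, assuming the shell's arithmetic is 64 bits wide:

```shell
#!/bin/sh
# Hypothetical helper: print the signal numbers set in a hex mask such as
# the SigPnd/ShdPnd/SigBlk fields of /proc/<pid>/status.
# Bit (n - 1) of the mask corresponds to signal number n.
decode_sigmask() {
    mask=$((0x$1))      # the field is hex without a 0x prefix
    n=1
    out=""
    while [ "$n" -le 64 ]; do
        if [ $(( (mask >> (n - 1)) & 1 )) -eq 1 ]; then
            out="$out${out:+ }$n"
        fi
        n=$((n + 1))
    done
    printf '%s\n' "$out"
}
```

Called as decode_sigmask 0000000000004102 (the parent's ShdPnd) this prints 2 9 15, and for the children's 0000000000000102 it prints 2 9. On Linux those numbers are SIGINT, SIGKILL and SIGTERM, which fits a process that has been sent kill signals it cannot act on while it sits in the D state.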
Other stuff from the /proc filesystem:
    $> cat /proc/546/personality
    00c00000
    $> cat /proc/546/stat
    546 (wis-streamer) D 1 453 453 0 -1 4194564 391 0 135 0 140098 232409 0 0 20 0 1 0 1094 0 0 4294967295 0 0 0 0 0 0 0 4100 27138 3223605768 0 0 17 0 0 0 0 0 0
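The same /proc files can be scanned for every process at once. The following is only a sketch: stat_state and list_blocked are hypothetical names, PROC_ROOT is an override I have added so the function can be tried against a copy of /proc, and /proc/<pid>/wchan only shows a useful kernel symbol when the kernel was built with symbol support and the caller is allowed to read it.

```shell
#!/bin/sh
# Print the state letter from one line of /proc/<pid>/stat.
# The command name is wrapped in parentheses and may itself contain
# spaces or ')', so everything up to the last ") " is stripped first.
stat_state() {
    rest=${1##*) }
    printf '%s\n' "${rest%% *}"
}

# List every process in state D as "<pid><TAB><wait channel>".
# PROC_ROOT is a hypothetical override; it defaults to the real /proc.
list_blocked() {
    root=${PROC_ROOT:-/proc}
    for dir in "$root"/[0-9]*; do
        [ -r "$dir/stat" ] || continue
        line=$(cat "$dir/stat" 2>/dev/null) || continue
        [ "$(stat_state "$line")" = "D" ] || continue
        wchan=$(cat "$dir/wchan" 2>/dev/null)
        printf '%s\t%s\n' "${dir##*/}" "${wchan:-unknown}"
    done
}
```

Running list_blocked as root on the target should print one line per D-state process. If wchan names a kernel function, that is a first hint at which subsystem the process is sleeping in.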
Update to the update:
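I have a feeling that a SysV IPC message queue or semaphore call may be hanging. Our system is held together by inter-process message queues (at least 40% not-invented-here, written by Elbonian Code Slaves as part of a dreadful, dreadful SDK), which can be traps for the unwary. I have reworked a couple of semaphore get/release routines that I suspected were not exactly watertight (in fact they were probably only squirrel-proof) and will keep an eye on things. Unfortunately it takes on average 12 hours to run the particular test setup that causes this failure.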
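From the sysrq documentation:

'w' - Dumps tasks that are in uninterruptible (blocked) state.

    echo w > /proc/sysrq-trigger

This prints a lot of information about the blocked tasks on the console (it should also be visible in dmesg); the kernel stack traces in particular help to pin down the problem.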
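Since the failure takes many hours to show up, it may be worth leaving a small watcher running that fires the dump automatically. This is a sketch under assumptions: stuck_in_d and dump_if_stuck are hypothetical names, PROC_ROOT and SYSRQ_TRIGGER are overrides added only for dry runs, and the kernel must have been built with magic SysRq support for /proc/sysrq-trigger to exist. Processes pass through D briefly during normal I/O, so the check only succeeds if every sample sees D.

```shell
#!/bin/sh
# Succeed only if /proc/<pid>/stat reports state D on each of <samples>
# polls taken <interval> seconds apart. Hypothetical helper.
stuck_in_d() {
    pid=$1
    samples=${2:-5}
    interval=${3:-1}
    root=${PROC_ROOT:-/proc}
    i=0
    while [ "$i" -lt "$samples" ]; do
        line=$(cat "$root/$pid/stat" 2>/dev/null) || return 1
        rest=${line##*) }               # skip "pid (comm) "
        [ "${rest%% *}" = "D" ] || return 1
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
    return 0
}

# If the process looks stuck, ask the kernel to dump blocked tasks
# (the same 'w' request as the echo command above).
dump_if_stuck() {
    trigger=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}
    stuck_in_d "$1" "${2:-5}" "${3:-1}" || return 1
    echo w > "$trigger"
}
```

On the target this could be run as root with something like: while ! dump_if_stuck 546 5 1; do sleep 10; done; dmesg > /tmp/blocked.txt. The PID and the output path are placeholders, and the loop does not stop by itself if the process exits.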