Given a std::bitset<64> bits with any number of bits set and a bit position X (0-63), what is the most efficient way to count the bits at position X or lower, or return 0 if the bit at X is not set?
Note: if the bit is set, the return value is always at least 1.
The brute-force way is slow:

    int countupto(std::bitset<64> bits, int X) {
        if (!bits[X]) return 0;
        int total = 1;
        for (int i = 0; i < X; ++i) {
            total += bits[i];
        }
        return total;
    }
The count() method of bitset will give you the popcount of all the bits, but bitset doesn't support ranges.

Note: this is not a duplicate of How to count the number of set bits in a 32-bit integer?, since that asks about all bits rather than the range 0 through X.
This C++ gets g++ to emit very good x86 ASM (godbolt compiler explorer). I expect it will compile efficiently on other 64bit architectures, too (if there's a HW popcount for std::bitset::count to use, otherwise that'll always be the slow part):
    #include <bitset>

    int popcount_subset(std::bitset<64> A, int pos) {
      int high_bits_to_eliminate = 63 - pos;
      A <<= (high_bits_to_eliminate & 63);  // puts A[pos] at A[63].

      return (A[63]? ~0ULL : 0) & A.count();  // most efficient way: great code with gcc and clang
      // see the godbolt link for some #ifdefs with other ways to do the check, like
      //     return A[BSET_SIZE-1] ? A.count() : 0;
    }
This probably isn't optimal on 32bit architectures, so compare other alternatives if you need to make a 32bit build.
This will work for other sizes of bitset, as long as you do something about the hard-coded 63s, and change the & 63 mask for the shift count into a more general range-check. For optimal performance with strange-size bitsets, make a template function with a specialization for size <= register width of the target machine. In that case, extract the bitset to an unsigned type of the appropriate width, and shift to the top of the register instead of the top of the bitset.
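A rough sketch of that small-size path (the template shape and helper name are my own, and the popcount builtin assumes GCC/Clang):

    #include <bitset>
    #include <cstddef>
    #include <cstdint>

    // For N <= 64: pull the bits out as an integer and shift the tested bit
    // to the top of a 64-bit register instead of the top of the bitset.
    template <std::size_t N>
    int popcount_subset_small(std::bitset<N> A, int pos) {
        static_assert(N <= 64, "small-size path only");
        std::uint64_t v = A.to_ullong();
        v <<= (63 - pos) & 63;                    // bit 'pos' becomes bit 63
        std::uint64_t keep = 0ULL - (v >> 63);    // all-ones if bit 'pos' was set, else 0
        return (int)(keep & (unsigned)__builtin_popcountll(v));
    }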
You'd expect this to also generate ideal code for bitset<32>, but it doesn't quite. gcc/clang still use 64bit registers on x86-64.
For large bitsets, shifting the whole thing will be slower than just popcounting the words below the one containing pos, and using this on that word. (This is where a vectorized popcount really shines on x86 if you can assume SSSE3 but not the popcnt insn hardware support, or for 32bit targets. AVX2 256bit pshufb is the fastest way to do bulk popcounts, but without AVX2 I think 64bit popcnt is pretty close to a 128-bit pshufb implementation. See the comments for more discussion.)
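A sketch of that word-at-a-time structure (names made up; std::bitset has no portable word access, so this rebuilds each 64-bit word bit-by-bit just to stay self-contained, where a real implementation would read the underlying storage directly; the popcount builtin assumes GCC/Clang):

    #include <bitset>
    #include <cstddef>
    #include <cstdint>

    template <std::size_t N>
    int popcount_below_big(const std::bitset<N>& A, int pos) {
        if (!A[pos]) return 0;
        auto word_at = [&](std::size_t base, std::size_t nbits) {
            std::uint64_t w = 0;
            for (std::size_t b = 0; b < nbits; ++b)
                w |= (std::uint64_t)A[base + b] << b;
            return w;
        };
        int total = 0;
        std::size_t full_words = (std::size_t)pos / 64;
        for (std::size_t i = 0; i < full_words; ++i)      // whole words below pos's word
            total += __builtin_popcountll(word_at(i * 64, 64));
        // The word containing pos: keep only bits up to and including pos.
        total += __builtin_popcountll(word_at(full_words * 64, (std::size_t)pos % 64 + 1));
        return total;
    }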
If you have an array of 64-bit elements, and want to count the bits below a certain position in each one separately, then you definitely should use SIMD. The shift parts of this algorithm vectorize, not just the popcnt part. Use psadbw against an all-zero register to horizontally sum bytes into 64-bit chunks after a pshufb-based popcnt that produces counts for the bits in each byte separately. SSE/AVX doesn't have 64-bit arithmetic right shifts, but you can use a different technique to blend on the high bit of each element.
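For reference, a minimal sketch of that pshufb-LUT + psadbw per-element popcount for one 128-bit vector (SSSE3; the function name is illustrative, and the "blend on the high bit of each element" step is not shown):

    #include <immintrin.h>

    // Per-64-bit-element popcount: 4-bit LUT via pshufb, then psadbw to sum bytes.
    // Compile with -mssse3 (or higher).
    static inline __m128i popcnt_epi64_ssse3(__m128i v) {
        const __m128i lut  = _mm_setr_epi8(0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4); // popcount of each nibble
        const __m128i low4 = _mm_set1_epi8(0x0F);
        __m128i lo  = _mm_and_si128(v, low4);                     // low nibble of each byte
        __m128i hi  = _mm_and_si128(_mm_srli_epi16(v, 4), low4);  // high nibble of each byte
        __m128i cnt = _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                                   _mm_shuffle_epi8(lut, hi));    // per-byte popcounts
        return _mm_sad_epu8(cnt, _mm_setzero_si128());            // psadbw vs 0: sum bytes into each 64-bit half
    }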
The asm instructions you want the compiler to output will:

1. remove the unwanted bits from the 64-bit value
2. test the highest of the wanted bits
3. popcount it
4. return 0 or the popcount, depending on the result of the test. (Branchless and branching implementations each have advantages. If the branch is predictable, a branchless implementation tends to be slower.)

The obvious way to do step 1 is to generate a mask ((1 << (pos+1)) - 1) and & with it. A more efficient way is to left-shift by 63-pos, leaving the bits you want packed at the top of a register.

This also has the interesting side-effect of putting the bit you want to test as the top bit in the register. Testing the sign bit, rather than any other arbitrary bit, takes slightly fewer instructions. An arithmetic right shift can broadcast the sign bit to the rest of the register, allowing for more-efficient-than-usual branchless code.
Doing the popcount is a much-discussed problem, but is actually the trickier part of the puzzle. On x86, there is extremely efficient hardware support for it, but only on recent-enough hardware. On Intel CPUs, the popcnt instruction is only available on Nehalem and newer. I forget when AMD added support.

So to use it safely, you need to either do CPU dispatching with a fallback that doesn't use popcnt, or make separate binaries that do/don't depend on that CPU feature.
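A sketch of what such dispatching can look like with GCC/Clang on x86 (names made up; __builtin_cpu_supports and the target attribute are GCC/Clang extensions):

    #include <cstdint>

    __attribute__((target("popcnt")))
    static int popcount64_hw(std::uint64_t x) {
        return __builtin_popcountll(x);          // emits the popcnt instruction in this function
    }

    static int popcount64_generic(std::uint64_t x) {
        return __builtin_popcountll(x);          // bithack / libgcc fallback when built without -mpopcnt
    }

    static int popcount64(std::uint64_t x) {
        static const bool hw = __builtin_cpu_supports("popcnt");  // checked once at first call
        return hw ? popcount64_hw(x) : popcount64_generic(x);
    }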
popcount without the popcnt instruction can be done several ways. One uses SSSE3 pshufb to implement a 4-bit LUT. This is most effective when used over a whole array, though, rather than a single 64b at a time. Scalar bithacks might be best here, and wouldn't require SSSE3 (and so would be compatible with ancient AMD CPUs that have 64bit but not pshufb).
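For completeness, the classic SWAR bithack is the usual scalar fallback (standard public-domain bit-twiddling, not code from this answer):

    #include <cstdint>

    static inline int popcount64_swar(std::uint64_t x) {
        x = x - ((x >> 1) & 0x5555555555555555ULL);                           // 2-bit sums
        x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL); // 4-bit sums
        x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                           // per-byte sums
        return (int)((x * 0x0101010101010101ULL) >> 56);                      // add up all bytes
    }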
(A[63]? ~0ULL : 0) asks the compiler to broadcast the high bit to all the other bit positions, allowing it to be used as an AND-mask to zero (or not) the popcount result. Note that even for large bitset sizes, it's still only masking the output of popcnt, not the bitset itself, so ~0ULL is fine. I used ULL to make sure I was never asking the compiler to broadcast the bit only to the low 32b of a register (as UL would on Windows, for example).
This broadcast can be done with an arithmetic right shift by 63, which shifts in copies of the high bit.
clang generated this code from the original version. After some prodding from Glenn about different implementations for step 4, I realized that I could lead gcc towards clang's optimal solution by writing the source more like the ASM I want. The obvious ((int64_t)something) >> 63 to more directly request an arithmetic right shift would not be strictly portable, because signed right-shifts are implementation-defined as either arithmetic or logical. The standard doesn't provide any portable arithmetic right-shift operator. (It's not undefined behaviour, though.) Anyway, fortunately compilers are smart enough: gcc sees the best way once you give it enough of a hint.
This source makes great code on x86-64 and ARM64 with gcc and clang. Both simply use an arithmetic right shift on the input to popcnt (so the shift can run in parallel with the popcnt). It also compiles great on 32bit x86 with gcc, because the masking only happens to a 32bit variable (after multiple popcnt results are added). It's the rest of the function that's nasty on 32bit (when the bitset is larger than a register).
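As a side note, one portable way to express that sign-bit broadcast in source (a minimal sketch with a made-up helper name, not the exact code behind the godbolt link) is to derive the mask from an unsigned shift:

    #include <cstdint>

    // 0 if the high bit of x is clear, all-ones if it is set.
    // Well-defined without relying on implementation-defined signed shifts;
    // gcc/clang recognize this kind of pattern and can emit a single sar.
    static inline std::uint64_t broadcast_high_bit(std::uint64_t x) {
        return 0ULL - (x >> 63);
    }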
Original ternary-operator version with gcc
Compiled with gcc 5.3.0 -O3 -march=nehalem -mtune=haswell (older gcc, like 4.9.2, also still emit this):
    ; the original ternary-operator version. See below for the optimal version we can coax gcc into emitting.
    popcount_subset(std::bitset<64ul>, int):   ; input bitset in rdi, input count in esi (SysV ABI)
        mov     ecx, esi    ; x86 variable-count shift requires the count in cl
        xor     edx, edx    ; edx=0
        xor     eax, eax    ; gcc's workaround for popcnt's false dependency on the old value of dest, on Intel
        not     ecx         ; two's complement bithack for 63-pos (in the low bits of the register)
        sal     rdi, cl     ; rdi << ((63-pos) & 63); same insn as shl (arithmetic == logical left shift)
        popcnt  rdx, rdi
        test    rdi, rdi    ; sets SF if the high bit is set.
        cmovs   rax, rdx    ; conditional-move on the sign flag
        ret
See How to prove that the C statement -x, ~x+1, and ~(x-1) yield the same results? for background on gcc's use of the -x == ~x + 1 two's complement identity. (And Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted? which tangentially mentions that shl masks the shift count, so we only need the low 6 bits of ecx to hold 63 - pos. Mostly linking that because I wrote it recently and anyone still reading this paragraph might find it interesting.)
Some of those instructions will go away when inlining. (e.g. gcc would generate the count in ecx in the first place.)
With Glenn's multiply instead of ternary operator idea (enabled by USE_mul), gcc does

    shr     rdi, 63
    imul    eax, edi

at the end instead of xor / test / cmovs.
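In C++ source terms, the multiply idea is roughly this (a sketch; the actual USE_mul source is on the godbolt link):

    #include <bitset>

    // Multiply the popcount by the tested bit (0 or 1) instead of selecting with ?:.
    int popcount_subset_mul(std::bitset<64> A, int pos) {
        A <<= (63 - pos) & 63;            // bit 'pos' is now A[63]
        return (int)A.count() * A[63];    // 0 if the bit wasn't set
    }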
Per-instruction costs (Intel Haswell, from Agner Fog's tables):

mov r,r: 1 fused-domain uop, 0 latency, no execution unit
xor-zeroing: 1 fused-domain uop, no execution unit
not: 1 uop for p0/p1/p5/p6, 1c latency, 1 per 0.25c throughput
shl (aka sal) with count in cl: 3 uops for p0/p6: 2c latency, 1 per 2c throughput. (Agner Fog's data indicates that IvyBridge only takes 2 uops for this, strangely.)
popcnt: 1 uop for p1, 3c latency, 1 per 1c throughput
shr r,imm: 1 uop for p0/p6, 1c latency. 1 per 0.5c throughput.
imul r,r: 1 uop for p1, 3c latency.
(not counting the ret)
Totals:
9 fused-domain uops, can issue in 2.25 cycles (in theory; uop cache-line effects usually bottleneck the frontend slightly).
4 uops (shifts) for p0/p6. 2 uops for p1. 1 any-ALU-port uop. Can execute at one per 2c (saturating the shift ports), so the frontend is the worst bottleneck.
Latency: Critical path from when the bitset is ready to when the result is ready: shl (2) -> popcnt (3) -> imul (3). Total 8 cycles. Or 9c from when pos is ready, because the not is an extra 1c of latency for it.
The optimal bitbroadcast version replaces shr with sar (same perf), and imul with and (1c latency instead of 3c, runs on any port). So the only perf change is reducing the critical path latency to 6 cycles. Throughput is still bottlenecked on the frontend. and being able to run on any port doesn't make a difference, unless you're mixing this with code that bottlenecks on port1 (instead of looking at the throughput for running just this code in a tight loop).
cmov (ternary operator) version: 11 fused-domain uops (frontend: one per 2.75c). Execution units: still bottlenecked on the shift ports (p0/p6) at one per 2c. Latency: 7c from bitset to result, 8c from pos to result. (cmov is 2c latency, 2 uops for any of p0/p1/p5/p6.)
Clang has some different tricks up its sleeve: Instead of test/cmovs, it generates a mask of either all-ones or all-zeros by using an arithmetic right-shift to broadcast the sign bit to all positions of a register. I love it: Using and instead of cmov is more efficient on Intel. It still has the data-dependency and does the work for both sides of the branch (which is the main downside to cmov in general), though. Update: with the right source code, gcc will use this method, too.
clang 3.7 -O3 -Wall -march=nehalem -mtune=haswell
    popcount_subset(std::bitset<64ul>, int):
        mov     ecx, 63
        sub     ecx, esi    ; larger code size, but faster on CPUs without mov-elimination
        shl     rdi, cl     ; rdi << ((63-pos) & 63)
        popcnt  rax, rdi    ; doesn't start a fresh dep chain before this, like gcc does
        sar     rdi, 63     ; broadcast the sign bit
        and     eax, edi    ; eax = 0 or its previous value
        ret
sar / and replaces xor / test / cmov, and cmov is a 2-uop instruction on Intel CPUs, so that's really nice. (For the ternary-operator version.)

Clang still does the sar / and trick instead of an actual imul when using the multiply source version, or the "bitbroadcast" source version. So those help gcc without hurting clang. (sar/and is definitely better than shr/imul: 2c less latency on the critical path.) The pow_of_two_sub version does hurt clang (see the first godbolt link: omitted from this answer to avoid clutter with ideas that didn't pan out).
The mov ecx, 63 / sub ecx, esi is actually faster on CPUs without mov-elimination for reg,reg moves (zero latency and no execution port, handled by register renaming). This includes Intel pre-IvyBridge, but not more recent Intel and AMD CPUs.

Clang's mov imm / sub method puts only one cycle of latency for pos onto the critical path (beyond the bitset->result latency), instead of two for a mov ecx, esi / not ecx on CPUs where mov r,r has 1c latency.
With BMI2 (Haswell and later), an optimal ASM version can save a mov to ecx. Everything else works the same, because shlx masks its shift-count input register down to the operand-size, just like shl.
x86 shift instructions have crazy CISC semantics where if the shift count is zero, the flags aren't affected. So variable-count shift instructions have a (potential) dependency on the old value of the flags. "Normal" x86 shl r, cl decodes to 3 uops on Haswell, but BMI2 shlx r, r, r is only 1. So it's too bad that gcc still emits sal with -march=haswell, instead of using shlx (which it does use in some other cases).
    // hand-tuned BMI2 version using the NOT trick and the bitbroadcast
    popcount_subset(std::bitset<64ul>, int):
        not     esi           ; The low 6 bits hold 63-pos. gcc's two's complement trick
        xor     eax, eax      ; break false dependency on Intel. maybe not needed when inlined.
        shlx    rdi, rdi, rsi ; rdi << ((63-pos) & 63)
        popcnt  rax, rdi
        sar     rdi, 63       ; broadcast the sign bit: rdi=0 or -1
        and     eax, edi      ; eax = 0 or its previous value
        ret
Perf analysis for Intel Haswell: 6 fused-domain uops (frontend: one per 1.5c). Execution units: 2 p0/p6 shift uops. 1 p1 uop. 2 any-port uops (one per 1.25c from total execution port limits). Critical path latency: shlx (1) -> popcnt (3) -> and (1) = 5c bitset->result (or 6c from pos->result).
Note that when inlining, a human (or smart compiler) could avoid the need for the xor eax, eax. It's only there because of popcnt's false dependency on the output register (on Intel), and we need the output in eax (which the caller may have used recently for a long dep chain). With -mtune=bdver2 or something, gcc won't zero the register it's going to use for popcnt output.
When inlining, we could use an output register that already has to be ready at least as early as popcnt's source reg to avoid the problem. Compilers will do an in-place popcnt rdi,rdi when the source isn't needed later, but that's not the case here. Instead, we can pick another register that already has to be ready before the source. popcnt's input depends on 63-pos, and we can clobber it, so popcnt rsi,rdi's dependency on rsi can't delay it. Or if we had 63 in a register, we could popcnt rsi,rdi / sarx rax, rsi, reg_63 / and eax, esi. Or BMI2 3-operand shift instructions would also let us not clobber inputs in case they're needed afterwards.
This is so light-weight that loop overhead and setting up the input operands / storing the results are going to be major factors. (And the 63-pos can optimize away with a compile-time constant, or into wherever a variable count comes from.)
The Intel compiler amusingly shoots itself in the foot and doesn't take advantage of the fact that A[63] is the sign bit. shl / bt rdi, 63 / jc. It even sets up the branches in a really dumb way. It could zero eax, and then jump over popcnt or not based on the sign flag set by shl.
An optimal branching implementation, starting from ICC13 output from -O3 -march=corei7 on godbolt:
    // hand-tuned, not compiler output
        mov     ecx, esi    ; ICC uses neg/add/mov :/
        not     ecx
        xor     eax, eax    ; breaks the false dep, or is the return value in the taken-branch case
        shl     rdi, cl
        jns     .bit_not_set
        popcnt  rax, rdi
    .bit_not_set:
        ret
That's pretty much optimal: The A[pos] == true case has one not-taken branch. It doesn't save very much over the branchless method, though.

If the A[pos] == false case is more common: jump over a ret instruction, to a popcnt / ret. (Or after inlining: jump to a block at the end that does the popcnt and jumps back.)
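For reference, the branchy variant in C++ source is essentially the A[BSET_SIZE-1] ? A.count() : 0 form mentioned in the first code block; spelled out with an early return (function name made up):

    #include <bitset>

    int popcount_subset_branchy(std::bitset<64> A, int pos) {
        A <<= (63 - pos) & 63;      // bit 'pos' is now the top bit
        if (!A[63]) return 0;       // hopefully-predictable early out
        return (int)A.count();
    }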
My immediate reaction would be to test the specified bit, and immediately return 0 if it's clear.

If you get past that, create a bit-mask with that bit (and the less significant ones) set, and and it with the original input. Then use the count() member function to get the number of bits set in the result.

As for creating the mask: you can shift 1 left by X+1 places, then subtract 1.
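A minimal sketch of that recipe (function name made up; the X == 63 case needs special handling because 1ULL << 64 would be undefined):

    #include <bitset>

    int count_through_bit(const std::bitset<64>& bits, int X) {
        if (!bits[X]) return 0;                        // specified bit clear -> 0
        std::bitset<64> mask = (X == 63)
            ? std::bitset<64>().set()                  // all 64 bits
            : std::bitset<64>((1ULL << (X + 1)) - 1);  // bits 0..X set
        return (int)(bits & mask).count();
    }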
Assuming an unsigned long or unsigned long long is big enough to hold 64 bits, you can call bits.to_ulong() (or bits.to_ullong()) to get the bitset data as an integer, mask off the bits above X ((1 << X) - 1), and then count those bits as given in the answer to the question you linked to.
It's easy to convert between a bit position and the mask of bits below it, so something like this should work:

    int popcnt(bitset<64> bs, int x) {
        // Early out when bit not set
        if (!bs[x]) return 0;
        // Otherwise, make mask from `x`, mask and count bits
        // (1ULL rather than 1UL so the shift is done in 64 bits even on 32-bit targets)
        return (bs & bitset<64>((1ULL << x) - 1)).count() + 1;
    }
The assumption here is that bitset::count is implemented efficiently (using popcnt intrinsics or an efficient fallback); this isn't guaranteed, but the STL folks tend to optimize this sort of thing.