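Below is my current char* to hex string function. I wrote it as an exercise in bit manipulation. It takes about 7 ms on an AMD Athlon MP 2800+ to hexify a 10-million-byte array. Is there a trick or another approach that I am missing?
How can I make this faster?
Compiled with -O3 in g++.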
    #include <ctime>
    #include <iostream>
    #include <string>

    // Lookup table: entry i holds the two upper-case hex digits of byte value i.
    static const char _hex2asciiU_value[256][2] =
        { {'0','0'}, {'0','1'}, /* snip..., */ {'F','E'},{'F','F'} };

    std::string char_to_hex( const unsigned char* _pArray, unsigned int _len )
    {
        std::string str;
        str.resize(_len*2);
        char* pszHex = &str[0];
        const unsigned char* pEnd = _pArray + _len;

        clock_t stick, etick;
        stick = clock();
        // Two table reads per input byte, two output characters per input byte.
        for( const unsigned char* pChar = _pArray; pChar != pEnd; pChar++, pszHex += 2 ) {
            pszHex[0] = _hex2asciiU_value[*pChar][0];
            pszHex[1] = _hex2asciiU_value[*pChar][1];
        }
        etick = clock();

        std::cout << "ticks to hexify " << etick - stick << std::endl;

        return str;
    }
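Updates

Added timing code.

Brian R. Bondy: replaced the std::string with a heap-alloc'd buffer and changed *16 to ofs << 4. However, the heap-allocated buffer seems to slow it down? Result: ~11 ms.

Antti Sykäri: replaced the inner loop with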
    int upper = *pChar >> 4;
    int lower = *pChar & 0x0f;
    pszHex[0] = pHex[upper];
    pszHex[1] = pHex[lower];
Result: ~8 ms.
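For reference, the nibble-splitting variant can be written out as a complete function. This is only a minimal sketch: the name to_hex_nibbles and the table kHexDigits are assumptions made for illustration (kHexDigits stands in for the 16-entry pHex table, which the snippet does not show), and the timing code is left out.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts len bytes to upper-case hex using a
// 16-entry digit table and one shift/mask pair per input byte.
std::string to_hex_nibbles(const unsigned char* data, std::size_t len) {
    static const char kHexDigits[] = "0123456789ABCDEF";  // assumed pHex contents
    std::string out(len * 2, '\0');
    char* dst = &out[0];
    for (std::size_t i = 0; i < len; ++i) {
        const unsigned char byte = data[i];
        dst[2 * i]     = kHexDigits[byte >> 4];    // high nibble first
        dst[2 * i + 1] = kHexDigits[byte & 0x0F];  // then low nibble
    }
    return out;
}
```

The 16-byte table fits in a single cache line, at the cost of a shift and a mask per byte.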
Robert: replaced _hex2asciiU_value with a full 256-entry table, sacrificing memory space, but the result is ~7 ms!
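Since the full table is snipped in the listing, here is one way such a table could be generated instead of typed out by hand. It is a sketch under assumptions: HexPairTable and to_hex_table are hypothetical names, and the table is filled at first use rather than written as a literal initializer, which may or may not match what was actually measured.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the full 256-entry table: entry i holds the
// two upper-case hex digits of the byte value i.
struct HexPairTable {
    char pair[256][2];
    HexPairTable() {
        static const char kDigits[] = "0123456789ABCDEF";
        for (int i = 0; i < 256; ++i) {
            pair[i][0] = kDigits[i >> 4];
            pair[i][1] = kDigits[i & 0x0F];
        }
    }
};

std::string to_hex_table(const unsigned char* data, std::size_t len) {
    static const HexPairTable table;  // built once, on the first call
    std::string out(len * 2, '\0');
    char* dst = &out[0];
    for (std::size_t i = 0; i < len; ++i, dst += 2) {
        const char* p = table.pair[data[i]];  // one lookup yields both digits
        dst[0] = p[0];
        dst[1] = p[1];
    }
    return out;
}
```

The table costs 512 bytes, and in exchange the inner loop needs no shifts or masks.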
HoyHoy: noted that it was producing incorrect results.
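This assembly function (based on my previous post, although I had to modify the concept a bit to get it to actually work) processes 3.3 billion input characters per second (6.6 billion output characters) on one core of a 3 GHz Core 2 Conroe. Penryn is probably faster.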
    %include "x86inc.asm"

    SECTION_RODATA
    pb_f0: times 16 db 0xf0
    pb_0f: times 16 db 0x0f
    pb_hex: db 48,49,50,51,52,53,54,55,56,57,65,66,67,68,69,70

    SECTION .text

    ; int convert_string_to_hex( char *input, char *output, int len )

    cglobal _convert_string_to_hex,3,3
        movdqa xmm6, [pb_f0 GLOBAL]
        movdqa xmm7, [pb_0f GLOBAL]
    .loop:
        movdqa xmm5, [pb_hex GLOBAL]
        movdqa xmm4, [pb_hex GLOBAL]
        movq xmm0, [r0+r2-8]
        movq xmm2, [r0+r2-16]
        movq xmm1, xmm0
        movq xmm3, xmm2
        pand xmm0, xmm6 ;high bits
        pand xmm2, xmm6
        psrlq xmm0, 4
        psrlq xmm2, 4
        pand xmm1, xmm7 ;low bits
        pand xmm3, xmm7
        punpcklbw xmm0, xmm1
        punpcklbw xmm2, xmm3
        pshufb xmm4, xmm0
        pshufb xmm5, xmm2
        movdqa [r1+r2*2-16], xmm4
        movdqa [r1+r2*2-32], xmm5
        sub r2, 16
        jg .loop
        REP_RET
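Note that it uses x264 assembly syntax, which makes it more portable (32-bit vs. 64-bit, etc.). Converting it to the syntax of your choice is trivial: r0, r1, and r2 are the three arguments to the function, held in registers. It's a bit like pseudocode. Or you can just get common/x86/x86inc.asm from the x264 tree and include it to run the code natively.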
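For anyone who would rather stay in C++, the same pshufb idea can be sketched with SSSE3 intrinsics. This is not a translation of the routine above and carries no timing claim: the name to_hex_simd is hypothetical, it handles 8 input bytes per iteration with unaligned loads and stores, and it uses a scalar loop for the tail. The vector path is only compiled when the compiler defines __SSSE3__ (for example with g++ -mssse3); otherwise the scalar loop does all the work.

```cpp
#include <cstddef>
#include <string>
#if defined(__SSSE3__)
#include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8 (pshufb)
#endif

// Hypothetical helper: hexifies len bytes, 8 at a time where SSSE3 is enabled.
std::string to_hex_simd(const unsigned char* data, std::size_t len) {
    static const char kDigits[] = "0123456789ABCDEF";
    std::string out(len * 2, '\0');
    char* dst = &out[0];
    std::size_t i = 0;
#if defined(__SSSE3__)
    // The 16 hex digits act as a 16-entry lookup table held in one register.
    const __m128i digits =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(kDigits));
    const __m128i low_mask = _mm_set1_epi8(0x0F);
    for (; i + 8 <= len; i += 8) {
        // Load 8 input bytes into the low half of a register.
        const __m128i in =
            _mm_loadl_epi64(reinterpret_cast<const __m128i*>(data + i));
        // High nibbles: shift right by 4, then mask off bits that crossed bytes.
        const __m128i hi = _mm_and_si128(_mm_srli_epi16(in, 4), low_mask);
        // Low nibbles: just mask.
        const __m128i lo = _mm_and_si128(in, low_mask);
        // Interleave to hi0, lo0, hi1, lo1, ... giving 16 nibble values.
        const __m128i nibbles = _mm_unpacklo_epi8(hi, lo);
        // pshufb: each nibble selects one digit from the table.
        const __m128i hex = _mm_shuffle_epi8(digits, nibbles);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + 2 * i), hex);
    }
#endif
    // Scalar tail (or the whole input when SSSE3 is not enabled).
    for (; i < len; ++i) {
        dst[2 * i]     = kDigits[data[i] >> 4];
        dst[2 * i + 1] = kDigits[data[i] & 0x0F];
    }
    return out;
}
```

Unaligned loads and stores are used here so that the sketch works on any buffer; the assembly above uses movdqa, which requires 16-byte alignment.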
P.S. Stack Overflow, am I wrong for wasting time on such a trivial thing? Or is this awesome?