最近,有人询问了一种在C中反转字符串的算法.在处理非单字节字符串时,大多数提议的解决方案都存在问题.所以,我想知道什么是专门处理utf-8字符串的好算法.
我提出了一些代码,我将其作为答案发布,但我很高兴看到其他人的想法或建议.我更喜欢使用实际代码,因此我选择了C#,因为它似乎是本网站中最流行的语言之一,但我不介意你的代码是否是另一种语言,只要它可以合理任何熟悉命令式语言的人都能理解.而且,由于这是为了看看这样的算法如何在低级别实现(通过低级别我只是意味着处理字节),我们的想法是避免使用库来实现核心代码.
笔记:
我对算法本身,它的性能以及它如何进行优化感兴趣(我的意思是算法级优化,而不是用++ i替换i ++等等;我对实际的基准测试也不感兴趣).
我并不是要在生产代码中实际使用它或"重新发明轮子".这只是出于好奇和锻炼.
我正在使用C#字节数组,所以我假设您可以获得字符串的长度,而不会运行字符串直到找到NUL.也就是说,我没有考虑到找到字符串长度的复杂性.但是,如果你正在使用C语言,你可以在调用核心代码之前使用strlen()来解决这个问题.
编辑:
正如Mike F指出的那样,我的代码(以及此处发布的其他人的代码)并不处理复合字符.那些有些信息在这里.我不熟悉这个概念,但是如果这意味着存在"组合字符",即只与其他"基本"字符/代码点组合有效的字符/代码点,那么这样的查找表当反转时,字符可用于保持"全局"字符("基本"+"组合"字符)的顺序.
我将一次通过反转字节,然后第二次通过将任何多字节字符(在UTF8中很容易检测到)中的字节反转回正确的顺序.
你绝对可以在一次通过中处理这个问题,但除非例程成为瓶颈,否则我不会打扰.
此代码假定输入的UTF-8字符串有效且格式正确(即每个多字节字符最多4个字节):
#include "string.h" void utf8rev(char *str) { /* this assumes that str is valid UTF-8 */ char *scanl, *scanr, *scanr2, c; /* first reverse the string */ for (scanl= str, scanr= str + strlen(str); scanl < scanr;) c= *scanl, *scanl++= *--scanr, *scanr= c; /* then scan all bytes and reverse each multibyte character */ for (scanl= scanr= str; c= *scanr++;) { if ( (c & 0x80) == 0) // ASCII char scanl= scanr; else if ( (c & 0xc0) == 0xc0 ) { // start of multibyte scanr2= scanr; switch (scanr - scanl) { case 4: c= *scanl, *scanl++= *--scanr, *scanr= c; // fallthrough case 3: // fallthrough case 2: c= *scanl, *scanl++= *--scanr, *scanr= c; } scanr= scanl= scanr2; } } } // quick and dirty main for testing purposes #include "stdio.h" int main(int argc, char* argv[]) { char buffer[256]; buffer[sizeof(buffer)-1]= '\0'; while (--argc > 0) { strncpy(buffer, argv[argc], sizeof(buffer)-1); // don't overwrite final null printf("%s ? ", buffer); utf8rev(buffer); printf("%s\n", buffer); } return 0; }
如果编译此程序(示例名称so199260.c
:)并在UTF-8环境(在本例中为Linux安装)上运行它:
$ so199260 ???? ??? ???? français ???? a????b a????b ? b????a ???? ? ???? français ? siaçnarf ???? ? ???? ??? ? ??? ???? ? ????
如果代码太含糊不清,我会很乐意澄清.
同意您的方法是在现场进行的唯一合理的方法.
就个人而言,我不喜欢在处理它的每个函数中重新验证UTF8,并且通常只做必要的事情来避免崩溃; 它增加了很多代码.Dunno很多C#所以这里是C:
(编辑以消除strlen)
void reverse( char *start, char *end ) { while( start < end ) { char c = *start; *start++ = *end; *end-- = c; } } char *reverse_char( char *start ) { char *end = start; while( (end[1] & 0xC0) == 0x80 ) end++; reverse( start, end ); return( end+1 ); } void reverse_string( char *string ) { char *end = string; while( *end ) end = reverse_char( end ); reverse( string, end-1 ); }
我最初的方法可以这样总结:
1)天真地反向字节
2)向后运行字符串并随时修复utf8序列.
在第二步中处理非法序列,在第一步中,我们检查字符串是否处于"同步"状态(即,如果它以合法的前导字节开始).
编辑:改进Reverse()中前导字节的验证
class UTF8Utils { public static void Reverse(byte[] str) { int len = str.Length; int i = 0; int j = len - 1; // first, check if the string is "synced", i.e., it starts // with a valid leading character. Will check for illegal // sequences thru the whole string later. byte leadChar = str[0]; // if it starts with 10xx xxx, it's a trailing char... // if it starts with 1111 10xx or 1111 110x // it's out of the 4 bytes range. // EDIT: added validation for 7 bytes seq and 0xff if( (leadChar & 0xc0) == 0x80 || (leadChar & 0xfc) == 0xf8 || (leadChar & 0xfe) == 0xfc || (leadChar & 0xff) == 0xfe || leadChar == 0xff) { throw new Exception("Illegal UTF-8 sequence"); } // reverse bytes in-place naïvely while(i < j) { byte tmp = str[i]; str[i] = str[j]; str[j] = tmp; i++; j--; } // now, run the string again to fix the multibyte sequences UTF8Utils.ReverseMbSequences(str); } private static void ReverseMbSequences(byte[] str) { int i = str.Length - 1; byte leadChar = 0; int nBytes = 0; // loop backwards thru the reversed buffer while(i >= 0) { // since the first byte in the unreversed buffer is assumed to be // the leading char of that byte, it seems safe to assume that the // last byte is now the leading char. (Given that the string is // not out of sync -- we checked that out already) leadChar = str[i]; // check how many bytes this sequence takes and validate against // illegal sequences if(leadChar < 0x80) { nBytes = 1; } else if((leadChar & 0xe0) == 0xc0) { if((str[i-1] & 0xc0) != 0x80) { throw new Exception("Illegal UTF-8 sequence"); } nBytes = 2; } else if ((leadChar & 0xf0) == 0xe0) { if((str[i-1] & 0xc0) != 0x80 || (str[i-2] & 0xc0) != 0x80 ) { throw new Exception("Illegal UTF-8 sequence"); } nBytes = 3; } else if ((leadChar & 0xf8) == 0xf0) { if((str[i-1] & 0xc0) != 0x80 || (str[i-2] & 0xc0) != 0x80 || (str[i-3] & 0xc0) != 0x80 ) { throw new Exception("Illegal UTF-8 sequence"); } nBytes = 4; } else { throw new Exception("Illegal UTF-8 sequence"); } // now, reverse the current sequence and then continue // whith the next one int back = i; int front = back - nBytes + 1; while(front < back) { byte tmp = str[front]; str[front] = str[back]; str[back] = tmp; front++; back--; } i -= nBytes; } } }