Sunday, May 22, 2011

Get UTF-8 string from a memory stream

I’m not sure why, but sometimes when you convert a deserialized MemoryStream object into a UTF-8 string, you get an unexpected character (namely a question mark, '?') at the very first position of the string. I haven't completely understood why this is the case, but I managed to find a workaround. Basically, this '?' symbol is not actually a question mark character. It stands for a sequence of three bytes at the start of the stream that together decode to a single invisible character, which then gets displayed as '?'.
The bytes are: 239, 187, 191 (0xEF, 0xBB, 0xBF in hexadecimal).
To fix the problem, just remove these three bytes from the start of the stream before converting it to a string.
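A minimal sketch of this workaround in C#, assuming the data sits in a MemoryStream and that Encoding.UTF8.GetString is the static call doing the conversion. The class and method names are hypothetical, chosen only for illustration:

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8StreamText
{
    // The three leading bytes described above: 239, 187, 191.
    private static readonly byte[] Prefix = { 0xEF, 0xBB, 0xBF };

    // Hypothetical helper: decodes the whole stream as UTF-8,
    // skipping the three leading bytes when they are present.
    public static string GetString(MemoryStream stream)
    {
        // ToArray copies the full contents regardless of the current Position.
        byte[] bytes = stream.ToArray();

        bool hasPrefix = bytes.Length >= Prefix.Length
            && bytes[0] == Prefix[0]
            && bytes[1] == Prefix[1]
            && bytes[2] == Prefix[2];

        int offset = hasPrefix ? Prefix.Length : 0;
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }
}
```

Checking for the bytes before skipping them keeps the helper safe for streams that do not start with this sequence.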

2 comments:

Ilya Troitskiy said...

This is the Byte Order Mark: http://en.wikipedia.org/wiki/UTF-8#Byte_order_mark
I think you should simply read the string from the stream instead of treating it as a byte array.

Bashir Magomedov said...

Thanks. I didn't know about the BOM, but suspected something like this. Regarding StreamReader, I knew about it, but if you are pretty sure that the string is in UTF-8 format, I think it is much easier to use a single static call than to bother with readers.