Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
One can really never get enough of puns about the BOM (Byte Order Mark) and TSA.
And when I say one, I mean I. :-)
Just think back to blogs like Don't sneak a BOM in on someone who promises to ignore free space or Everyone seems averse to the BOM these days; Should we blame TSA? :-) or How to get yourself imprisoned [by/for talking about Unicode].
See what I mean?
I was reminded of this when Pritam asked:
Is there any tool or code available to verify Byte Order Mark signature in XML files?
Of course sniffing out a few bytes is easy enough. Abhinaba provided the full chart of valid BOM values:
Bytes          Encoding Form
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8
Easy, right?
Okay, anyone want to make a try at writing the minimal code BOM detector?
Think of it as a way to play your part in airport security!
Points awarded for clearest, or for most concise, or for briefest, or for most clever, or for the sake of maintainability, most smart.
If you can write something able to handle other, non-standard byte orderings of data, then you probably went to Cal Tech! :-)
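For anyone who wants a starting point, here is a minimal table-driven sketch of the chart in C++. Every name in it (`Bom`, `BomEntry`, `kBomTable`, `detect_bom`) is made up for illustration, and it assumes the caller already has the first few bytes of the file in memory. The one subtle part is ordering: FF FE 00 00 begins with FF FE, so the four-byte signatures have to be tried before the two-byte ones.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical names throughout; this is a sketch, not a reference implementation.
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

struct BomEntry {
    const char* bytes;   // signature bytes (may contain embedded zeros)
    std::size_t length;  // number of signature bytes
    Bom kind;
};

// Longest signatures first, so UTF-32LE is not mistaken for UTF-16LE.
static const BomEntry kBomTable[] = {
    {"\x00\x00\xFE\xFF", 4, Bom::Utf32BE},
    {"\xFF\xFE\x00\x00", 4, Bom::Utf32LE},
    {"\xEF\xBB\xBF",     3, Bom::Utf8},
    {"\xFE\xFF",         2, Bom::Utf16BE},
    {"\xFF\xFE",         2, Bom::Utf16LE},
};

// Returns the BOM found at the start of data, or Bom::None.
Bom detect_bom(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    for (const BomEntry& entry : kBomTable) {
        if (size >= entry.length &&
            std::memcmp(data, entry.bytes, entry.length) == 0) {
            return entry.kind;
        }
    }
    return Bom::None;
}
```

Keep in mind that FF FE 00 00 is also what a UTF-16LE BOM followed by U+0000 looks like, so the four-byte match is a best guess rather than a certainty.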
This post brought to you by U+feff (aka ZERO WIDTH NO-BREAK SPACE)
wait...which plane? BMP? SMP? SIP? TIP? SSP?
Sorry, I have nothing else useful to contribute here, though I'm a little surprised this problem isn't solved already...?!?
See http://recycledknowledge.blogspot.com/2005/07/hello-i-am-xml-encoding-sniffer.html for a formal English description of what you have to do to play in the Appendix F leagues.
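Since the original question was about XML files, here is a rough sketch of the other half of Appendix F as I understand it: when there is no BOM, the first four bytes of `<?xm` still give away the encoding family. The function name and the returned strings are my own invention, and the rarer cases (such as the unusual UCS-4 octet orders) are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: guesses the encoding family of an XML entity that has
// no BOM by looking at how "<?xm" would be laid out in the first four bytes.
const char* sniff_xml_without_bom(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    struct Pattern { const char* bytes; const char* family; };
    static const Pattern kPatterns[] = {
        {"\x00\x00\x00\x3C", "UCS-4 big-endian"},
        {"\x3C\x00\x00\x00", "UCS-4 little-endian"},
        {"\x00\x3C\x00\x3F", "UTF-16 big-endian"},
        {"\x3C\x00\x3F\x00", "UTF-16 little-endian"},
        {"\x3C\x3F\x78\x6D", "ASCII-compatible"},  // UTF-8, ISO 8859-x, etc.
        {"\x4C\x6F\xA7\x94", "EBCDIC"},
    };
    if (size >= 4) {
        for (const Pattern& p : kPatterns) {
            if (std::memcmp(data, p.bytes, 4) == 0) return p.family;
        }
    }
    // Nothing matched: an entity without a declaration is taken to be UTF-8.
    return "UTF-8 (default)";
}
```

For the ASCII-compatible and EBCDIC families, a real sniffer would go on to read the `encoding="..."` pseudo-attribute to pin down the exact encoding.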
Mike, you're the only person I know of who pronounces "bee-oh-em" as a word. Cakemakers and codebreakers have every right to say "bomb/bombe" but not we I18Ners.
Dude, you lead a sheltered life. In Unicode and related standards circles, in i18n conversations with developers at Adobe, Apple, IBM, Google, and Microsoft -- it is pronounced as a single word all the time....
Here's my approach:
    enum BOM {
        BOM_NONE,
        BOM_UTF8,
        BOM_UTF16LE,
        BOM_UTF16BE,
        BOM_UTF32BE,
        BOM_UTF32LE,
    };

    HRESULT BOMFromStream(Byte pbBytes[], UINT cbLength, BOM *pBOM) {
        if (NULL == pbBytes || NULL == pBOM) {
            return E_POINTER;
        }

        // need at least four bytes for UTF32 BOMs; these are tested first
        // because FF FE 00 00 also starts with the UTF16LE BOM (FF FE)
        if (cbLength >= 4) {
            if (
                0 == pbBytes[0] &&
                0 == pbBytes[1] &&
                0xFE == pbBytes[2] &&
                0xFF == pbBytes[3]
            ) {
                *pBOM = BOM_UTF32BE;
                return S_OK;
            }
            if (
                0xFF == pbBytes[0] &&
                0xFE == pbBytes[1] &&
                0 == pbBytes[2] &&
                0 == pbBytes[3]
            ) {
                *pBOM = BOM_UTF32LE;
                return S_OK;
            }
        }

        // need at least three bytes for UTF8 BOM
        if (
            cbLength >= 3 &&
            0xEF == pbBytes[0] &&
            0xBB == pbBytes[1] &&
            0xBF == pbBytes[2]
        ) {
            *pBOM = BOM_UTF8;
            return S_OK;
        }

        // need at least two bytes for UTF16 BOMs
        if (cbLength >= 2) {
            if (0xFE == pbBytes[0] && 0xFF == pbBytes[1]) {
                *pBOM = BOM_UTF16BE;
                return S_OK;
            }
            if (0xFF == pbBytes[0] && 0xFE == pbBytes[1]) {
                *pBOM = BOM_UTF16LE;
                return S_OK;
            }
        }

        // if we made it this far there's no recognizable BOM
        *pBOM = BOM_NONE;
        return S_OK;
    }
Possible future additional features: sanity check UTF16 byte stream length is even, UTF32 is divisible by 4; advance byte stream by length of BOM.
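Those follow-up features could look roughly like the sketch below. The helper names (`BOMLength`, `CodeUnitSize`, `LengthIsPlausible`, `SkipBOM`) are hypothetical, and the enum is repeated only so the snippet stands on its own. It assumes the whole stream length is known up front, which will not be true for every caller.

```cpp
#include <cassert>
#include <cstddef>

// Same values as the enum above, repeated so this sketch compiles alone.
enum BOM { BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE, BOM_UTF32BE, BOM_UTF32LE };

// Hypothetical helper: number of bytes occupied by the BOM itself.
std::size_t BOMLength(BOM bom) {
    switch (bom) {
        case BOM_UTF8:    return 3;
        case BOM_UTF16LE:
        case BOM_UTF16BE: return 2;
        case BOM_UTF32BE:
        case BOM_UTF32LE: return 4;
        default:          return 0;  // BOM_NONE
    }
}

// Hypothetical helper: size in bytes of one code unit for the detected form.
std::size_t CodeUnitSize(BOM bom) {
    switch (bom) {
        case BOM_UTF16LE:
        case BOM_UTF16BE: return 2;
        case BOM_UTF32BE:
        case BOM_UTF32LE: return 4;
        default:          return 1;  // UTF-8 or no BOM: any length is fine
    }
}

// Sanity check: the stream must hold a whole number of code units.
bool LengthIsPlausible(BOM bom, std::size_t cbLength) {
    return cbLength % CodeUnitSize(bom) == 0;
}

// Advance past the BOM; returns a pointer to the first byte of real content.
const unsigned char* SkipBOM(const unsigned char* pbBytes, std::size_t cbLength, BOM bom) {
    assert(pbBytes != nullptr && cbLength >= BOMLength(bom));
    return pbBytes + BOMLength(bom);
}
```

A failed length check is only a hint that the detection was wrong or the file is truncated; what to do about it is up to the caller.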
Michael Kaplan posted a small challenge on his blog to write some small code to find out the Byte Order Marker or BOM in the start of a file. So I kicked up the C# compiler and wrote this little bit of code: public enum Encoding { Unknown = 0, BomBigEnd
I've had a quick stab in C#. I've put a longer version that does Appendix F (also a port to C for byte-counting purposes) on my blog here:
http://www.ibbotson.co.uk/peteri/index.php?/archives/120-Finding-the-BOM.html
    public enum Encoding
    {
        Unknown = 0, BomBigEndianUcs4, BomUcs4, BomUtf8,
        BomUtf16, BomBigEndianUtf16
    }

    // The members below belong inside whatever class hosts the detector.

    // We use bit 3 as an end-of-data marker; set means end of pattern.
    // Bit 5 happens to have the same value as bit 3 in every BOM byte,
    // so the real bit 3 can be recovered from it.
    private static byte[] matchData =
    {
        0x00,0x00,0xF6,0xFF, // 0  - 00 00 FE FF Bom UCS4 Big endian
        0xF7,0xF6,0x00,0x08, // 4  - FF FE 00 00 Bom UCS4 Little endian
        0xE7,0xB3,0xBF,      // 8  - EF BB BF Bom UTF8
        0xF7,0xFE,           // 11 - FF FE Bom UTF16 Little endian
        0xF6,0xFF            // 13 - FE FF Bom UTF16 Big endian
    };

    public static Encoding DetectType(byte[] data)
    {
        int i = 0;
        int offset = 0;
        Encoding currentEncoding = Encoding.BomBigEndianUcs4;
        while (i < matchData.Length)
        {
            // Rebuild the real BOM byte: clear the marker, copy bit 5 into bit 3.
            byte compare = (byte)((matchData[i] & 0xf7) | ((matchData[i] & 0x20) >> 2));
            if ((offset >= data.Length) || (data[offset] != compare))
            {
                // Mismatch: skip to the end of this pattern and try the next one.
                offset = 0;
                while ((matchData[i] & 0x08) == 0) i++;
                currentEncoding++;
            }
            else
            {
                if ((matchData[i] & 0x08) == 0x08) return currentEncoding;
                offset++;
            }
            i++;
        }
        return Encoding.Unknown;
    }
Ok, everyone seems to be just testing for the bytes in order. I realise that I'm posting late, so may have to give up the points race, but here's my version. Auto-calculates the BOM based on endianness and wchar size and compares the char* against that. Except for UTF-8. I gave up on that (it's midnight here & I'm going to bed now).
    #include <cstring>
    #include <string>

    // Same-endian: feff
    // Different-endian: fffe
    enum endianness { be = -1, le = 1 };

    bool compare_bom_string(int sizeof_wchar, endianness end, const char* data)
    {
        std::string bom(sizeof_wchar, '\0');
        // FF sits in the last byte for big-endian, the first for little-endian;
        // FE is its neighbour in the direction given by 'end'.
        int pos = (end == be ? sizeof_wchar - 1 : 0);
        bom[pos] = '\xFF';
        bom[pos + end] = '\xFE';
        return !std::memcmp(data, bom.data(), sizeof_wchar);
    }

    struct bom
    {
        bom(int i, endianness e) : sizeof_wchar(i), end(e) { }
        int sizeof_wchar;
        endianness end;
    };

    int wchar_sizes[] = { 4, 2 };
    endianness ends[] = { be, le };

    struct ex
    {
        ex(const char* m) : msg(m) { }
        const char* what() { return msg; }
        const char* msg;
    };

    bom sniff(const char* data)
    {
        // element counts, not byte counts
        const int n_sizes = sizeof wchar_sizes / sizeof wchar_sizes[0];
        const int n_ends = sizeof ends / sizeof ends[0];
        for (int i = 0; i < n_sizes; ++i)
            for (int j = 0; j < n_ends; ++j)
                if (compare_bom_string(wchar_sizes[i], ends[j], data))
                    return bom(wchar_sizes[i], ends[j]);

        // Just got lazy
        const char* utf_8 = "\xEF\xBB\xBF";
        if (!std::memcmp(data, utf_8, 3))
            return bom(1, le);

        // Just got lazier
        throw ex("Whoops");
    }
    import Data.List (find, isPrefixOf)
    import Data.Maybe (fromJust)

    -- The final ("", "NO BOM") entry matches anything, so fromJust cannot fail.
    detectBOM s = snd . fromJust $ find ((flip isPrefixOf) s . fst) byteOrderMarks
      where
        byteOrderMarks =
          [ ("\xef\xbb\xbf", "UTF-8")
          , ("\x00\x00\xfe\xff", "UTF-32BE")
          , ("\xff\xfe\x00\x00", "UTF-32LE")
          , ("\xfe\xff", "UTF-16BE")
          , ("\xff\xfe", "UTF-16LE")
          , ("\x2b\x2f\x76\x38", "UTF-7")
          , ("\x2b\x2f\x76\x39", "UTF-7")
          , ("\x2b\x2f\x76\x2b", "UTF-7")
          , ("\x2b\x2f\x76\x2f", "UTF-7")
          , ("\xf7\x64\x4c", "UTF-1")
          , ("\xdd\x73\x66\x73", "UTF-EBCDIC")
          , ("\x0e\xfe\xff", "SCSU")
          , ("\xfb\xee\x28", "BOCU-1")
          , ("\x84\x31\x95\x33", "GB18030")
          , ("", "NO BOM")
          ]