Someone please detect if there's a BOM before the plane takes off!

Sorting it all Out
Michael Kaplan's random stuff of dubious value
Be sure to read the disclaimer here first!

One can really never get enough of puns about the BOM (Byte Order Mark) and TSA.

And when I say one, I mean I. :-)

Just think back to blogs like "Don't sneak a BOM in on someone who promises to ignore free space" or "Everyone seems averse to the BOM these days; Should we blame TSA? :-)" or "How to get yourself imprisoned [by/for talking about Unicode]".

See what I mean?

I was reminded of this when Pritam asked:

Is there any tool or code available to verify Byte Order Mark signature in XML files?

Of course sniffing out a few bytes is easy enough. Abhinaba provided the full chart of valid BOM values:

Bytes          Encoding Form
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8

Easy, right?

Okay, anyone want to take a stab at writing the minimal-code BOM detector?

Think of it as a way to play your part in airport security!

Points awarded for clearest, or for most concise, or for briefest, or for most clever, or for the sake of maintainability, most smart.

If you can write something able to handle other, non-standard byte orderings of data, then you probably went to Cal Tech! :-)
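
For anyone who wants a baseline to beat, here is one hedged sketch in C++ that treats the chart above as a lookup table. The names Bom, BomEntry, kBomTable, and detect_bom are made up for this example and come from no library. The one subtlety it illustrates is ordering: FF FE 00 00 (UTF-32, little-endian) begins with FF FE (UTF-16, little-endian), so the longer signatures have to be tried first.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical names for this sketch; none of them come from the post.
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

struct BomEntry {
    Bom kind;
    std::size_t length;       // number of signature bytes
    unsigned char bytes[4];   // the signature, zero-padded to four bytes
};

// The chart as data. The 4-byte entries come first because FF FE 00 00
// (UTF-32LE) starts with FF FE (UTF-16LE).
const BomEntry kBomTable[] = {
    {Bom::Utf32BE, 4, {0x00, 0x00, 0xFE, 0xFF}},
    {Bom::Utf32LE, 4, {0xFF, 0xFE, 0x00, 0x00}},
    {Bom::Utf8,    3, {0xEF, 0xBB, 0xBF}},
    {Bom::Utf16BE, 2, {0xFE, 0xFF}},
    {Bom::Utf16LE, 2, {0xFF, 0xFE}},
};

// Returns the first table entry that is a prefix of data, or Bom::None.
Bom detect_bom(const unsigned char* data, std::size_t size) {
    for (const BomEntry& entry : kBomTable) {
        if (size >= entry.length &&
            std::memcmp(data, entry.bytes, entry.length) == 0) {
            return entry.kind;
        }
    }
    return Bom::None;
}
```

Calling detect_bom on the bytes EF BB BF 3C yields Bom::Utf8. Note that FF FE 00 00 is inherently ambiguous: it could also be a UTF-16 little-endian BOM followed by U+0000, and this sketch simply prefers the UTF-32 reading.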


This post brought to you by U+FEFF (aka ZERO WIDTH NO-BREAK SPACE)

Blog - Comment List
  • wait...which plane? BMP? SMP? SIP? TIP? SSP?

    Sorry, I have nothing else useful to contribute here, though I'm a little surprised this problem isn't solved already...?!?

  • See http://recycledknowledge.blogspot.com/2005/07/hello-i-am-xml-encoding-sniffer.html for a formal English description of what you have to do to play in the Appendix F leagues.

  • Mike, you're the only person I know of who pronounces "bee-oh-em" as a word. Cakemakers and codebreakers have every right to say "bomb/bombe" but not we I18Ners.

  • Dude, you lead a sheltered life. In Unicode and related standards circles, in i18n conversations with developers at Adobe, Apple, IBM, Google, and Microsoft -- it is pronounced as a single word all the time....

  • Here's my approach:

    enum BOM {
        BOM_NONE,
        BOM_UTF8,
        BOM_UTF16LE,
        BOM_UTF16BE,
        BOM_UTF32BE,
        BOM_UTF32LE,
    };

    HRESULT BOMFromStream(BYTE pbBytes[], UINT cbLength, BOM *pBOM) {
        if (NULL == pbBytes || NULL == pBOM) {
            return E_POINTER;
        }

        // need at least four bytes for UTF32 BOMs; test these before the UTF16
        // BOMs, because FF FE 00 00 (UTF32LE) also starts with FF FE (UTF16LE)
        if (cbLength >= 4) {
            if (
                0 == pbBytes[0] &&
                0 == pbBytes[1] &&
                0xFE == pbBytes[2] &&
                0xFF == pbBytes[3]
            ) {
                *pBOM = BOM_UTF32BE;
                return S_OK;
            }
            if (
                0xFF == pbBytes[0] &&
                0xFE == pbBytes[1] &&
                0 == pbBytes[2] &&
                0 == pbBytes[3]
            ) {
                *pBOM = BOM_UTF32LE;
                return S_OK;
            }
        }

        // need at least three bytes for UTF8 BOM
        if (
            cbLength >= 3 &&
            0xEF == pbBytes[0] &&
            0xBB == pbBytes[1] &&
            0xBF == pbBytes[2]
        ) {
            *pBOM = BOM_UTF8;
            return S_OK;
        }

        // need at least two bytes for UTF16 BOMs
        if (cbLength >= 2) {
            if (0xFE == pbBytes[0] && 0xFF == pbBytes[1]) {
                *pBOM = BOM_UTF16BE;
                return S_OK;
            }
            if (0xFF == pbBytes[0] && 0xFE == pbBytes[1]) {
                *pBOM = BOM_UTF16LE;
                return S_OK;
            }
        }

        // if we made it this far there's no recognizable BOM
        *pBOM = BOM_NONE;
        return S_OK;
    }

    Possible future additional features: sanity-check that a UTF16 byte stream has an even length and that a UTF32 byte stream has a length divisible by 4; advance the byte stream by the length of the BOM.
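
The follow-up features listed above could look something like this in C++. This is only a sketch under assumptions: BomKind, bom_length, code_unit_size, length_is_plausible, and skip_bom are hypothetical names and are not part of the code in the comment.

```cpp
#include <cstddef>

// Hypothetical enum for this sketch, mirroring the encodings in the chart.
enum class BomKind { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Length in bytes of each signature, i.e. how far to advance the stream.
std::size_t bom_length(BomKind kind) {
    switch (kind) {
        case BomKind::Utf8:    return 3;
        case BomKind::Utf16LE:
        case BomKind::Utf16BE: return 2;
        case BomKind::Utf32LE:
        case BomKind::Utf32BE: return 4;
        case BomKind::None:    break;
    }
    return 0;
}

// Size in bytes of one code unit in the detected encoding form.
std::size_t code_unit_size(BomKind kind) {
    switch (kind) {
        case BomKind::Utf16LE:
        case BomKind::Utf16BE: return 2;
        case BomKind::Utf32LE:
        case BomKind::Utf32BE: return 4;
        default:               return 1;   // UTF-8, or no BOM at all
    }
}

// True when total_bytes (BOM included) is a whole number of code units:
// even for UTF-16, divisible by 4 for UTF-32.
bool length_is_plausible(BomKind kind, std::size_t total_bytes) {
    return total_bytes >= bom_length(kind) &&
           total_bytes % code_unit_size(kind) == 0;
}

// Pointer to the first byte after the BOM (data itself when there is none).
const unsigned char* skip_bom(const unsigned char* data, BomKind kind) {
    return data + bom_length(kind);
}
```

A failed length check does not prove the BOM was misdetected; it may just mean the stream was truncated, so a caller would probably treat it as a warning rather than an error.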

  • Michael Kaplan posted a small challenge on his blog to write some small code to find out the Byte Order Marker or BOM in the start of a file. So I kicked up the C# compiler and wrote this little bit of code: public enum Encoding{ Unknown = 0, BomBigEnd

  • I've had a quick stab in C#. I've put a longer version that does Appendix F (and also a port to C, for byte-counting purposes) on my blog here:

    http://www.ibbotson.co.uk/peteri/index.php?/archives/120-Finding-the-BOM.html

    public enum Encoding
    {
       Unknown = 0, BomBigEndianUcs4, BomUcs4, BomUtf8,
       BomUtf16, BomBigEndianUtf16
    }

    // Bit 3 of each table byte is used as an end-of-pattern marker (set means end).
    // Nothing is lost, because in every BOM byte bit 3 happens to have the
    // same value as bit 5, so the real bit 3 can be restored from bit 5.
    private static byte[] matchData =
       {
           0x00,0x00,0xF6,0xFF,    //  0 - 00 00 FE FF Bom UCS4 Big endian
           0xF7,0xF6,0x00,0x08,    //  4 - FF FE 00 00 Bom UCS4 Little endian
           0xE7,0xB3,0xBF,         //  8 - EF BB BF    Bom UTF8
           0xF7,0xFE,              // 11 - FF FE       Bom UTF16 Little endian
           0xF6,0xFF               // 13 - FE FF       Bom UTF16 Big endian
       };

    public static Encoding DetectType(byte[] data)
    {
       int i = 0;
       int offset = 0;
       Encoding currentEncoding = Encoding.BomBigEndianUcs4;
       while (i < matchData.Length)
       {
           // rebuild the real BOM byte: clear the marker bit, then copy bit 5 into bit 3
           byte compare = (byte)((matchData[i] & 0xf7) | ((matchData[i] & 0x20) >> 2));
           if ((offset >= data.Length) || (data[offset] != compare))
           {
               // mismatch: rewind the data and skip to the end of this pattern
               offset = 0;
               while ((matchData[i] & 0x08) == 0) i++;
               currentEncoding++;
           }
           else
           {
               if ((matchData[i] & 0x08) == 0x08) return currentEncoding;
               offset++;
           }
           i++;
       }
       // no pattern matched
       return Encoding.Unknown;
    }

  • Ok, everyone seems to be just testing for the bytes in order. I realise that I'm posting late, so may have to give up the points race, but here's my version. It auto-calculates the BOM based on endianness and wchar size and compares the char* against that. Except for UTF-8. I gave up on that (it's midnight here & I'm going to bed now).

    #include <string>
    #include <cstring>

    // Same-endian: feff
    // Different-endian: fffe
    enum endianness {
      be = -1,
      le = 1
    };

    bool compare_bom_string(int sizeof_wchar, endianness end, const char* data)
    {
      std::string bom(sizeof_wchar, 0);
      // big-endian puts FF in the last byte of the code unit (index 1 or 3)
      int pos = (end==be ? sizeof_wchar-1 : 0);

      bom[pos] = '\xFF';
      bom[pos+end] = '\xFE';

      return !memcmp((void*)data, (void*)bom.c_str(), sizeof_wchar);
    }

    struct bom {
      bom(int i, endianness e) : sizeof_wchar(i), end(e) { }
      int sizeof_wchar;
      endianness end;
    };

    int wchar_sizes[] = { 4, 2 };
    endianness ends[] = { be, le };

    struct ex {
      ex(const char*m) : msg(m) { }
      const char * what() { return msg; }

      const char *msg;

    };

    bom sniff(const char* data)
    {
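      // sizeof yields a size in bytes, so divide by the element size to get a count
      for (std::size_t i = 0; i < sizeof wchar_sizes / sizeof wchar_sizes[0]; ++i)
         for (std::size_t j = 0; j < sizeof ends / sizeof ends[0]; ++j)
            if (compare_bom_string(wchar_sizes[i], ends[j], data)) return bom(wchar_sizes[i], ends[j]);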

      // Just got lazy
      const char* utf_8 = "\xEF\xBB\xBF";

      if (!memcmp((void*)data, (void*)utf_8, 3)) return bom(1,le);

      // Just got lazier
      throw ex("Whoops");
    }
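
The same idea, deriving the signature instead of hard-coding it, can be stretched to cover UTF-8 too, because every BOM in the chart is just U+FEFF serialized in the encoding form in question. A minimal C++ sketch, with hypothetical names (ByteOrder, bom_for, utf8_bom) and assuming a code unit size of 2 or 4 bytes:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical names for this sketch.
enum class ByteOrder { Big, Little };

// Serialize U+FEFF as a single code unit of unit_size bytes (assumed 2 or 4).
std::string bom_for(std::size_t unit_size, ByteOrder order) {
    const std::uint32_t kBom = 0xFEFF;
    std::string out(unit_size, '\0');
    for (std::size_t i = 0; i < unit_size; ++i) {
        // i counts bytes starting from the least significant one
        const std::uint32_t shifted = (i < 4) ? (kBom >> (8 * i)) : 0;
        const std::size_t pos =
            (order == ByteOrder::Little) ? i : unit_size - 1 - i;
        out[pos] = static_cast<char>(shifted & 0xFF);
    }
    return out;
}

// UTF-8 has no byte order: U+FEFF falls in the three-byte range
// (U+0800..U+FFFF), laid out as 1110xxxx 10xxxxxx 10xxxxxx.
std::string utf8_bom() {
    const std::uint32_t kBom = 0xFEFF;
    std::string out;
    out += static_cast<char>(0xE0 | (kBom >> 12));
    out += static_cast<char>(0x80 | ((kBom >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (kBom & 0x3F));
    return out;
}
```

Comparing the start of a buffer against bom_for(4, ...) before bom_for(2, ...) keeps the UTF-32 little-endian signature from being mistaken for the UTF-16 one.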

  • import Data.Maybe (fromJust)
    import Data.List (find, isPrefixOf)

    detectBOM s = snd . fromJust $ find ((flip isPrefixOf) s . fst) byteOrderMarks
       where byteOrderMarks = [("\xef\xbb\xbf","UTF-8"),
                               ("\x00\x00\xfe\xff","UTF-32BE"),
                               ("\xff\xfe\x00\x00","UTF-32LE"),
                               ("\xfe\xff","UTF-16BE"),
                               ("\xff\xfe","UTF-16LE"),
                               ("\x2b\x2f\x76\x38","UTF-7"),
                               ("\x2b\x2f\x76\x39","UTF-7"),
                               ("\x2b\x2f\x76\x2b","UTF-7"),
                               ("\x2b\x2f\x76\x2f","UTF-7"),
                               ("\xf7\x64\x4c","UTF-1"),
                               ("\xdd\x73\x66\x73","UTF-EBCDIC"),
                               ("\x0e\xfe\xff","SCSU"),
                               ("\xfb\xee\x28","BOCU-1"),
                               ("\x84\x31\x95\x33","GB18030"),
                               ("","NO BOM")]
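
Two of the comments above mention Appendix F of the XML specification, which also covers documents that carry no BOM at all: since an XML declaration has to begin with "<?xml", the first four bytes still hint at the encoding family. The C++ sketch below covers only that no-BOM case. The byte patterns follow the table in Appendix F of XML 1.0 (the unusual UCS-4 octet orders are left out); the type and function names are hypothetical, and a real sniffer would go on to read the encoding declaration rather than stop at this guess.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical result type; the labels paraphrase Appendix F of XML 1.0.
enum class XmlGuess {
    Ucs4BigEndian, Ucs4LittleEndian,
    Utf16BigEndian, Utf16LittleEndian,
    AsciiCompatible,   // UTF-8, ISO 8859-x, and other ASCII-compatible encodings
    Ebcdic,
    Unknown            // no declaration seen
};

struct XmlPattern {
    XmlGuess guess;
    unsigned char bytes[4];
};

// What the first four bytes look like when the document starts with "<?xm"
// (or, for the 32-bit forms, with just "<") and there is no BOM.
const XmlPattern kXmlPatterns[] = {
    {XmlGuess::Ucs4BigEndian,     {0x00, 0x00, 0x00, 0x3C}},
    {XmlGuess::Ucs4LittleEndian,  {0x3C, 0x00, 0x00, 0x00}},
    {XmlGuess::Utf16BigEndian,    {0x00, 0x3C, 0x00, 0x3F}},
    {XmlGuess::Utf16LittleEndian, {0x3C, 0x00, 0x3F, 0x00}},
    {XmlGuess::AsciiCompatible,   {0x3C, 0x3F, 0x78, 0x6D}},
    {XmlGuess::Ebcdic,            {0x4C, 0x6F, 0xA7, 0x94}},
};

// Guess the encoding family from the first four bytes of a BOM-less document.
XmlGuess sniff_xml_without_bom(const unsigned char* data, std::size_t size) {
    if (size < 4) return XmlGuess::Unknown;
    for (const XmlPattern& pattern : kXmlPatterns) {
        if (std::memcmp(data, pattern.bytes, 4) == 0) return pattern.guess;
    }
    return XmlGuess::Unknown;
}
```

The patterns do not overlap, so their order in the table does not matter here, unlike in the BOM case.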
