andersch.dev

[ cpp ]
<2024-05-31>

Embedding Binaries in C(++)

Five ways to bake any file as a buffer into your executable

When programming, it can be desirable to embed the data of arbitrary binary files directly in the final executable of your application.

Here is a list of some of the ways you can achieve this (comfortably) in C and C++.

1. Convert to C Code

By converting a binary file to a properly formatted char array, we can simply include the resulting code in our source. Programs that can do this include xxd, convert and bin2h.

Example

In the case of xxd, running xxd -i file.ext will output C code:

unsigned char file_ext[] = {
  0x58, 0x61, 0x59, 0x62, 0x58, 0x37, 0x70, 0x78, 0x32, 0x4e, 0x35, 0x70,
  0x41, 0x59, 0x56, 0x39, 0x0a, /* ... */
};
unsigned int file_ext_len = 1234;
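To make the conversion step less of a black box, here is a minimal sketch of the core of what such a tool does: turning raw bytes into a comma-separated list of hex literals. The helper name bytes_to_c_list and its signature are made up for illustration and are not taken from xxd or any other tool.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of xxd): formats `len` bytes as a
   comma-separated list of hex literals ("0x58, 0x61, ...") into `out`.
   Returns the number of characters written, excluding the terminating NUL,
   or 0 if `out` is too small (or if there was nothing to format). */
static size_t bytes_to_c_list(const unsigned char *data, size_t len,
                              char *out, size_t cap)
{
    size_t used = 0;
    if (cap == 0)
        return 0;
    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        /* every byte except the first is preceded by ", " */
        int n = snprintf(out + used, cap - used, "%s0x%02x",
                         i ? ", " : "", (unsigned)data[i]);
        if (n < 0 || (size_t)n >= cap - used)
            return 0; /* output would have been truncated */
        used += (size_t)n;
    }
    return used;
}
```

A real tool would additionally read the input file, wrap the list in an array declaration and emit a length variable, as in the xxd output above.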

If you want a folder of binary files to always be ready for embedding, you could include the following in your build script:

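# generate the body of a C char array for every file in the "res" folder
mkdir -p inc
for path in res/*
do
    name=$(basename "${path}")
    # drop the first line (array declaration) and the last two lines
    # (the length variable and the closing "};"), keeping only the bytes
    xxd -i "${path}" | sed -e 1d -e '$d' | sed -e '$d' > "inc/${name}"
done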

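The sed operations strip the array declaration, the closing brace and the length variable, leaving only the byte list. Keeping the generated file under the same name as the binary (and putting the inc folder on the include path) lets us #include it in the same way I include GLSL shaders in source code: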

unsigned char buffer[] = {
    #include "texture.png"
};

Pros & Cons

  • ✅ Should work everywhere
  • ✅ Binaries can be included with their original name
  • ✅ Buffer as a real array (i.e. sizeof(buffer) works)
  • ❌ Adds a build dependency
  • ❌ Adds a precompilation step
  • ❌ Generated files are larger in size than the binaries
  • ❌ Slows down build times
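The "real array" advantage is easiest to see in code. The following sketch assumes the embedded file is a PNG; so that it compiles on its own, the initializer contains only the 8-byte PNG signature as a stand-in for the generated byte list. The helper looks_like_png is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* In a real build the initializer would be: #include "texture.png"
   (the generated byte list). The bytes below are only the 8-byte PNG
   signature, used as a stand-in. */
static const unsigned char buffer[] = {
    0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a
};

/* Because buffer is a real array, its size is a compile-time constant
   and can be checked before the program ever runs (C11). */
_Static_assert(sizeof(buffer) >= 8, "embedded file too small to be a PNG");

/* Hypothetical helper: returns 1 if the data starts with the PNG signature. */
static int looks_like_png(const unsigned char *data, size_t len)
{
    static const unsigned char sig[8] = {
        0x89, 'P', 'N', 'G', 0x0d, 0x0a, 0x1a, 0x0a
    };
    return len >= sizeof(sig) && memcmp(data, sig, sizeof(sig)) == 0;
}
```

Calling looks_like_png(buffer, sizeof(buffer)) needs no separate length variable, since sizeof already knows the size of the array.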

Improving build times with strliteral

Parsing hex-formatted char arrays generated by a program such as xxd can be slow for large binaries. It turns out that parsing string literals containing escaped byte values is much faster, which is the approach taken by the tool strliteral.

Since it's contained in a single C file, you can also trivially compile it as part of your build system, which eliminates the disadvantage of an external build dependency.
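To illustrate the difference between the two encodings, here is a small sketch that stores the same five bytes once as a hex array and once as a string literal with escaped byte values. This is hand-written for illustration and does not reproduce strliteral's exact output format. One detail to watch out for is that a string literal appends a terminating NUL byte.

```c
#include <stddef.h>
#include <string.h>

/* The same five bytes ("XaYb\n"), once as a hex-formatted array ... */
static const unsigned char as_array[] = { 0x58, 0x61, 0x59, 0x62, 0x0a };

/* ... and once as a string literal with escaped byte values. */
static const unsigned char as_string[] = "\x58\x61\x59\x62\x0a";

/* The string literal carries an extra terminating NUL, so the payload
   length is sizeof - 1. */
enum { as_string_len = sizeof(as_string) - 1 };

/* Returns 1 if both encodings hold the same payload. */
static int same_payload(void)
{
    return (size_t)as_string_len == sizeof(as_array) &&
           memcmp(as_array, as_string, sizeof(as_array)) == 0;
}
```

This is why a generator that emits string literals typically also emits a separate length variable instead of relying on sizeof.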

2. Use the linker

Instead of getting the binary into a compilable format, we can go one step further and "compile" it directly. The output is an object file with predefined symbols that can be linked against.

Programs (or linkers) that can do this are ld and objcopy, or bin2coff and bin2obj on Windows.

Example

The basic usage with ld:

ld -r -b binary data.bin -o data.o

clang -o main main.c data.o

This can be generalized in your build script like this:

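OBJECT_FILES=()
for path in res/*
do
    name=$(basename "${path}")
    obj="${name%.*}.o"
    # run inside "res" so the symbols are named after the bare filename
    (cd res && ld -r -b binary "${name}" -o "../${obj}")
    # OR
    # (cd res && objcopy --input-target binary --output-target elf64-x86-64 "${name}" "../${obj}")
    OBJECT_FILES+=("${obj}")
done

clang -o main main.c "${OBJECT_FILES[@]}"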

This will generate symbols in the .o file that can be accessed like this:

extern const unsigned char _binary_file_ext_start[];
extern const unsigned char _binary_file_ext_end[];
extern const unsigned char _binary_file_ext_size; // NOTE: access with (size_t)&_binary_file_ext_size

Neither ld nor objcopy offers a convenient way to choose these symbol names when generating the object files (objcopy's --redefine-sym can rename them, but needs one flag per symbol), so to make usage in your code a bit more comfortable, you can define some helper macros:

#define BINARY_INCLUDE(file, ext)                               \
  extern const unsigned char _binary_##file##_##ext##_start[];  \
  extern const unsigned char _binary_##file##_##ext##_end[]

#define BINARY_BUFFER(file, ext)        _binary_##file##_##ext##_start
#define BINARY_BUFFER_SIZE(file, ext)   (_binary_##file##_##ext##_end - _binary_##file##_##ext##_start)

Which makes usage look like this:

BINARY_INCLUDE(data, bin); // filename & ext separated by a comma without quotes

int main()
{
    const unsigned char* my_buffer      = BINARY_BUFFER(data, bin); // const: the data is read-only
    unsigned int         my_buffer_size = BINARY_BUFFER_SIZE(data, bin);
}
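If you prefer to pass the embedded data around as a single value, the start/end pair can be wrapped in a small struct. This is only a sketch: the type embedded_blob and the function blob_from_range are hypothetical names, not something ld or objcopy provides.

```c
#include <stddef.h>

/* Hypothetical convenience type: a pointer/size pair for one embedded file. */
typedef struct {
    const unsigned char *data;
    size_t               size;
} embedded_blob;

/* Builds a blob from a start/end pointer pair, such as the _start and _end
   symbols generated for an embedded file. */
static embedded_blob blob_from_range(const unsigned char *start,
                                     const unsigned char *end)
{
    embedded_blob blob = { start, (size_t)(end - start) };
    return blob;
}
```

With the symbols from above, usage would look like blob_from_range(_binary_data_bin_start, _binary_data_bin_end). Computing the size as end minus start also avoids having to take the address of the _size symbol.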

Pros & Cons

  • ✅ No added build dependency (since we already depend on having a linker)
  • ✅ Faster build times than first option
  • ✅ Can specify different types (not just char)
  • ✅ Smaller filesizes compared to first option
  • ✅ Can be cross-platform…
  • ❌ …but may require a different tool for each platform
  • ❌ Adds a precompilation step (and arguably more complex than first option)
  • ❌ Memory always const (i.e. needs a memcpy to mutate it)
  • ❌ No real array, just a pointer and size (i.e. sizeof(buffer) doesn't work)
  • ❌ No access to extern data or size at compile-time (only after linking)
  • ❌ Arguably worse ergonomics: BINARY_INCLUDE(file, ext) vs. #include "file.ext"
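Regarding the const memory: if you need to modify the embedded data at runtime, one way is to copy it into a heap buffer first. The following is a minimal sketch of that; copy_embedded is a hypothetical helper, and the caller is assumed to free the result.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated, writable copy of the
   read-only range [start, end). The caller owns the copy and must free() it.
   Returns NULL if the allocation fails. */
static unsigned char *copy_embedded(const unsigned char *start,
                                    const unsigned char *end)
{
    size_t size = (size_t)(end - start);
    unsigned char *copy = malloc(size ? size : 1); /* avoid malloc(0) */
    if (copy != NULL)
        memcpy(copy, start, size);
    return copy;
}
```

The original embedded bytes stay untouched, so the copy can be thrown away and recreated at any time.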

3. Inline Assembly using .incbin

.incbin is a GNU assembler directive that can be used in asm blocks to perform what is essentially the linking step from before inside the application code:

#define BINARY_ASM_INCLUDE(filename, buffername)     \
    __asm__(".pushsection .rodata\n"                 \
         ".global " #buffername "\n"                 \
         ".type   " #buffername ", @object\n"        \
         ".align  4\n"                               \
     #buffername":\n"                                \
         ".incbin " #filename "\n"                   \
     #buffername"_end:\n"                            \
         ".global "#buffername"_size\n"              \
         ".type   "#buffername"_size, @object\n"     \
         ".align  4\n"                               \
     #buffername"_size:\n"                           \
         ".int   "#buffername"_end - "#buffername"\n"\
         ".popsection\n"                             \
    );                                               \
    extern const unsigned char buffername [];        \
    extern const unsigned char buffername##_end[];   \
    extern const int buffername##_size

Usage code becomes:

BINARY_ASM_INCLUDE("image.png", image_buf);

int main()
{
    int width, height, nrChannels;
    unsigned char* image_data = stbi_load_from_memory(image_buf, image_buf_size,
                                                      &width, &height, &nrChannels, 0);
}

Pros & Cons

Same as the linker option, except…

  • ✅ Choose names of buffer and size
  • ✅ Better ergonomics: Use buffer and size directly
  • ✅ No precompilation step
  • ❌ Not cross-platform (GCC & Clang support .incbin, MSVC does not)

4. Use a library

The library incbin uses the previous approach by default and aims to be cross-platform. For MSVC, it falls back to the first option by providing a tool that needs to be compiled and included in your build step (see footnote 2).

The usage code looks basically like this:

#define INCBIN_PREFIX  // remove prefix from variables
#define INCBIN_STYLE INCBIN_STYLE_SNAKE // data instead of Data
#include "incbin.h"

INCBIN(song, "music.mp3"); // defines song_data, song_end and song_size

Pros & Cons

Same as the .incbin option, except…

  • ✅ Can be cross-platform
  • ✅ No precompilation step…
  • ❌ …except for MSVC
  • ❌ Adds a dependency

5. Using #embed

A new #embed directive has been introduced to C23 and C++26.

It's still too early for me to really use this, but usage-wise, it is supposed to be similar to the first approach:

static const unsigned char embedded_texture[] = {
    #embed "texture.png"
};

This would be the best and fastest option, since it does not introduce a new preprocessing step and skips the code generation and parsing step. However, implementation of #embed in current compilers is not yet widespread, so it may not be an option for you.
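Until support is more common, one option is to guard the directive with the __has_embed preprocessor check that C23 introduces alongside #embed. The sketch below is an assumption about how one might structure this, not something I use in production: the macro CAN_EMBED_TEXTURE and the fallback bytes are made up, and the fallback is used whenever the compiler lacks #embed or texture.png cannot be found.

```c
#include <stddef.h>

/* __has_embed is part of C23. On older compilers the macro is undefined,
   so the inner #if is skipped without ever being evaluated. */
#if defined(__has_embed)
#  if __has_embed("texture.png") == __STDC_EMBED_FOUND__
#    define CAN_EMBED_TEXTURE 1 /* hypothetical macro name */
#  endif
#endif

static const unsigned char embedded_texture[] = {
#ifdef CAN_EMBED_TEXTURE
    #embed "texture.png"
#else
    /* stand-in bytes (the PNG signature) so the code still builds */
    0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a
#endif
};

/* The array is a real array, so sizeof works just like in the first option. */
static size_t embedded_texture_size(void)
{
    return sizeof(embedded_texture);
}
```

In a real project the fallback branch would more likely include a file generated with the first option instead of hard-coded bytes.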

Pros & Cons

  • ✅ Fastest & easiest way
  • ❌ Requires modern compiler support

Footnotes

2

Apparently this is due to the fact that the MSVC compiler doesn't support an .incbin equivalent in its inline assembly.

