Note that byte arrays can be stored in many places, for example in files or in binary columns of a database.
The type of a file can be discovered by inspecting its file header. The file header is the first bytes of the byte array, usually the first tens or hundreds of bytes. Some file types have multiple possible file headers. The same approach gives a best-effort way to identify the byte contents of a column in a database.
Let's use PowerShell to inspect a file on disk, a sample JPEG file (.jpg). Let's run the following little script:
format-hex .\Stavkyrkje_Røldal.jpg | Select-Object -First 16
The first few bytes are FF D8 FF, which is the file header of a JPEG file.
I have added a sample GitHub repo with utility code that checks a file against well-known file types to find its file extension.
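The same check can be done from C#. The following is a minimal sketch, not code from the repo: the helper names ReadFirstBytes and StartsWithBytes are made up for illustration, and a small temporary file stands in for a real JPEG on disk.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper: reads up to 'count' bytes from the start of a file.
byte[] ReadFirstBytes(string path, int count)
{
    using var stream = File.OpenRead(path);
    var buffer = new byte[count];
    int read = stream.Read(buffer, 0, count);
    return buffer.Take(read).ToArray(); // shorter files yield fewer bytes
}

// Hypothetical helper: true when 'data' begins with the bytes in 'signature'.
bool StartsWithBytes(byte[] data, byte[] signature) =>
    data.Length >= signature.Length &&
    data.Take(signature.Length).SequenceEqual(signature);

byte[] jpegSignature = { 0xFF, 0xD8, 0xFF };

// A tiny stand-in file that only contains the first bytes of a JPEG.
string samplePath = Path.GetTempFileName();
File.WriteAllBytes(samplePath, new byte[] { 0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10 });

byte[] firstBytes = ReadFirstBytes(samplePath, 16);
Console.WriteLine(BitConverter.ToString(firstBytes));            // FF-D8-FF-E0-00-10
Console.WriteLine(StartsWithBytes(firstBytes, jpegSignature));   // True

File.Delete(samplePath);
```

A prefix match like this only says that the bytes look like a JPEG. It does not prove that the rest of the file is a valid image.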
https://github.com/toreaurstadboss/FileHeaderUtil
The following screenshot shows the application in use. By looking at the file header and file trailer, it found that a byte array appears to be a PDF file. A good match was found:
In fact, a very good match, since both the header and the trailer fully agree. Note that the 0A bytes are just padding bytes at the end of the file and are ignored by this util. See the method NormalizeHex presented further below.
Using Gary Kessler's assembled lists of known file headers and trailers for well-known file types
The util class below contains the helper methods that inspect a byte array and evaluate its file header and file trailer against a list of known headers and trailers. It builds on a compilation of known file headers and file trailers, so-called "magic numbers", assembled by Gary Kessler over the years. In all, more than 600 known file types are checked to classify the matching file extension. Note that in some cases several signatures match the given byte array. The matches are sorted by number of matching bytes. The assembled list is very helpful. Thanks, Gary!
Using the file header, and possibly also the file trailer (the last bytes of a byte array), we can classify the file type of the byte array. Recognizing the file type also implies the file extension.
Of course, if you allow byte arrays to be uploaded from a public site, for example, it is still possible to inject malicious bytes. Even so, being able to detect the kind of file is useful for enforcing security policies, for deciding whether the bytes should be handled by an external application, and for telling the end user what kind of file is found at the path given to this util.
The curated list of file headers is based on the list of signatures gathered by Gary Kessler and published on his website. No license is stated for that file, so it is treated here as public, since it is publicly available information on his website that is not marked with a license:
https://www.garykessler.net/library/file_sigs.html
This list contains about 650 file types and should cover most of the well-known formats, including formats that are rarely used anymore. If you want to augment the list, check other sources such as Wikipedia for information about the file header and/or file trailer (the so-called "magic number") of a given file extension.
The curated list was updated 3rd June 2023 and contains most well-known file types.
The program uses the file signatures (in JSON format) to identify the file type of a byte array. Usually, this is judged by looking at the first few bytes of the file (the so-called "magic numbers"). Sometimes, the file signature also includes bytes from the end of the file (the "trailer").
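To make the idea concrete before looking at the real util class, here is a small self-contained sketch. It is not the repo's code: the three-entry in-memory table, the Classify function and the tuple shape are assumptions for illustration, whereas the actual util loads its signatures from the JSON file and also trims padding bytes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory signature table (the real util reads its list from JSON).
// An empty trailer means "no trailer defined for this signature".
var signatures = new (string Extension, byte[] Header, byte[] Trailer)[]
{
    ("JPG", new byte[] { 0xFF, 0xD8, 0xFF }, new byte[] { 0xFF, 0xD9 }),
    ("PDF", new byte[] { 0x25, 0x50, 0x44, 0x46 }, new byte[] { 0x25, 0x25, 0x45, 0x4F, 0x46 }),
    ("PNG", new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A }, Array.Empty<byte>()),
};

bool StartsWith(byte[] data, byte[] prefix) =>
    data.Length >= prefix.Length && data.Take(prefix.Length).SequenceEqual(prefix);

bool EndsWith(byte[] data, byte[] suffix) =>
    data.Length >= suffix.Length && data.Skip(data.Length - suffix.Length).SequenceEqual(suffix);

// Returns every signature whose header (and trailer, if defined) matches,
// ordered so that the signature with the most matching bytes comes first.
List<(string Extension, int Score)> Classify(
    byte[] data, (string Extension, byte[] Header, byte[] Trailer)[] table) =>
    table.Where(s => StartsWith(data, s.Header) && EndsWith(data, s.Trailer))
         .Select(s => (s.Extension, Score: s.Header.Length + s.Trailer.Length))
         .OrderByDescending(m => m.Score)
         .ToList();

// "%PDF-1.7", a line feed, then "%%EOF"
byte[] pdfLike =
{
    0x25, 0x50, 0x44, 0x46, 0x2D, 0x31, 0x2E, 0x37, 0x0A,
    0x25, 0x25, 0x45, 0x4F, 0x46
};

foreach (var match in Classify(pdfLike, signatures))
{
    Console.WriteLine($"{match.Extension}: {match.Score} matching bytes"); // PDF: 9 matching bytes
}
```

The score is simply the number of header bytes plus trailer bytes that agree, so a signature that defines both a header and a trailer ranks above one that only defines a short header.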
FileSignatureUtil.cs
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
namespace FileHeaderUtil;
public static class FileSignatureUtil
{
static FileSignature[] _fileSignatures = [];
static FileSignatureUtil()
{
string json = File.ReadAllText("file_Sigs.json");
var fileSignaturesRoot = System.Text.Json.JsonSerializer.Deserialize<FileSignatureRootElement>(json, new System.Text.Json.JsonSerializerOptions
{
PropertyNameCaseInsensitive = true
});
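// Fall back to an empty array so that later enumeration never hits a null reference
_fileSignatures = fileSignaturesRoot?.FileSigs?.ToArray() ?? Array.Empty<FileSignature>();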
}
/// <summary>
/// Scans the specified file and returns a list of file signatures that match the file's header and, if applicable,
/// file's trailer.
/// </summary>
/// <remarks>Only file signatures with a defined header are considered for matching. Trailer matching is
/// performed if both the file and the signature define a trailer. By default, a header and a trailer of 64 bytes are
/// evaluated so that file types / extensions with longer headers and trailers are also detected.</remarks>
/// <param name="targetFile">The path to the file to be analyzed. Cannot be null or empty.</param>
/// <param name="byteCount">The number of bytes to read from the file for signature matching. Defaults to 64.</param>
/// <param name="offset">The byte offset at which to begin reading the file for signature matching. Defaults to 0.</param>
/// <param name="origin">Specifies the reference point used to obtain the offset. Defaults to <see cref="SeekOrigin.Begin"/>.</param>
/// <returns>A list of <see cref="FileSignature"/> objects that match the file's header and trailer. The list is empty if no
/// signatures match.</returns>
public static List<FileSignature> GetMatchingFileSignatures(string targetFile, int byteCount = 64, int offset = 0, SeekOrigin origin = SeekOrigin.Begin)
{
static string NormalizeHex(string? hex, bool trimPadding)
{
if (string.IsNullOrWhiteSpace(hex))
{
return string.Empty;
}
var parts = hex.Replace("-", " ").Split(new[] { ' ', }, StringSplitOptions.RemoveEmptyEntries)
.Select(h => h.ToUpperInvariant())
.ToList();
if (trimPadding)
{
while (parts.Count > 0 && (parts.Last() == "0A" || parts.Last() == "0D" || parts.Last() == "00"))
{
parts.RemoveAt(parts.Count - 1);
}
}
return string.Join(" ", parts);
}
var matches = new List<(FileSignature Sig, int Score)>();
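// Pass the caller's byteCount and offset on, so the documented parameters take effect
string fileHeader = NormalizeHex(FileUtil.ShowHeader(targetFile, byteCount, offset), trimPadding: false);
string fileTrailer = NormalizeHex(FileUtil.ShowTrailer(targetFile, byteCount), trimPadding: true);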
foreach (var signature in _fileSignatures)
{
if (string.IsNullOrWhiteSpace(signature?.HeaderHex) || signature.HeaderHex == "(NULL)")
continue;
string sigHeader = NormalizeHex(signature.HeaderHex, trimPadding: false);
string sigTrailer = NormalizeHex(signature.TrailerHex, trimPadding: true);
if (!fileHeader.StartsWith(sigHeader, StringComparison.OrdinalIgnoreCase))
continue;
// Trailer check if defined
if (!string.IsNullOrWhiteSpace(sigTrailer) && sigTrailer != "(NULL)")
{
if (!fileTrailer.EndsWith(sigTrailer, StringComparison.OrdinalIgnoreCase))
continue;
}
// Compute match score (# of matching bytes in header and trailer of file)
int headerScore = CountMatchingPrefix(fileHeader, sigHeader);
int trailerScore = CountMatchingSuffix(fileTrailer, sigTrailer);
int scoreMeasuredAsMatchingByteCount = headerScore + trailerScore;
signature.MatchingBytesCount = scoreMeasuredAsMatchingByteCount;
signature.MatchingTrailerBytesCount = trailerScore;
signature.MatchingHeaderBytesCount = headerScore;
matches.Add((signature, scoreMeasuredAsMatchingByteCount));
}
return matches.OrderByDescending(m => m.Score).Select(m => m.Sig).ToList();
}
// Helpers
private static int CountMatchingPrefix(string source, string pattern)
{
var srcParts = source.Split(' ');
var patParts = pattern.Split(' ');
int count = 0;
for (int i = 0; i < Math.Min(srcParts.Length, patParts.Length); i++)
{
if (srcParts[i].Equals(patParts[i], StringComparison.OrdinalIgnoreCase))
count++;
else break;
}
return count;
}
private static int CountMatchingSuffix(string source, string pattern)
{
if (string.IsNullOrWhiteSpace(pattern)) return 0;
var srcParts = source.Split(' ');
var patParts = pattern.Split(' ');
int count = 0;
for (int i = 0; i < Math.Min(srcParts.Length, patParts.Length); i++)
{
if (srcParts[srcParts.Length - 1 - i].Equals(patParts[patParts.Length - 1 - i], StringComparison.OrdinalIgnoreCase))
count++;
else break;
}
return count;
}
}
As the source code of NormalizeHex shows, trailing padding bytes (0A, 0D and 00) are removed when trimPadding is set, since byte arrays (files or byte columns in databases, for example) are in some cases padded with such bytes. In addition, the hex digits are upper-cased and '-' is replaced by a space ' '.
In the example below, a PDF file is scanned with the console app and the PDF file header and trailer are recognized. In this case, we also peel off trailing bytes at the end, as this specific PDF file was padded with trailing bytes, more specifically 0A.
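As a small illustration, the snippet below repeats the normalization logic as a standalone function and runs it on two hand-written hex strings. The input strings are made-up examples (the start of a PDF header and a padded PDF trailer), not output captured from the util.

```csharp
#nullable enable
using System;
using System.Linq;

// Standalone copy of the normalization logic, for experimenting outside the util class.
string NormalizeHex(string? hex, bool trimPadding)
{
    if (string.IsNullOrWhiteSpace(hex))
    {
        return string.Empty;
    }

    // "25-50-44-46" and "25 50 44 46" both become a list of upper-case byte tokens
    var parts = hex.Replace("-", " ")
        .Split(' ', StringSplitOptions.RemoveEmptyEntries)
        .Select(h => h.ToUpperInvariant())
        .ToList();

    if (trimPadding)
    {
        // Drop trailing line feed (0A), carriage return (0D) and null (00) bytes
        while (parts.Count > 0 && (parts[parts.Count - 1] is "0A" or "0D" or "00"))
        {
            parts.RemoveAt(parts.Count - 1);
        }
    }

    return string.Join(" ", parts);
}

// Header in BitConverter style: dashes are replaced and digits upper-cased
Console.WriteLine(NormalizeHex("25-50-44-46-2d", trimPadding: false));      // 25 50 44 46 2D

// Trailer "%%EOF" followed by two padding line feeds: the padding is removed
Console.WriteLine(NormalizeHex("25 25 45 4F 46 0A 0A", trimPadding: true)); // 25 25 45 4F 46
```

Trimming is only applied to trailers here, since a header that legitimately ends in 0A (the PNG header, for instance) should not be shortened.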
FileUtil.cs
The util class here is used to load a file header or file trailer, usually a small byte array. By default 64 bytes are evaluated, which should cover the file headers and file trailers of most file types. In fact, most file types have a file header or file trailer of 8 bytes or fewer.
namespace FileHeaderUtil
{
/// <summary>
/// Helper class for file operations
/// </summary>
public static class FileUtil
{
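/// <summary>
/// Returns the HEX representation of the file header
/// </summary>
/// <param name="filePath">Path of the file to read</param>
/// <param name="byteCount">Read the first n bytes. Defaults to 64 bytes.</param>
/// <param name="offset">Number of bytes to skip from the start of the file. Defaults to 0.</param>
/// <returns>The bytes as a dash-separated HEX string, for example FF-D8-FF</returns>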
public static string? ShowHeader(string filePath, int byteCount = 64, int offset = 0)
{
if (!File.Exists(filePath))
{
throw new FileNotFoundException(filePath);
}
byte[] header = ReadBytes(filePath, byteCount, offset, SeekOrigin.Begin);
if (header == null)
{
return null;
}
return BitConverter.ToString(header);
}
/// <summary>
/// Returns the HEX representation of the file trailer
/// </summary>
/// <param name="filePath">Path of the file to read</param>
/// <param name="byteCount">Read the last n bytes. Defaults to 64 bytes.</param>
/// <param name="offset">Number of bytes to skip from the end of the file. Defaults to 0.</param>
/// <returns>The bytes as a dash-separated HEX string</returns>
public static string? ShowTrailer(string filePath, int byteCount = 64, int offset = 0)
{
if (!File.Exists(filePath))
{
throw new FileNotFoundException(filePath);
}
byte[] trailer = ReadBytes(filePath, byteCount, offset, SeekOrigin.End);
if (trailer == null)
{
return null;
}
return BitConverter.ToString(trailer);
}
/// <summary>
/// Reads the n bytes of a byte array. Either from the start or the end of the byte array.
/// </summary>
/// <param name="filePath">File path of the target file to read the bytes from</param>
/// <param name="byteCount">The number of bytes to read</param>
/// <param name="offset">Offset - number of bytes</param>
/// <param name="origin">Origin to seek from. Can be either SeekOrigin.Begin, SeekOrigin.Current or SeekOrigin.End</param>
/// <returns></returns>
private static byte[] ReadBytes(string filePath, int byteCount, int offset = 0, SeekOrigin origin = SeekOrigin.Begin)
{
if (!File.Exists(filePath))
{
throw new FileNotFoundException(filePath);
}
if (byteCount < 1)
{
return Array.Empty<byte>();
}
byte[] buffer = new byte[byteCount];
using var fileStream = new FileStream(filePath, FileMode.Open, FileAccess.Read);
if (origin == SeekOrigin.End)
{
// Seek back from the end, but never past the start of the file.
// Without the clamp, files shorter than the requested range make Seek throw.
long distanceFromEnd = Math.Min(fileStream.Length, Math.Abs((long)offset + byteCount));
fileStream.Seek(-distanceFromEnd, origin);
}
else
{
// SeekOrigin.Begin or SeekOrigin.Current: the offset is applied as given
fileStream.Seek(offset, origin);
}
int bytesRead = fileStream.Read(buffer, 0, byteCount);
if (bytesRead < byteCount)
{
Array.Resize(ref buffer, bytesRead);
}
return buffer;
}
}
}
When multiple header/trailer pairs match the byte array of a file, this console app considers at most three matches.
To adjust this, change the Take parameter in Program.cs. Matches are ordered by number of matching bytes.



