Hashcat'ing XenForo

Supporting the unsupported sha256(sha256($pass).$salt) hash type

Travis dropped Mon, Mar 26, 2018

TL;DR - You can skip right to the results for the code.

A list that recently hit Hashes.org, with 1 million records and a low crack rate, looked like an interesting target, given that the community had recovered less than 0.5% of the list. On taking a closer look, we find out why;

Out of the box, only John the Ripper (JtR) Jumbo carries support for the XenForo hash scheme¹ as a dynamic hash².

[List.Generic:dynamic_1503]
Expression=sha256(sha256($p).$s) (XenForo SHA-256)
Flag=MGF_INPUT_32_BYTE
Flag=MGF_SALTED
Flag=MGF_FLAT_BUFFERS
Flag=MGF_KEYS_BASE16_IN1_SHA256
MaxInputLenX86=110
SaltLen=-120 // dont know, so made it big, jfoug
MaxInputLen=110
Func=DynamicFunc__set_input_len_64
Func=DynamicFunc__append_salt
Func=DynamicFunc__SHA256_crypt_input1_to_output1_FINAL
Test=$dynamic_1503$453f2e21fa6c150670d3ecf0e4a0ff3bab8b1903c2e96ad655d960b95f104248$697de9eda4a02563a7ec66d42d4a96995cb2948e29ab76fbcc89e8db71dd10f1:password
Test=$dynamic_1503$a8a0e9545c1475e8546f8546d87fe2516cf525c12ad79a6a7a8fee2fb0d8afd3$697de9eda4a02563a7ec66d42d4a96995cb2948e29ab76fbcc89e8db71dd10f1:verlongcrappypassword01234567890

JtRJumbo’s XenForo defintion.

JtR’s dynamic hashes don’t run on OpenCL, and a benchmark on a single core resulted in 4MH/s, while during a run we hit around 80MH/s across 20 CPU Cores. (H/T Solar Designer for the correction.)

What we had was a pretense for writing a custom OpenCL kernel for Hashcat—perfect. Adding to the effort was the lack of documentation around Hashcat (and sparse code comments).

My general methodology was running Hashcat in single-hash mode against the kernel and using printf() debugging statements to understand what was going on under the hood. Hashcat’s other kernels and its OpenSSL-style functions were definitely key to getting something working in a limited amount of time.

Orientation

./hashcat/include/interface.h³: enums and values required to add a new hash type.
./hashcat/src/interface.c⁴: Blocks that describe each hash type. It contains strings for Hashcat’s help function, as well as various constants and hash-parsing functions.
./hashcat/kernels/: Compiled OpenCL kernels. Hashcat compiles each OpenCL kernel at runtime and caches those kernels here.
./hashcat/OpenCL/⁵: OpenCL kernel sources (.cl files).

Play-by-play

let me smoke a cigarette, and I’ll tell you.

Inspiration

To start with, we identify an OpenCL a0 kernel (or set of kernels) that closely matches our workload and input type. For example, if your hash is salted, find a salted hash of the same type, and likewise if it is iterated.

Copy it to its own kernel file, designated as m[Hashcat Mode Flag]_a0-pure.cl. Search/replace this source to fix up the mode function names. In this guide, I used m01410_a0-pure.cl⁶ as the donor kernel. I’ll be using m01415_a0-pure.cl for the new mode 1415.

Hashcat follows a loose naming convention for modes: xxxyy, where xxx denotes the family and yy is either 00 for unsalted, 10 for an appended salt, or 20 for a prepended salt. There are plenty of counter-examples, but this is the general scheme.

Adding the mode

Next up, add the new mode to Hashcat.

We add a new enum value SHA256_PW_SHA256_SLT to interface.h and set it equal to 1415:

KERN_TYPE_SHA256_PWSLT = 1410,
KERN_TYPE_SHA256_PW_SHA256_SLT = 1415,
KERN_TYPE_SHA256_SLTPW = 1420,

Grabbing a new Hashcat -mode flag, 1415, for the XenForo mode.

HT_* is CLI / Help-related Text and a new entry HT_01415 is added to interface.c:

case 1411: return HT_01411;
case 1415: return HT_01415;
case 1420: return HT_01420;
static const char *HT_01410 = "sha256($pass.$salt)";
static const char *HT_01415 = "sha256(sha256($pass).$salt)";
static const char *HT_01420 = "sha256($salt.$pass)";

Adding reference to our new XenForo mode in interface.c.

A test hash ST_HASH_01415 plus its corresponding plaintext ST_PASS_HASHCAT_PEANUT are used for self-testing:

static const char *ST_PASS_HASHCAT_PEANUT = "peanut";
…
static const char *ST_HASH_01415 = "00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa:b4d93efdf7899fed0944a8bd19890a72764a9e169668d4c602fc6f1199eea449";
…
case 1411: hashconfig->hash_type = HASH_TYPE_SHA256;

Adding a test hash and plaintext to interface.c.

And in the main hash configuration block, the values were lifted from case 1410:, sha256($plain.$salt).

An important bit here is to set kern_type to the enum added to interface.h, in this case KERN_TYPE_SHA256_PW_SHA256_SLT. parse_func is a function pointer, set to sha256s_parse_hash - which will be invoked to parse the hash list. If your hash format can’t be consumed by these functions exactly, it will be rejected. Hashcat has a robust set of these parsing functions, and you’ll want to hunt through them in interface.c⁷.

opti_type does something. Reach out if you have some information about its semantics, and I’ll update this article.

case 1411: hashconfig->hash_type = HASH_TYPE_SHA256;
...
break;
case 1415: hashconfig->hash_type = HASH_TYPE_SHA256;
hashconfig->salt_type = SALT_TYPE_GENERIC;
hashconfig->attack_exec = ATTACK_EXEC_INSIDE_KERNEL;
hashconfig->opts_type = OPTS_TYPE_PT_GENERATE_BE
| OPTS_TYPE_ST_ADD80
| OPTS_TYPE_ST_ADDBITS15;
hashconfig->kern_type = KERN_TYPE_SHA256_PW_SHA256_SLT;
hashconfig->dgst_size = DGST_SIZE_4_8;
hashconfig->parse_func = sha256s_parse_hash;
hashconfig->opti_type = OPTI_TYPE_ZERO_BYTE
| OPTI_TYPE_PRECOMPUTE_INIT
| OPTI_TYPE_PRECOMPUTE_MERKLE
| OPTI_TYPE_EARLY_SKIP
| OPTI_TYPE_NOT_ITERATED
| OPTI_TYPE_APPENDED_SALT
| OPTI_TYPE_RAW_HASH;
hashconfig->dgst_pos0 = 3;
hashconfig->dgst_pos1 = 7;
hashconfig->dgst_pos2 = 2;
hashconfig->dgst_pos3 = 6;
hashconfig->st_hash = ST_HASH_01415;
hashconfig->st_pass = ST_PASS_HASHCAT_PEANUT;
break;

The final configuration block in interface.c.

Quick and dirty test

After a make clean && make, Hashcat should list your new type. Benchmark mode won’t be useful for testing, as the a0 kernels aren’t used here. Running a simple single-hash test against the new kernel will cause Hashcat to build it from the OpenCL directory to the kernels directory. This is when you’ll receive compile-time errors.

In this case, I had known hash/salt/plaintext values (recoverable from JtR and publicly listed). These are essential. You’ll find out how to use the raw results of intermediate steps in the scheme for debugging in the next section.

I used this example for testing:

00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa:b4d93efdf7899fed0944a8bd19890a72764a9e169668d4c602fc6f1199eea449:peanut

Sample XenForo hash, salt, and plaintext tuple - for the plaintext ‘peanut’. The format is [hash]:[salt]:[plaintext].

Steps from the command line:

## echo -n peanut | shasum -a 256
5509840d0873adb0405588821197a8634501293486c601ca51e14063abe25d06 –
## echo -n 5509840d0873adb0405588821197a8634501293486c601ca51e14063abe25d06b4d93efdf7899fed0944a8bd19890a72764a9e169668d4c602fc6f1199eea449 | shasum -a 256
00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa –

Re-create the sample hash for ‘peanut’ using shasum. This confirms our scheme.

Reviewing; with the $salt value b4d93efdf7899fed0944a8bd19890a72764a9e169668d4c602fc6f1199eea449 where sha256('peanut') is 5509840d0873adb0405588821197a8634501293486c601ca51e14063abe25d06 and sha256(sha256('peanut') . $salt) is 00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa. As expected by sha256(sha256($pass).$salt).

Hashcat test:

## ./hashcat -m 1415 -a 0 “00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa:b4d93efdf7899fed0944a8bd19890a72764a9e169668d4c602fc6f1199eea449” peanut.txt

hashcat (v4.0.1-123-g2095e27d+) starting…
OpenCL Platform #1: Apple

=========================

* Device #1: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz, skipped.
* Device #2: Intel(R) HD Graphics 530, 384/1536 MB allocatable, 24MCU
* Device #3: AMD Radeon Pro 460 Compute Engine, 1024/4096 MB allocatable, 16MCU

…
Watchdog: Temperature abort trigger disabled.
* Device #2: ATTENTION! OpenCL kernel self-test failed.
Your device driver installation is probably broken.

See also: https://hashcat.net/faq/wrongdriver
* Device #3: ATTENTION! OpenCL kernel self-test failed.
Your device driver installation is probably broken.

See also: https://hashcat.net/faq/wrongdriver

Testing our new -mode 1415’s -a 0 kernel. peanut.txt contains the word ‘peanut’.

We now have a custom Hashcat build that lists our new type. Awesome. Let’s get these self-tests to pass with a single hash.

Kicking out the jams

You gotta have it, baby, you can’t do without

Oh, when you get that feeling you gotta sock ’em out

The majority of the work on an a0 kernel will be focused on these two functions:

We’re using the donor kernel m01410, and we’ve renamed all the 1410’s to 1415’s.

__kernel void m01415_mxx: Invoked to handle cracking multiple target hashes at once. We can ignore this during testing.
__kernel void m01415_sxx: Invoked in single-hash mode (which is how we’re testing). We can focus our development effort here.

Let’s take a walk through the donor kernel’s m01410_sxx function:

/**
* digest
*/
const u32 search[4] =
{
	digests_buf[digests_offset].digest_buf[DGST_R0],
	digests_buf[digests_offset].digest_buf[DGST_R1],
	digests_buf[digests_offset].digest_buf[DGST_R2],
	digests_buf[digests_offset].digest_buf[DGST_R3]
};

The first block of the m01410_sxx function.

In this kernel, Hashcat doesn’t bother comparing all the bytes in the hash buffer — a basic optimization technique. It’s unlikely to collide with a human-meaningful candidate as selected by Hashcat, even when doing partial matches. The bytes are configured in the dgst_pos* values of the hashconfig struct:

hashconfig->dgst_pos0 = 3;
hashconfig->dgst_pos1 = 7;
hashconfig->dgst_pos2 = 2;
hashconfig->dgst_pos3 = 6;

Our mode’s dgst_pos* values from interface.c.

Use values from another kernel that has the same final hash round as your target.

/**
* base
*/
COPY_PW (pws[gid]);
const u32 salt_len = salt_bufs[salt_pos].salt_len;
u32 s[64] = { 0 };
for (int i = 0, idx = 0; i < salt_len; i += 4, idx += 1)
{
	s[idx] = swap32_S (salt_bufs[salt_pos].salt_buf[idx]);
}

m01410_sxx preps s[] for sha256 so it can be appended before finalization.

The kernel prepares the salt buffer once and carries out operations usually performed by the Hashcat sha256 library’s sha256_update_swap function.

Now is a good time to browse through the methods in inc_hash_sha256.cl⁸. These aren’t documented in any way, and while they resemble other low-level C crypto implementations, some things have been changed to permit more efficient cracking. It helps to select methods out of this library, and those in its hash family (e.g., sha512, sha1), to understand how these functions can be used in various contexts.⁹

/**
* loop
*/
for (u32 il_pos = 0; il_pos < il_cnt; il_pos++)
{
	pw_t tmp = PASTE_PW;
	tmp.pw_len = apply_rules (rules_buf[il_pos].cmds, tmp.i, tmp.pw_len);
	sha256_ctx_t ctx;
	sha256_init (&ctx);
	sha256_update_swap (&ctx, tmp.i, tmp.pw_len);
	sha256_update (&ctx, s, salt_len);
	sha256_final (&ctx);
	const u32 r0 = ctx.h[DGST_R0];
	const u32 r1 = ctx.h[DGST_R1];
	const u32 r2 = ctx.h[DGST_R2];
	const u32 r3 = ctx.h[DGST_R3];
	COMPARE_S_SCALAR (r0, r1, r2, r3);
}

m01410_sxx calculates plaintext candidate hashes according to rules.

The function’s main loop permutes the password root candidate tmp.i according to rules identified in the rules_buf array. It then establishes a sha256 context, updates the context with the password candidate buffer, updates it (minus the swap already performed) with the salt, and then finalizes the hash. It then compares the bytes outlined in hashconfig->dgst_pos* for a match.

Because our source function is sha256($plain.$salt) and our target is sha256(sha256($plain).$salt), we know we need to add a round of sha256 to the plaintext before updating the final sha256 context, ctx.

Check out some other modes that might carry out similar operations…

Mode	Algorithm	Type
40	`md5($salt.utf16le($pass))`	raw hash, salted, and/or iterated
3710	`md5($salt.md5($pass))`	raw hash, salted, and/or iterated
140	`sha1($salt.utf16le($pass))`	raw hash, salted, and/or iterated
4520	`sha1($salt.sha1($pass))`	raw hash, salted, and/or iterated
1440	`sha256($salt.utf16le($pass))`	raw hash, salted, and/or iterated
1740	`sha512($salt.utf16le($pass))`	raw hash, salted, and/or iterated
30	`md5(utf16le($pass).$salt)`	raw hash, salted, and/or iterated
130	`sha1(utf16le($pass).$salt)`	raw hash, salted, and/or iterated
1430	`sha256(utf16le($pass).$salt)`	raw hash, salted, and/or iterated
1730	`sha512(utf16le($pass).$salt)`	raw hash, salted, and/or iterated

Some nice candidates include m02610_a0-pure.cl, m04700_a0-pure.cl, m04520_a0-pure.cl, m04500_a0-pure.cl, and m04400_a0-pure.cl. This builds up more examples of how the Hashcat developers used the libraries and OpenCL environment efficiently.

Let’s take a look at an example of a salted hash, with an extra round of hash on the plaintext. Thankfully, we have one in the same hash family: 4520¹⁰, sha1($salt.sha1($pass)).

Here’s the core routine:

sha1_ctx_t ctx1;
sha1_init (&ctx1);
sha1_update_swap (&ctx1, tmp.i, tmp.pw_len);
sha1_final (&ctx1);
const u32 a = ctx1.h[0];
const u32 b = ctx1.h[1];
const u32 c = ctx1.h[2];
const u32 d = ctx1.h[3];
const u32 e = ctx1.h[4];
sha1_ctx_t ctx = ctx0;
u32 w0[4];
u32 w1[4];
u32 w2[4];
u32 w3[4];
w0[0] = uint_to_hex_lower8_le ((a >> 16) & 255) << 0
	| uint_to_hex_lower8_le ((a >> 24) & 255) << 16;
w0[1] = uint_to_hex_lower8_le ((a >> 0) & 255) << 0
	| uint_to_hex_lower8_le ((a >> 8) & 255) << 16;
w0[2] = uint_to_hex_lower8_le ((b >> 16) & 255) << 0
	| uint_to_hex_lower8_le ((b >> 24) & 255) << 16;
w0[3] = uint_to_hex_lower8_le ((b >> 0) & 255) << 0
	| uint_to_hex_lower8_le ((b >> 8) & 255) << 16;
w1[0] = uint_to_hex_lower8_le ((c >> 16) & 255) << 0
	| uint_to_hex_lower8_le ((c >> 24) & 255) << 16;
w1[1] = uint_to_hex_lower8_le ((c >> 0) & 255) << 0
	| uint_to_hex_lower8_le ((c >> 8) & 255) << 16;
w1[2] = uint_to_hex_lower8_le ((d >> 16) & 255) << 0
	| uint_to_hex_lower8_le ((d >> 24) & 255) << 16;
w1[3] = uint_to_hex_lower8_le ((d >> 0) & 255) << 0
	| uint_to_hex_lower8_le ((d >> 8) & 255) << 16;
w2[0] = uint_to_hex_lower8_le ((e >> 16) & 255) << 0
	| uint_to_hex_lower8_le ((e >> 24) & 255) << 16;
w2[1] = uint_to_hex_lower8_le ((e >> 0) & 255) << 0
	| uint_to_hex_lower8_le ((e >> 8) & 255) << 16;
w2[2] = 0;
w2[3] = 0;
w3[0] = 0;
w3[1] = 0;
w3[2] = 0;
w3[3] = 0;
sha1_update_64 (&ctx, w0, w1, w2, w3, 40);
sha1_final (&ctx);

m04520_a0-pure’s main loop.

We can see that this routine uses two sha1 contexts: ctx1 to contain sha1($pass) and ctx to contain sha1($salt.sha1($pass)). Additionally, this kernel introduces a new macro, uint_to_hex_lower8_le, and adds some logic to the *_sxx function:

const u64 lsz = get_local_size (0);
/**
* bin2asc table
*/
__local u32 l_bin2asc[256];
for (u32 i = lid; i < 256; i += lsz)
{
	const u32 i0 = (i >> 0) & 15;
	const u32 i1 = (i >> 4) & 15;
	l_bin2asc[i] = ((i0 < 10) ? ‘0’ + i0 : ‘a’ – 10 + i0) << 0
		| ((i1 < 10) ? ‘0’ + i1 : ‘a’ – 10 + i1) << 8;
}

bin2asc is a binary to ASCII hex lookup array.

uint_to_hex_lower8_le looks up a byte in l_bin2asc, which returns a lowercase ASCII representation in 2 bytes. We rely on this to convert the raw hex output of the first sha1 round to the lowercase ASCII representation that is usually composed in these salted routines.

The next step is understanding how to use this construct for sha256. We can find that sha1 has an output of 40 bytes, while sha256 outputs 64 bytes. Let’s set this up in our new kernel. First, add the uint_to_hex_lower8_le macro to the beginning of the file. Then add the routine to build the l_bin2asc array. Next, let’s change the core crack routine:

/**
* loop
*/
for (u32 il_pos = 0; il_pos < il_cnt; il_pos++)
{
	pw_t tmp = PASTE_PW;
	tmp.pw_len = apply_rules (rules_buf[il_pos].cmds, tmp.i, tmp.pw_len);
	sha256_ctx_t ctx; //containing the final context
	sha256_init (&ctx);
	sha256_ctx_t ctx1; //containing the sha256($pass) context
	sha256_init (&ctx1);
	sha256_update_swap (&ctx1, tmp.i, tmp.pw_len);
	sha256_final (&ctx1); //finalize the sha256 rounds.
	const u32 a = ctx1.h[0];
	const u32 b = ctx1.h[1];
	const u32 c = ctx1.h[2];
	const u32 d = ctx1.h[3];
	const u32 e = ctx1.h[4];
	u32 w0[4];
	u32 w1[4];
	u32 w2[4];
	u32 w3[4];
	w0[0] = uint_to_hex_lower8_le ((a >> 16) & 255) << 0
		| uint_to_hex_lower8_le ((a >> 24) & 255) << 16;
	w0[1] = uint_to_hex_lower8_le ((a >> 0) & 255) << 0
		| uint_to_hex_lower8_le ((a >> 8) & 255) << 16;
	w0[2] = uint_to_hex_lower8_le ((b >> 16) & 255) << 0
		| uint_to_hex_lower8_le ((b >> 24) & 255) << 16;
	w0[3] = uint_to_hex_lower8_le ((b >> 0) & 255) << 0
		| uint_to_hex_lower8_le ((b >> 8) & 255) << 16;
	w1[0] = uint_to_hex_lower8_le ((c >> 16) & 255) << 0
		| uint_to_hex_lower8_le ((c >> 24) & 255) << 16;
	w1[1] = uint_to_hex_lower8_le ((c >> 0) & 255) << 0
		| uint_to_hex_lower8_le ((c >> 8) & 255) << 16;
	w1[2] = uint_to_hex_lower8_le ((d >> 16) & 255) << 0
		| uint_to_hex_lower8_le ((d >> 24) & 255) << 16;
	w1[3] = uint_to_hex_lower8_le ((d >> 0) & 255) << 0
		| uint_to_hex_lower8_le ((d >> 8) & 255) << 16;
	w2[0] = uint_to_hex_lower8_le ((e >> 16) & 255) << 0
		| uint_to_hex_lower8_le ((e >> 24) & 255) << 16;
	w2[1] = uint_to_hex_lower8_le ((e >> 0) & 255) << 0
		| uint_to_hex_lower8_le ((e >> 8) & 255) << 16;
	w2[2] = 0;
	w2[3] = 0;
	w3[0] = 0;
	w3[1] = 0;
	w3[2] = 0;
	w3[3] = 0;
	sha256_update_64 (&ctx, w0, w1, w2, w3, 40); // prepend the internal sha256 hash.
	sha256_update (&ctx, s, salt_len); // add the salt
	sha256_final (&ctx); // finalize
	const u32 r0 = ctx.h[DGST_R0];
	const u32 r1 = ctx.h[DGST_R1];
	const u32 r2 = ctx.h[DGST_R2];
	const u32 r3 = ctx.h[DGST_R3];
	COMPARE_S_SCALAR (r0, r1, r2, r3);
}

The new main loop for our XenForo kernel.

Next up, we need to address the extra bytes available in sha256’s output:

const u32 a = ctx1.h[0];
const u32 b = ctx1.h[1];
const u32 c = ctx1.h[2];
const u32 d = ctx1.h[3];
const u32 e = ctx1.h[4];
const u32 a1 = ctx1.h[5];
const u32 b1 = ctx1.h[6];
const u32 c1 = ctx1.h[7];
u32 w0[4];
u32 w1[4];
u32 w2[4];
u32 w3[4];
w0[0] = uint_to_hex_lower8_le ((a >> 16) & 255) << 0
	| uint_to_hex_lower8_le ((a >> 24) & 255) << 16;
w0[1] = uint_to_hex_lower8_le ((a >> 0) & 255) << 0
	| uint_to_hex_lower8_le ((a >> 8) & 255) << 16;
w0[2] = uint_to_hex_lower8_le ((b >> 16) & 255) << 0
	| uint_to_hex_lower8_le ((b >> 24) & 255) << 16;
w0[3] = uint_to_hex_lower8_le ((b >> 0) & 255) << 0
	| uint_to_hex_lower8_le ((b >> 8) & 255) << 16;
w1[0] = uint_to_hex_lower8_le ((c >> 16) & 255) << 0
	| uint_to_hex_lower8_le ((c >> 24) & 255) << 16;
w1[1] = uint_to_hex_lower8_le ((c >> 0) & 255) << 0
	| uint_to_hex_lower8_le ((c >> 8) & 255) << 16;
w1[2] = uint_to_hex_lower8_le ((d >> 16) & 255) << 0
	| uint_to_hex_lower8_le ((d >> 24) & 255) << 16;
w1[3] = uint_to_hex_lower8_le ((d >> 0) & 255) << 0
	| uint_to_hex_lower8_le ((d >> 8) & 255) << 16;
w2[0] = uint_to_hex_lower8_le ((e >> 16) & 255) << 0
	| uint_to_hex_lower8_le ((e >> 24) & 255) << 16;
w2[1] = uint_to_hex_lower8_le ((e >> 0) & 255) << 0
	| uint_to_hex_lower8_le ((e >> 8) & 255) << 16;
w2[2] = uint_to_hex_lower8_le ((a1 >> 16) & 255) << 0
	| uint_to_hex_lower8_le ((a1 >> 24) & 255) << 16;
w2[3] = uint_to_hex_lower8_le ((a1 >> 0) & 255) << 0
	| uint_to_hex_lower8_le ((a1 >> 8) & 255) << 16;
w3[0] = uint_to_hex_lower8_le ((b1 >> 16) & 255) << 0
	| uint_to_hex_lower8_le ((b1 >> 24) & 255) << 16;
w3[1] = uint_to_hex_lower8_le ((b1 >> 0) & 255) << 0
	| uint_to_hex_lower8_le ((b1 >> 8) & 255) << 16;
w3[2] = uint_to_hex_lower8_le ((c1 >> 16) & 255) << 0
	| uint_to_hex_lower8_le ((c1 >> 24) & 255) << 16;
w3[3] = uint_to_hex_lower8_le ((c1 >> 0) & 255) << 0
	| uint_to_hex_lower8_le ((c1 >> 8) & 255) << 16;
sha256_update_64 (&ctx, w0, w1, w2, w3, 64); // prepend the internal sha256 hash.

For real this time. Now with a1, b1, and c1.

In my case, it took a few attempts to get the bit shifting / byte ordering correct. Here’s what I did:

We know, ultimately, that w0, w1, w2, and w3 needed to hold the ASCII representation of the sha256 hash, using our example ‘peanut’:

# echo -n peanut | shasum -a 256 | xxd
00000000: 3535 3039 3834 3064 3038 3733 6164 6230 5509840d0873adb0
00000010: 3430 3535 3838 3832 3131 3937 6138 3633 405588821197a863
00000020: 3435 3031 3239 3334 3836 6336 3031 6361 4501293486c601ca
00000030: 3531 6531 3430 3633 6162 6532 3564 3036 51e14063abe25d06
00000040: 2020 2d0a

Grabbing the bytes for the intermediate sha256('peanut') step using xxd.

The correct array would look like this:

w0[0] = 0x35353039;
w0[1] = 0x38343064;
w0[2] = 0x30383733;
w0[3] = 0x61646230;
w1[0] = 0x34303535;
w1[1] = 0x38383832;
w1[2] = 0x31313937;
w1[3] = 0x61383633;
w2[0] = 0x34353031;
w2[1] = 0x32393334;
w2[2] = 0x38366336;
w2[3] = 0x30316361;
w3[0] = 0x35316531;
w3[1] = 0x34303633;
w3[2] = 0x61626532;
w3[3] = 0x35643036;

Hardcoded array for debugging byte-order.

Using these hardcoded values, Hashcat passes the self-test… Progress!

We can play with byte ordering and bit shifting until w0[0] holds 0x35353039 (which, unless you’re shifting on a regular basis, isn’t natural):

w0[0] = uint_to_hex_lower8_le ((a >> 16) & 255) << 0
	| uint_to_hex_lower8_le ((a >> 24) & 255) << 16; //0x35353039;
w0[1] = 0x38343064;
w0[2] = 0x30383733;
w0[3] = 0x61646230;
w1[0] = 0x34303535;
w1[1] = 0x38383832;
w1[2] = 0x31313937;
w1[3] = 0x61383633;
w2[0] = 0x34353031;
w2[1] = 0x32393334;
w2[2] = 0x38366336;
w2[3] = 0x30316361;
w3[0] = 0x35316531;
w3[1] = 0x34303633;
w3[2] = 0x61626532;
w3[3] = 0x35643036;
printf(“\na = 0x%08x, w0[0]=0x%08x\n”, a,w0[0]); // debug

Everything but w0[0] is hardcoded. The printf helps us sort the right ordering.

You can debug the kernel by alternating rm -rf ./hashcat/kernels with your Hashcat single-hash test command e.g. ./hashcat -m 1415 -a 0 “00050655d5d6b8a8c14d52e852ce930fcd38e0551f161d71999860bba72e52aa:b4d93efdf7899fed0944a8bd19890a72764a9e169668d4c602fc6f1199eea449” peanut.txt. This is how I nailed down byte ordering and conversion.

With the right values in w0, w1, w2, and w3, we can move the *_sxx changes over to *_mxx. Just reapply your changes there while making sure the final comparison function is COMPARE_M instead of COMPARE_S.

Results

On an eight Nvidia 1080 GPU cracking rig from sagitta, we hit 605 MH/s with this unoptimized kernel — 7 times faster than CPU mode.

I’ll need to wait for another rainy day (and a corpus with a shorter salt) to hack on an optimized kernel.

You can find the final implementation at this GitHub Gist.