Tuesday, December 19, 2017

digging around in machine code and a licence violation

Hi,
today I will tell you how I found a violation of the LLVM license by an international software company with 50+ employees, and how I found a way around the disassembly protections of their plugin system.

Due to legal concerns I will not name the company or the software here. I will just refer to the software as "the program", and I will not give the real version numbers either. Furthermore, I would like to mention that I have never agreed to any end-user license agreement, because the binaries of the program were available on a GitHub account and so no installation was necessary.

First let me briefly tell you why I looked at this program: A friend of mine uses it to do financial analysis. The program has a plugin infrastructure that allows you to write your own indicators and automatic trading systems. As is usual in the financial sector, people want to earn money. So the program includes the option to distribute (sell) only the compiled plugins without the source code. (They actually put pretty heavy measures in place to prevent people from learning what a compiled plugin actually does.) My friend came across such a compiled indicator and wanted to learn about its functionality.
Here is a short list of reasons why I am publishing this article:

  • They actually violate some conditions of the LLVM license with their plugin infrastructure. (Details below.)
  • It should be possible to learn about what some compiled program does when I execute it on my computer.
  • If financial decisions are based on the output of some indicator it should be possible to analyse it (and if needed improve it or warn other people about dangerous stuff in it). 

So the goal is to transform a compiled plugin file to readable text (at least assembler, i.e. disassembled) or even understandable text (C++ code or some pseudo code, i.e. "decompiled").

There are actually several major versions of this program (A, A', B) and several file formats for the compiled plugins (a, b, c):

  • program version A (ancient) produces plugins of type a 
  • program version A' (old, but still heavily in use, same major version as A) produces plugins of type b 
  • program version B (new, incremented major version) produces plugins of type c
For plugins of type a there is a tool to "decompile" compiled plugins. As it turned out, the compiled file that we want to understand is actually of type b. The company has implemented extremely heavy countermeasures starting with version A' and type b. Type b and type c seem to differ only by a version string at the beginning of the file.

Here is one example of how they try to protect their program code and compiled plugin files from disassembly: The binaries do not actually contain the real machine code, but only a small startup routine that later decrypts/decompresses the main part of the machine code from data sections in the binary. This is basically self-modifying code and a popular technique among malware programmers to hide the functionality of their creations from analysis tools such as regular disassemblers.
Even though the code contains all the functionality and all the information necessary to decrypt/decompress the actual machine code, in order to analyse it one would first have to understand the encryption/compression used and re-implement it. (But there is a much easier way, because they made a mistake! See below!)
Additionally they try to prevent debugging, i.e. the process of running a code step by step to learn its functionality. The conventional anti-anti-debugger tricks unfortunately did not work here.
The compiled plugin files are not constant in time, meaning that if you compile the same source file at different points in time, the compiler will generate very different compiled files. Also, neither any recognizable structure nor any string constants are visible in the compiled files.
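
The two observations above (different output for the same source, no visible structure) can be checked with a few lines of Python. This is only a minimal sketch: the function names are my own, and the interpretation (entropy close to 8 bits per byte hints at compressed or encrypted data) is a rule of thumb, not a proof.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte; values near 8.0 hint at compressed/encrypted data."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def matching_fraction(a: bytes, b: bytes) -> float:
    """Fraction of positions (over the shorter input) where both inputs hold the same byte."""
    length = min(len(a), len(b))
    if length == 0:
        return 0.0
    same = sum(x == y for x, y in zip(a, b))
    return same / length
```

To use it, read two builds of the same plugin (the file names are up to you, for example `open("build1.bin", "rb").read()`) and compare them. Two encrypted builds of the same source should show a high entropy each and a matching fraction close to what you would expect from random data.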

So we have here three different forms in which we can have a plugin:
  1. the source code
  2. the compiled machine code
  3. the transformed and somehow compressed and/or encrypted (with timecode) plugin file: let's call it "encrypted" for simplicity here
So we have the following situation: We have a plugin compiler that creates encrypted files that we can't understand. The main program can read these encrypted files, transform them back to machine code and execute them. Neither program can simply be read in a disassembler or run in a debugger.


But here is where their mistake comes in: The main program keeps the whole back-transformed machine code in memory instead of deleting it after execution!

I noticed this one night when I could not sleep: I dumped the whole "physical" memory of the Windows virtual machine where I ran my analysis and dug through it to find the actual machine code of the program. A much simpler way to accomplish the same thing is to dump the memory of a single process directly from the Windows task manager.

Because I could not find the content of the "encrypted" plugin files in the memory dump, I tried some experiments with integer numbers and simple structures in plugins that I compiled with the plugin compiler and then loaded into the main program. Finding these exact integer numbers in the memory dump let me discover the locations and formats of the compiled machine code of the plugins in the memory dump.
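
The search for such marker numbers can be sketched in Python as follows. This is only an illustration under assumptions: I assume the constant ends up as a 32-bit little-endian value in the machine code (as it would on i386), and the function names and the example constant are made up for this sketch.

```python
import struct
from typing import List


def find_all(haystack: bytes, needle: bytes) -> List[int]:
    """Return every offset at which needle occurs in haystack (overlaps included)."""
    offsets = []
    pos = haystack.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = haystack.find(needle, pos + 1)
    return offsets


def find_int32(dump: bytes, value: int) -> List[int]:
    """Search the dump for a 32-bit little-endian encoding of value."""
    needle = struct.pack("<I", value & 0xFFFFFFFF)
    return find_all(dump, needle)
```

Pick a constant that is unlikely to appear by chance (something like 0x13572468 rather than 1 or 1000), put it into a test plugin, load the plugin and then call `find_int32(dump, 0x13572468)` on the bytes of the memory dump.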

The "magic" numbers to look for are: "55 89 E5 83 EC 08" or in assembler (i386):

  • 55 = push ebp
  • 89 E5 = mov ebp, esp
  • 83 EC 08 = sub esp, 0x8
These are EXACTLY the first three assembler instructions that are executed at the start of a C function when compiled with clang or gcc (with -O0 -m32).
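
Searching a dump for this prologue can be sketched in Python like this. Note one assumption that goes beyond the exact bytes above: the last byte of "83 EC 08" is the size of the stack frame, which differs from function to function, so this sketch accepts any value there and reports it. The function name is my own.

```python
import re
from typing import List, Tuple

# push ebp; mov ebp, esp; sub esp, imm8 (the imm8 byte is captured)
PROLOGUE_RE = re.compile(rb"\x55\x89\xE5\x83\xEC(.)", re.DOTALL)


def find_prologues(dump: bytes) -> List[Tuple[int, int]]:
    """Return (offset, frame_size) for every function prologue candidate in the dump."""
    return [(m.start(), m.group(1)[0]) for m in PROLOGUE_RE.finditer(dump)]
```

Every hit is only a candidate for the start of a function: the same six bytes can also occur by chance inside data, so the surrounding bytes should still be checked in a disassembler.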

So now we have found the compiled machine code of a plugin in the memory dump. We can now extract the functions and either disassemble them the conventional, cumbersome way or use the recently released RetDec from Avast (https://github.com/avast-tl/retdec) to decompile them to readable C code!

LLVM-license violation:
When digging through the memory dump I found that the actual machine code of the plugin compiler has some strange strings in it, for example "After splitting live range around basic blocks" and "Number of branches unswitched". A quick internet search revealed that this is actually code from LLVM, whose license contains the following condition:




"Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimers in the documentation and/or other materials provided with the distribution."

But neither the documentation on the website nor the distributed files seem to contain this copyright notice, the list of conditions or the disclaimers. To me this is a clear violation of the license.
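
For anyone who wants to repeat the string search that exposed the LLVM code, here is a minimal Python sketch. It works like the classic "strings" tool: it pulls runs of printable ASCII characters out of a dump and looks for known marker strings. The function names and the minimum length of 6 are my own choices.

```python
import re
from typing import Dict, Iterable, List


def extract_strings(data: bytes, min_len: int = 6) -> List[str]:
    """Return all runs of at least min_len printable ASCII characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def find_markers(data: bytes, markers: Iterable[str]) -> Dict[str, int]:
    """Map each marker string found in data to the offset of its first occurrence."""
    hits = {}
    for marker in markers:
        pos = data.find(marker.encode("ascii"))
        if pos != -1:
            hits[marker] = pos
    return hits
```

Calling `find_markers` with the two strings quoted above on the bytes of the memory dump shows whether, and where, they occur.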