Deep Reinforcement Learning Frameworks in Cryptocurrency Market Making: Research Paper Discovery and Analysis


Authors: Jonathan Sadighian (Papers 1 and 2); Amin Fadaeddini, Babak Majidi & Mohammad Eshghi (Paper 3); Sergey Nasekin and Cathy Yi-Hsuan Chen (Paper 4)

Paper 1

Deep Reinforcement Learning in Cryptocurrency Market Making

Market makers provide liquidity to a market, which facilitates transactions and keeps that market healthy. On centralized cryptocurrency exchanges there are no designated market makers; liquidity is supplied by ordinary participants. This paper uses Limit Order Book (LOB), Trade Flow Imbalance (TFI), and Order Flow Imbalance (OFI) data to train an informed deep reinforcement learning market-making agent.

The observation space of the agent consists of three smaller sub-spaces.

  1. The environment state space (ESS). This consists of snapshots of the LOB, TFI, and OFI with a lookback window of w.

  2. The agent state space (ASS). This consists of risk and position indicators.

  3. The agent action space (AAS). It consists of the agent’s last action.
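The three sub-spaces above are combined into a single observation vector for the agent. A minimal sketch of that assembly is below; the function name, the flat list representation, and the choice of inventory as the sole ASS indicator are illustrative assumptions, not the paper's exact encoding.

```python
def build_observation(lob_window, tfi_window, ofi_window, inventory, last_action):
    """Flatten the ESS (LOB/TFI/OFI lookback window of w snapshots),
    ASS (risk/position indicators), and AAS (last action) into one vector."""
    # ESS: every feature of every snapshot in the lookback window
    ess = [x for snapshot in (lob_window + tfi_window + ofi_window) for x in snapshot]
    ass = [inventory]      # ASS: a single position/risk indicator here
    aas = [last_action]    # AAS: id of the agent's previous action
    return ess + ass + aas
```

In practice the ESS portion dominates the vector's length, since it contains w snapshots of every LOB, TFI, and OFI feature.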

Two reward functions are evaluated in this DRL framework.

  1. Total Profit-and-Loss (PnL).

  2. Trade completion.

Each function returns some value r which is the signal the agent uses to learn.
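The two reward styles can be sketched as simple functions of the agent's account state. These are simplified illustrations, not the paper's exact formulas; the holding penalty in the trade-completion variant is an assumed detail.

```python
def pnl_reward(prev_equity, equity):
    """Total PnL reward: the change in account value since the last step."""
    return equity - prev_equity

def trade_completion_reward(round_trip_pnl, completed, penalty=0.0):
    """Goal-based reward: pay out only when a round-trip trade completes;
    otherwise apply an optional small holding penalty."""
    return round_trip_pnl if completed else -penalty
```

The key difference is that the PnL reward emits a signal every step, while the trade-completion reward is sparse and pushes the agent to actually close positions.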

Two advanced policy-gradient methods were used as market-making agents: A2C and PPO. Both are on-policy, model-free actor-critic algorithms: they do not need a model of the environment, and they incorporate a value-function estimate that “criticizes” the agent’s policy.
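The "criticizing" in both A2C and PPO happens through an advantage estimate: the critic's value function predicts how well a state should go, and the actor is updated in proportion to how much better or worse the action actually turned out. A minimal one-step version (both algorithms typically use multi-step or generalized advantage estimates in practice):

```python
def one_step_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """A(s, a) ≈ r + γ·V(s') − V(s): positive when the action did better
    than the critic expected, negative when it did worse."""
    target = reward + (0.0 if done else gamma * value_next)
    return target - value_s
```

A positive advantage increases the probability of the action under the policy gradient; a negative one decreases it.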

Results

The trade completion reward function yielded greater and more stable returns than the positional PnL reward because trade completion agents completed more trades. When applied to different currencies, the agents generated similar results, and some agents even outperformed their same-currency counterparts, which suggests that limit order books exhibit universal characteristics across currencies.


Using only limit order book data and trade- and order-flow indicators, the agent is able to generate profitable daily returns without prior knowledge of how market making is performed.

Moreover, the DRLMM framework enables the agent to generate profitable daily returns on different cryptocurrency data sets, highlighting the agent’s ability to generalize effectively and learn the non-linear characteristics of its observation space.

The underlying blockchain technology offers these benefits.

  1. Immutability means that data cannot be changed by third parties.

  2. Consensus removes the possibility of corruption, tampering, and censorship.

  3. Decentralized security through cryptography and the lack of a single point of failure.

  4. No downtime.

Paper 2

Extending Deep Reinforcement Learning Frameworks in Cryptocurrency Market Making

This paper builds upon the previous one by investigating the effect of different reward functions. It also introduces a new, price-based event trigger into the environment.

Reward functions


The reward functions are categorized as profit-and-loss (PnL), goal-oriented, or risk-based approaches. The PnL-based rewards are below.

  1. Unrealized PnL.

  2. Unrealized PnL with realized fills.

  3. Asymmetrical unrealized PnL with realized fills.

  4. Asymmetrical unrealized PnL with realized fills and ceiling.

  5. Realized PnL change.

The goal-based reward is trade completion.

The risk-based reward is the differential Sharpe ratio.
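The differential Sharpe ratio rewards the marginal effect of each new return on an exponentially weighted Sharpe ratio. A sketch of the usual formulation (after Moody and Saffell) is below; the paper's exact parameterization, including the decay rate eta, may differ.

```python
def differential_sharpe(R, A_prev, B_prev, eta=0.01):
    """One-step differential Sharpe ratio: how much return R improves an
    exponentially weighted Sharpe ratio. A and B are EMAs of returns and
    squared returns; returns (reward, A_new, B_new)."""
    dA = R - A_prev                      # innovation in mean return
    dB = R * R - B_prev                  # innovation in second moment
    denom = (B_prev - A_prev * A_prev) ** 1.5
    D = (B_prev * dA - 0.5 * A_prev * dB) / denom if denom > 0 else 0.0
    return D, A_prev + eta * dA, B_prev + eta * dB
```

Because only the EMAs A and B are carried between steps, the reward can be computed online at every agent step.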

Price-Based Events

Usually in reinforcement learning, the agent steps through the environment at a time-based interval. For financial trading this interval can range from seconds to days, depending on the strategy. The most common approach in market making, however, is to use tick events (a new, cancelled, or modified order) as the trigger for the agent to act. In practical applications this tick-based approach is suboptimal due to partial executions, latency, and other factors, so this paper introduces price-based events as the trigger for agent interaction.

A price-based event is a move in the midpoint price greater than a certain predefined threshold.
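Such a trigger is easy to express in code. In the sketch below, the fractional-move formulation and the default threshold are assumptions for illustration; the paper defines its own threshold.

```python
def mid_price(best_bid, best_ask):
    """Midpoint of the best bid and ask quotes."""
    return (best_bid + best_ask) / 2.0

def is_price_event(last_event_mid, current_mid, threshold=0.001):
    """Trigger the agent only when the midpoint has moved by more than
    `threshold` (as a fraction of the last event's midpoint)."""
    return abs(current_mid - last_event_mid) / last_event_mid > threshold
```

Between events the agent simply observes; this filters out the order-book noise that tick-based stepping exposes it to.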


The PnL-based reward functions encouraged near-sighted trading behavior (uPnL), speculative trading behavior (rPnL), or tactical trading behavior that failed to exploit large price movements (auPnL).

The trade completion reward function resulted in more active trading and inventory management. The differential Sharpe ratio was very inconsistent and volatile. However, the authors concluded that this approach has the potential to provide better results through a better parameter search.

Time-based events were harder for agents to learn, with roughly 18% of experiments being profitable. However, these agents (with the trade completion reward function) were able to achieve the highest returns, as they could react more quickly to adverse price movements.

On the other hand, price-based events were easier to learn for the agents, with roughly 27% of experiments returning a profit. This approach reduces noise in the data, as agents will only react during price changes. The resulting trading strategies of the agents were more stable and less erratic during large price jumps.


An A2C agent with a goal-based trade completion reward function generated the greatest return in both time- and price-based environments. The authors noted that significant price jumps occurred during their testing, affecting profitability, and suggest that convolutional, attention-based, and recurrent neural networks could help agents learn to exploit these jumps better.

Paper 3

Secure Decentralized Peer-to-Peer Training of Deep Neural Networks Based on Distributed Ledger Technology

This paper proposes a secure decentralized peer-to-peer framework for training deep neural network models. Current methods for obtaining large datasets to train neural networks are costly and difficult. Distributed ledger technology provides a solution to this problem.

Because blockchains can tokenize any asset, they can also be used to share neural network models. The framework proposed in this paper does not share data publicly; instead, the model is trained on the data locally, and only the trained parameters of the model are shared. The authors build their framework on the Stellar blockchain because it is open-source and has a built-in decentralized exchange (DEX).

The proposed framework has four types of asset tokens.

  1. Deep Learning Model (DLM), which is issued by model initiators and distributed to validators assigned the task of verifying trained models.

  2. Verified Learned Model (VLM), which is also issued by model initiators to validators.

  3. Deep Learning Coin (DLC), which computing partners must pay to participate in training procedures. The motivation behind this cost is that hostile participants must pay more if they try to harm training.

  4. Stellar’s native token (XLM) is paid to computing partners.


Computing partners are incentivized to train models on the network through Stellar’s native token. Once a partner finishes the training procedure, the trained model goes to validator nodes for validation, and a pre-arranged amount of XLM is then credited to the partner’s account. The framework also deters malicious behavior through a reputation-based scheme.
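The incentive flow can be sketched as a small settlement step. The token names come from the paper, but the fee and payout mechanics below are a simplified assumption, not the actual Stellar transaction logic.

```python
def settle_training_job(balances, partner, initiator, fee_dlc, reward_xlm, validated):
    """Simplified settlement of one training job: the computing partner pays
    a DLC fee to participate; if validators approve the trained model, the
    model initiator pays the pre-arranged XLM reward."""
    balances = {k: dict(v) for k, v in balances.items()}  # avoid mutating input
    balances[partner]["DLC"] -= fee_dlc                   # cost of participation
    if validated:                                         # validators approved
        balances[partner]["XLM"] += reward_xlm
        balances[initiator]["XLM"] -= reward_xlm
    return balances
```

Note that a hostile participant still loses the DLC fee when its model fails validation, which is the economic deterrent the paper describes.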

Case Study

The authors implemented their proposed framework to conduct decentralized training of autonomous cars. The traditional centralized approach collected driving data from the cars in public and sent it back to a central organization to train the model. This process involves private users giving away their data. A decentralized approach, however, would enable training without any data leaving the custody of the user. With this approach, it is also possible to distribute the model to all the participants.

There are three types of participants in the network.

  1. Model owners who issue the model.

  2. Data owners who collect data and use it to train the model.

  3. Validators who validate trained models.

In the context of training autonomous cars, the data owner is the car owner, and the car has onboard capabilities to train the model. After a certain distance has been driven, the car downloads the latest checkpoint of the model and trains it in real time on the data it has collected. The trained model is then re-uploaded to the network to be validated, and the participant who trained it is rewarded for the work.
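That per-car cycle can be sketched as a short loop. All function names here are hypothetical; the real system's download, training, and upload steps go through Stellar and IPFS rather than simple callbacks.

```python
def car_training_cycle(km_driven, threshold_km, download, train, upload):
    """Hypothetical on-board loop: once `threshold_km` has been driven,
    fetch the latest global checkpoint, fine-tune it on locally collected
    data, and re-upload the result for validator review.

    `download`, `train`, and `upload` are injected callbacks standing in
    for the network operations."""
    if km_driven < threshold_km:
        return None                      # keep collecting data locally
    checkpoint = download()              # latest global model checkpoint
    updated = train(checkpoint)          # local fine-tuning on the car's data
    return upload(updated)               # submit the trained model for validation
```

The important property is that raw driving data never leaves the car; only the updated model parameters do.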

This paper showed that distributed training is possible, though the authors noted that the framework is not entirely on-chain: some aspects, such as the trained models stored on IPFS, live off-chain. It still demonstrates that rigorous training with large datasets is not solely the province of centralized systems, and that a decentralized ecosystem can support a legitimate incentive scheme that also discourages malicious acts.

Paper 4

Deep Learning-Based Cryptocurrency Sentiment Construction

This paper studied cryptocurrency investor sentiment using recurrent neural networks (RNNs). The authors used a body of messages collected from StockTwits to gauge representative opinions on cryptocurrency trends, relying on RNNs to learn long-term semantic and syntactic dependencies in the messages.


RNNs are used in language processing because they retain a form of “memory” while training. Specifically, the authors used the long short-term memory (LSTM) and gated recurrent unit (GRU) variants of RNNs to overcome the vanishing-gradient problem, which causes a basic RNN to fail to capture long-range dependencies in sentences (for example, when related words are spaced far apart).
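A toy illustration of why gradients vanish: backpropagating through many timesteps multiplies the gradient by a per-step factor, so factors below 1 shrink it exponentially. The numbers below are illustrative, not taken from the paper.

```python
def gradient_magnitude(per_step_factor, steps):
    """Backpropagating through `steps` timesteps scales the gradient by
    the per-step factor each time. Factors below 1 make it vanish
    exponentially; factors above 1 make it explode. LSTM/GRU gating keeps
    this effective factor close to 1 along the cell-state path."""
    g = 1.0
    for _ in range(steps):
        g *= per_step_factor
    return g
```

With a per-step factor of 0.9 over 50 timesteps, the gradient is already below 1% of its original size, so a plain RNN effectively cannot relate words 50 tokens apart.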


However, messages cannot be fed straight into the model; they first need to be represented in vector form. To do this, the authors used the Word2Vec model, proposed in 2013, to create embeddings of the message inputs. Word2Vec is a shallow neural network that attempts to predict context words given an input word.
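Once each word has an embedding, a message becomes a sequence of vectors. The sketch below collapses that sequence by averaging, purely to illustrate the embedding lookup; the paper instead feeds the per-token vectors into the RNN in order, and the toy embedding table is a made-up example.

```python
def message_vector(message, embeddings, dim=3):
    """Turn a message into one fixed-size vector by averaging the
    Word2Vec embedding of each known token (unknown words are skipped)."""
    vectors = [embeddings[w] for w in message.lower().split() if w in embeddings]
    if not vectors:
        return [0.0] * dim               # no known tokens in the message
    # element-wise mean across the token vectors
    return [sum(col) / len(vectors) for col in zip(*vectors)]
```

Usage: with a toy table `{"btc": [1.0, 0.0, 1.0], "moon": [0.0, 2.0, 1.0]}`, the message "BTC to the moon" maps to the average of the two known tokens' vectors.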

Data Preprocessing

Text messages on StockTwits required light preprocessing to make them interpretable by the embedding layer.


Supervised training is enabled by StockTwits’ feature of letting users tag messages as bullish or bearish. Because bearish messages constituted only 16% of the overall dataset, the authors oversampled bearish messages in the training subset of the data.
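Oversampling the minority class is a standard rebalancing step; a minimal version is below. The exact ratio the authors targeted is not stated here, so this sketch simply balances the two classes.

```python
import random

def oversample_minority(samples, labels, minority_label, seed=0):
    """Duplicate minority-class samples (with replacement) until both
    classes are equally represented in the training set."""
    rng = random.Random(seed)
    minority = [s for s, y in zip(samples, labels) if y == minority_label]
    majority = [s for s, y in zip(samples, labels) if y != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return samples + extra, labels + [minority_label] * len(extra)
```

Only the training subset is rebalanced; the test set keeps the natural 16/84 split so evaluation stays honest.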

Estimation Results

Performance metrics for the model’s ability to classify messages as bullish or bearish were reported; the LSTM setup with a pre-trained Word2Vec embedding layer performed best on precision and recall.


The authors then attempted to build an aggregate cryptocurrency sentiment index that would represent the cryptocurrency market’s opinion.

They proposed that sentiment extracted from StockTwits may help predict the market’s volatility, and produced a sentiment-driven volatility model capable of capturing the actual fluctuations of absolute returns of the cryptocurrency index.
