Friday, June 23, 2017

Struggling Quant Episode 1: How I lost USD 500,000 while figuring out the link between questions, math, stats, coding and trading

Say that you are 30 years old and you have a good 25 years to work hard. Instead of taking the easy way of working for someone else during the day and killing time in the evenings and weekends, you have chosen the hard path of quantitative trading and started the heavy work. How many profitable trading ideas can you find in 25 years? Let's say you can figure out and test an idea in 10 days: 36 ideas a year and 900 ideas in 25 years. Since this is my post, I am rounding 900 up to 1,000. Assuming a hit rate of 5%, that is 50 profitable ideas in 25 years, 2 per year, and let's also say a profitable idea makes 20% per year. Depending on your initial capital, you will make X USD in 25 years. Let's say this X is USD 12,500,000 for me, or USD 500,000 per year on average. This is my potential if I start doing things right now, right? So if I do things wrong for a year, I lose USD 500,000, and that is what I did last year. I could not figure out how to operationalize all the scientific literature, the huge quant trading content on the internet and the coding, and come up with a machine for trading research. We can argue that the time I invested will pay back, but at the end of the day, I have 24 years to go and no profits in my trading account; this is my reality. If you take quant trading seriously and you believe in your potential, I think you are losing some money as well; you had better put in your own numbers and do your own calculation.

I have spent a whole year trying to put the pieces together: learning math and stats, coding for strategy development, backtesting and live trading, understanding market structure and trying to catch up with the huge information flow from blogs, websites, experts, sellers, marketers, academic research etc. This is my second year in my journey to become a quantitative trader. I am still struggling a lot. I cannot say that the picture is clear, but I think I am taking some steps forward. I have started to operationalize all this mass and to do things in a more time-efficient and repeatable manner.

There are two reasons I am posting this. The first is that I realized publishing something like this forces me into a structure, which I need for success in trading. The second is that I want to help you if you are going through a similar journey; I would be happy if I can accelerate things for you.

I do not have an institutional background in quant trading. I also do not have a degree in math, stats or computer science. I have a bachelor's in economics, so I am trying to figure things out mostly from point zero. In the last one and a half years, I have learned to code in Python, I have developed my own event-driven backtester and live trading platform, and I have started doing trading research using Python libraries. I have also invested a lot of time learning math and stats, and I can say that I have an overall understanding of the methods that are applicable to quantitative trading. I only use price data and I focus on intraday strategies for major currency pairs. I now have three FX strategies that I have been paper trading with a demo account for four months. These strategies have generated a Sharpe ratio of 1.4 out of sample with a maximum drawdown of 18% and an average leverage of 5. They place an average of 1.5 trades a day. Paper trading performance is not as good as the out-of-sample test, but not that far off either; I take this as a good start.

Among the big list of my big struggles, linking coin flips (math and stats) to actual trading ideas was one of the hardest things to figure out. I mean, being able to operationalize all these methods and this information flow and come up with a working machine for trading research was one of my biggest struggles. It may be easier for a person with a relevant background, but for me this is like trying to fix a broken car with a tool set I know nothing about, and without any background with cars. Here you either need an understanding of cars or an understanding of the tools to start doing something.

Going back to my challenge of putting the pieces together: the information available for people starting this journey is either from car experts with an overall knowledge of the repair tools, or from tool experts with an overall understanding of cars. It is very hard to find resources with a good balance between the car and the tools that also touch on real-life problems. Maybe these people, unlike me, all have the picture clear in their minds but are not willing to give their secrets away; I do not know.

My interpretation is that these two groups have a common claim, which is, "here are the tools and methods that will help you find the answers". But what is an answer without a question, or without the right question? Albert Einstein reportedly said, "If I had an hour to solve a problem and my life depended on the solution, I would spend the first 55 minutes determining the proper question to ask, for once I know the proper question, I could solve the problem in less than five minutes."

Long story short, my belief now is that trading research should start with a good question (art) and this question should be answered with the scientific method (science). So the link between all the relevant science literature and trading is good-quality questions answered with the scientific method. This has been the glue for me, putting all the pieces together. Very simple, right? I can hear people saying that I am reinventing the wheel, that quant trading is all about the scientific method. I respect you if you were able to operationalize all this and come up with a working machine for trading research. This has been very hard for me to realize, since it is very easy to lose sight of the big picture when going through such a complex adventure. Some can easily spend years reading and thinking about stats and math, coding and reading articles without a clear picture and a plan for what needs to be done. My personal experience was mostly around trying to apply standalone statistical or econometric models to price data to find edges, or researching known trading ideas with my own not-so-structured methods; both ended in failure.

I will try to walk you through my way of developing trading strategies using the scientific method with the help of a simple example. Along the way, I will try to provide formula-free explanations of basic math and stats concepts applicable to this example. The focus here will be on the method and the proper usage of the tools rather than the trading idea itself, so please do not take the trading idea seriously.


Everything starts with an observation and a question, and these two can switch places sometimes. You can observe market activity, macroeconomic factors and other things and come up with a question. You can dig into data, notice something that you can relate to a real-life phenomenon and ask a question. Or you can simply read an academic article where the whole cycle of the scientific method is given with a conclusion, and you would like to apply your own thinking to the observation at hand. I can also argue that if we assume a given data set is able to answer a limited number of questions, then we can claim that data science is about finding the relevant questions, not the answers. Anyway.

I have this simple observation.

Observation: The Japanese session is low activity for European currencies; when the European session starts, there is an increase in activity, volatility kicks in, volume goes up and the game of the day starts.

Then I ask a question. This is only one way of asking a question in relation to this observation. The quality of the observation and the question is the most important part of the entire picture.

Question: Can the initial direction of European currencies at the European session opening predict price movements throughout the session for the same currency set?

Now I need to create a hypothesis, which is actually my answer to the question I have just asked: my educated guess. All that math and stats you are trying to digest is mostly used here, while experimenting to see whether this hypothesis is true or not. The hypothesis should be a testable answer to the question.

Hypothesis: The direction of the European session opening for European currencies is an indication of the direction at the session closing. Or, I can say, the opening direction tends to persist throughout the session.

The critical point here is that a hypothesis needs two things: a dependent variable, which is the thing you are trying to predict, and an independent variable, which is the information you are planning to use to make this prediction.

dependent variable: direction of the European currency at session closing (session direction)

independent variable: direction of the European currency at session opening (opening direction)

So let's prepare some data and start. You do not need to know how to code, or to do the coding, to follow along; just read. I have chosen to publish this with the code to give new starters a feel for how it looks.

In [316]:
#do necessary imports

#this is my vectorized backtester for quick and dirty backtesting of research findings
import bulk_tester as bt

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Get EURUSD M15 bars for the backtesting period (BT, 01.01.2009 to 01.01.2014) with no resampling ('original').

get_data is a function I created for loading price data into a pandas dataframe, either as-is or resampled to higher time frames.

In [323]:
dftest = bt.get_data('EURUSD', 'BT', 'original')

Add some calculated columns to the dataframe that will be necessary later. I need changes in volume and returns to be able to measure increasing market activity and to calculate the values of my dependent and independent variables.

In [324]:
#absolute percentage price change
dftest['abs_pct_return'] = abs(dftest.Bid.pct_change())
#percentage volume change
dftest['vol_change'] = dftest.volume.pct_change()
#percentage return
dftest['pct_return'] = dftest.Bid.pct_change()
#pips change
dftest['pips_change'] = 10000*(dftest.Bid - dftest.Bid.shift(1))

dftest = dftest.dropna()

I first want to validate my observation that the Japanese session is low activity and that the action starts with the European session for EURUSD. An easy way of seeing this is to create a graph where average absolute return and average volume are plotted against 15-minute intervals. To get such a graph, I need to create a new column combining hour and minute, so that I can group absolute returns and volume by this new column and take the average.

In [326]:
#this is a function taking a string as input and returning the string left-padded
#with a zero if its length is one (str.zfill(2) would do the same)
def fix(string):
    if len(string) == 1:
        string = '0' + string
    return string
#add a new column in HH:MM format
dftest['hour_min'] = dftest.index.hour.map(str).map(fix) + ":" + dftest.index.minute.map(str).map(fix)
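The next two cells lost their input code in the export; only the chart outputs survived. As a sketch of how those averages could be produced from the columns built above, here is the idea run against a small synthetic stand-in for dftest so the snippet is self-contained (the real dftest comes from get_data; strftime('%H:%M') is an equivalent shortcut to the fix() helper):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for dftest: two days of M15 bars with the columns built above
rng = np.random.default_rng(42)
idx = pd.date_range('2013-01-07', periods=192, freq='15min')
dftest = pd.DataFrame({'Bid': 1.30 + np.cumsum(rng.normal(0, 1e-4, len(idx))),
                       'volume': rng.integers(100, 1000, len(idx))}, index=idx)
dftest['abs_pct_return'] = abs(dftest.Bid.pct_change())
dftest['vol_change'] = dftest.volume.pct_change()
dftest = dftest.dropna()
dftest['hour_min'] = dftest.index.strftime('%H:%M')

# Average activity for each 15-minute slot of the day
activity = dftest.groupby('hour_min')[['abs_pct_return', 'vol_change']].mean()
# activity.plot(subplots=True, figsize=(12, 6))  # reproduces the two charts below
```

On real data, the vol_change and abs_pct_return panels are what show the jump in activity around the European open.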

A visual check shows that there is a shift in volume around 06:00 - 08:00 GMT.

In [327]:
[chart: average volume change by 15-minute interval of the day]

Same shift is also the case for absolute returns.

In [328]:
[chart: average absolute return by 15-minute interval of the day]

Also checking some randomly picked days, I can say that the increase in activity starts at 07:00 GMT, so let's take the 07:00 return as the independent variable. The dependent variable is the session return. This can be calculated at 12:45, the last bar before the NY session starts, to isolate the European session, or at 16:45, where the European session ends but overlaps with the US session starting at 13:00. So let's take 12:45 as the session ending and isolate our analysis to the European session.
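With those timestamps fixed, the two variables can be extracted per day roughly as follows. This is a sketch on a synthetic bid series, not my actual pipeline, and it assumes bar timestamps mark bar closes:

```python
import numpy as np
import pandas as pd

# Synthetic M15 bid series standing in for dftest (5 full days)
rng = np.random.default_rng(0)
idx = pd.date_range('2013-01-07', periods=96 * 5, freq='15min')
bars = pd.DataFrame({'Bid': 1.30 + np.cumsum(rng.normal(0, 1e-4, len(idx)))}, index=idx)

pre_open = bars.at_time('06:45')['Bid'].values    # close of the last Asian-session bar
euro_open = bars.at_time('07:00')['Bid'].values   # close of the 07:00 bar (session opening)
euro_close = bars.at_time('12:45')['Bid'].values  # last bar before the NY session

days = pd.DataFrame({
    'opening_return': euro_open / pre_open - 1,    # independent variable
    'session_return': euro_close / euro_open - 1,  # dependent variable
}, index=bars.at_time('07:00').index.date)
days['opening_up'] = days.opening_return > 0
days['session_up'] = days.session_return > 0
```

One row per day, with the two directions ready to be cross-tabulated later.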

At this point, I have an observation, I have a question and I have defined my testable hypothesis along with my dependent and independent variables. I have also validated my observation with data. So what is next? I now need to set up an experiment to test my hypothesis.

To make the picture clear, I also need to point out that the hypothesis we have created in the general flow of the scientific method is not the same as the hypothesis we use in the context of statistical tests. To be able to link our hypothesis with hypothesis testing in the statistical sense, we need an additional layer of refinement. However, before doing this, I will try to provide a high-level view of probability theory in relation to our example.

Why do we need probability theory? Because it defines the mathematical model for handling uncertain situations like the one we have here. If we are willing to answer our question with data by testing our hypothesis, we need to set up an experiment and run it with a probabilistic model in hand. To be able to create a probabilistic model, we need to understand its rules.


A probabilistic model requires a properly defined sample space and a probability law, and these two should be in line with certain rules and axioms. This means that we need to make clear all possible outcomes (the sample space) and the probabilities of these outcomes (the probability law), and follow some rules while doing so.

The sample space: In probability language, calculating the opening and closing returns for a day is called an observation. Here the experiment is a sequential observation: looking at the returns once at session opening and once at session closing for a specific number of days. I can define my experiment as follows: "observe the opening and closing directions of EURUSD for 1000 days".

The number of possible outcomes of these observations for a given day is 4: open up - close up, open up - close down, open down - close up and open down - close down. The set of all possible outcomes is called the sample space; in our case, counting day-outcomes, there are 4000 of them, 4 per day for 1000 days. (Strictly speaking, the sample space of the full 1000-day experiment has 4^1000 elements, one for each possible sequence of days, but counting day-outcomes keeps the arithmetic simple.) The important thing here is that the sample space is not a set of actual observations; it is the set of all possible observations. We are not experimenting with anything yet, we are just defining the space in which we will be experimenting. A subset of the sample space, a collection of outcomes, is called an event. So getting open up and close down on a given day is an event for which we can calculate probabilities. In summary, we have an experiment with 4000 possible day-outcomes that make up the sample space.

A sample space should be mutually exclusive and collectively exhaustive. Mutual exclusivity in our case means that, on a given day, open up and open down cannot happen at the same time. Collectively exhaustive means that all possible outcomes are defined in the sample space, and on a given day we cannot see any outcome other than open up, open down, close up and close down. For example, if a day comes along where the 07:00 close is the same as the 06:45 close, meaning the opening was flat, this outcome is not included in the sample space, hence our sample space is not collectively exhaustive. We need to define up or down with a <= or a >= sign to cover all possibilities.
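As a sketch, folding the flat case into one of the two labels makes the outcomes collectively exhaustive while keeping them mutually exclusive:

```python
def direction(prev_close, close):
    """Label a move 'up' or 'down'; '>=' folds a flat bar into 'up', so every
    possible outcome gets exactly one label: the labels are collectively
    exhaustive and mutually exclusive."""
    return 'up' if close >= prev_close else 'down'

print(direction(1.3000, 1.3000))  # 'up' -- a flat open no longer falls outside the sample space
```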

Once we have properly defined our sample space, we need to start assigning probabilities to different events so we can do some calculations. Remember, events are groupings of outcomes, subsets of the sample space.

The probability law: The probability law is a definition of the likelihood of outcomes or events. Here we assign probabilities to outcomes and events. Let's look at our case. What we are interested in here are the probabilities of four events: open up, open down, close up, close down. Remember, events are groupings of outcomes, so when we say open up, it is the set of all possible outcomes with an open up.

  • the event here is open up
  • the number of possible outcomes that make this event happen is, say, 2000, assuming all outcomes are equally likely
  • the total number of possible outcomes in the sample space is 4000
  • the probability of this event (open up) is 2000/4000 = 50%

This is called the Discrete Uniform Law: with the assumption that all outcomes are equally likely, the probability of an event A is P(A) = number of elements of A / number of elements in the sample space.
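A toy illustration of the discrete uniform law with the numbers above (the equally-likely assumption is baked in by construction):

```python
from fractions import Fraction

# 1000 days x 4 day-outcomes, all assumed equally likely
day_outcomes = [(day, o, c) for day in range(1000)
                for o in ('up', 'down') for c in ('up', 'down')]
event_open_up = [x for x in day_outcomes if x[1] == 'up']

p = Fraction(len(event_open_up), len(day_outcomes))
print(len(event_open_up), len(day_outcomes), p)  # 2000 4000 1/2
```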

A probability law should obey the following rules:

  • the probability of an event should be non-negative. This makes sense, as an event is a collection of possible outcomes, and if an outcome is possible it should have positive probability. In our example, there should be a positive probability for each outcome on a given day, as there will inevitably be a direction at the session opening and closing
  • the probability of the union of disjoint events is the sum of the probabilities of those events. Disjoint events are events that do not share common outcomes. In our experiment, if we define open up as an event and open down as an event, these are disjoint
  • the total probability of the entire sample space should add up to 1. In our experiment, the probabilities of all 4000 possible day-outcomes should sum to 1
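These rules can be checked mechanically for any candidate probability assignment; a sketch with made-up numbers:

```python
# A candidate probability law over one day's four outcomes (numbers are made up)
probs = {('up', 'up'): 0.30, ('up', 'down'): 0.20,
         ('down', 'up'): 0.25, ('down', 'down'): 0.25}

assert all(p >= 0 for p in probs.values())     # non-negativity
assert abs(sum(probs.values()) - 1.0) < 1e-12  # the whole space sums to 1

# Additivity over disjoint events: 'open up' is the union of two disjoint outcomes
p_open_up = probs[('up', 'up')] + probs[('up', 'down')]
print(p_open_up)  # 0.5
```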

Remember, we are just setting up the theory here; we have not made any observations yet, we are just making some assumptions and trying to define some rules. If you are wondering why we do not just go and count the days with an open up and divide by 1000 to come up with a probability: that is the actual experiment, and by itself it is not something you can use for generalization. This is why we make all these equally-likely kinds of assumptions. The rules governing the sample space and the probability law are the same for all experiments. We are not doing any counting yet.

So, what is the link between probability theory and the experiment at hand; how do we actually do the experiment? The link is the concept of random variables. By definition, a random variable is a real-valued function of the outcome of an experiment. I hate such definitions. What does this mean? In our case, let's define a random variable which takes the value of one if we observe an open up and the value of zero if we observe an open down. Here the random variable takes the direction of the open (which is my observation) as an input and gives us a real value, one or zero. Give a random variable an observation, and it gives you a number. Why do we need a number? Because we know how to deal with numbers. A random variable maps the sample space to real numbers, given an observation and a definition of how this mapping should be done. By the way, I have just invented the Bernoulli random variable; it is this easy to be famous.
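In code, such a random variable is nothing more than a function from an observation to a number; a sketch (the fair-coin simulation is purely illustrative):

```python
import random

def X(opening):
    """Bernoulli random variable: 1 for an 'up' open, 0 for a 'down' open."""
    return 1 if opening == 'up' else 0

# Simulate 1000 observed opens under a fair-coin assumption
random.seed(7)
opens = [random.choice(['up', 'down']) for _ in range(1000)]
xs = [X(o) for o in opens]

# Once the observation is a number, the usual machinery applies:
# the sample mean of X estimates P(open up)
print(sum(xs) / len(xs))
```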


Ok, I am out of ammo. This post will be followed by posts on the following:

  • go through joint/marginal/conditional probabilities and independence and calculate these for our example
  • build up on random variables and describe where probability distributions sit in the picture using our example
  • link these back to our hypothesis and define the actual statistical hypothesis test
  • talk about logistic regression as one of the tools for modelling binary outcomes
  • revisit all this discussion with Bayesian thinking
  • finalize the scientific method cycle and come up with a trading rule (maybe, if we reject the null hypothesis)
  • backtest the trading rule and close the discussion

If you find this post valuable, please also like it on Quantocracy; just click the arrow to the left of the post!


  1. What steps did you take to learn Python and mathematics for quant trading?

  2. Hi Jacob,

    When I first started, it was not clear to me what to focus on: which markets, which instruments, which trading styles, which trading frequency; all were question marks. So I did a bulk reading and watching of everything I found related, for about 6 months. Then this cloud of information started to make sense, and I decided where to focus: intraday FX for majors, fully quantitative and online. The reasons: FX trading is low cost, liquid, open 24/5, has no constraints on short selling, and it is easy to connect to and automate. Very hard to find edges, btw. I also made a decision to use price data only and to avoid advanced methods like machine learning, which tend to overfit.

    When you make these choices, it is much easier to come up with a workout plan for learning Python and math/stats. For example, I did not include much on stochastic processes and options pricing theory, machine learning, or high-frequency strategies (I can never have the infrastructure anyway).

    So the steps you may choose to follow:
    - go to MIT OCW and start with Prof. Tsitsiklis on probability; watch and digest all the videos and lecture notes, it is also on YouTube
    - do the same for time series analysis with Prof. Mikusheva; links are available on my blog
    - after doing these, start reading QuantStart, Ernest Chan and Jonathan Kinlay to start linking theory and practice. You will find great Python coding on QuantStart
    - in parallel, start listening to the Chat With Traders podcast
    - in parallel, start doing basic coding in Python, but do not go deep into the data science libraries yet; keep it basic, just google "introduction to Python", with an emphasis on the pandas and numpy libraries
    - put in an average of 3 hours of work per day for 6 months
    - then make your choice: where will you focus?
    - then create a workout plan tailored to your choice

    I hope this helps.


  3. Hi Jacob, I try to do similar research to yours on index futures using an application called Investor R/T, which has nice statistical features.

    What are your thoughts on the probability of consecutive events? Say you have an event which occurs 25% of the time over some large sample size. This event has now occurred 3 days in a row; is the probability that it will occur a 4th day in a row 0.25^4?

    I've been thinking about this a lot and I'm not sure what the answer is. I think it depends on whether the events are dependent or independent of each other; however, I feel that isn't easy to answer either. Perhaps when markets trend the behavior is more "dependent", and when they are consolidating/noisy the behavior is more independent.

    1. Hi Stefan,

      So you have an event with a probability of 0.25, something you observe daily, and you are looking at the probability of it happening 4 days in a row. If there is independence, you just multiply the individual probabilities as you are suggesting: 0.25^4 is the probability of seeing four occurrences in a row. Note, though, that the probability of a 4th occurrence given that the first three have already happened is still just 0.25 under independence.

      Two events are said to be independent if knowing one does not provide information about the other. Like, knowing the outcome of your observation on day 1 does not provide additional information about your observation on day 2. Whether you know day 1 or not, the day 2 probability is the same. In probability theory terms, the conditional probability is equal to the unconditional probability.

      When you say "when the markets are trending", you are imposing a new condition, and this is the concept of conditional independence or dependence. When you say markets are trending, your sample space is narrowed down to the possible outcomes within trending markets, so formerly independent events may be dependent now, or vice versa; you need to check the following condition:

      P(A ∩ B | C) = P(A | C) × P(B | C)

      Check this out https://www.youtube.com/watch?v=19Ql_Q3l0GA

      Btw, I do not know the problem at hand, but there may also be dependence, like autocorrelation, in the time series data.

      At some point you have to make assumptions to make life easier :)
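To make Stefan's question concrete (a sketch): under independence, 0.25^4 is the unconditional probability of four occurrences in a row, while the probability of a fourth occurrence given that three have already happened is still 0.25:

```python
from fractions import Fraction

p = Fraction(1, 4)

# Unconditional probability of the event on four consecutive independent days
p_four_in_a_row = p ** 4  # 1/256

# Conditional probability of day 4 given days 1-3 already occurred:
# P(4 in a row) / P(3 in a row) collapses back to p under independence
p_fourth_given_three = (p ** 4) / (p ** 3)

print(p_four_in_a_row, p_fourth_given_three)  # 1/256 1/4
```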



    2. Correction:

      Two events are said to be independent if knowing one does not provide information on the other.