Python for Algorithmic Trading

Using Machine & Deep Learning for Algorithmic FX Trading

Dr Yves J Hilpisch | The AI Machine

http://aimachine.io | http://twitter.com/dyjh

Imports

In [1]:
import math
import tpqoa
import cufflinks
import numpy as np
import pandas as pd
from pylab import plt
plt.style.use('seaborn')
%matplotlib inline
cufflinks.set_config_file(offline=True)
In [2]:
import warnings
warnings.simplefilter('ignore')

Oanda for FX Trading

Why Oanda and why FX?

  • technology and APIs come first in algorithmic trading
  • proper APIs and good Python wrapper packages
  • low transaction costs and simple cost model
  • fully symmetric markets (long/short)
  • high liquidity and long trading hours
  • high leverage possible but not required
  • all typical order types available (trailing stop, etc.)
  • basically all single-instrument strategies are straightforward to trade
  • pair and basket strategies also possible
  • free data — both historical and streaming
  • full data history for all instruments
  • good trading apps (phone, pad, mac, win, browser)
  • ...

The Data

In [3]:
api = tpqoa.tpqoa('dyjh.cfg')
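The credentials are read from a local configuration file. As a minimal sketch (the values shown here are placeholders, not the original credentials), a tpqoa configuration file typically has the following structure:

[oanda]
account_id = XYZ-XYZ-XXXXXXXX-XXX
access_token = YOUR_ACCESS_TOKEN
account_type = practice

The account_type entry is either practice or live.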
In [4]:
api.get_instruments()[:4]
Out[4]:
[('CAD/JPY', 'CAD_JPY'),
 ('Platinum', 'XPT_USD'),
 ('SGD/CHF', 'SGD_CHF'),
 ('CAD/CHF', 'CAD_CHF')]
In [5]:
sym = 'EUR_USD'
In [6]:
raw_a = api.get_history(sym, '2019-02-04', '2019-02-06', 'M1', 'A')  # one-minute bars, ask ('A') prices
In [7]:
raw_b = api.get_history(sym, '2019-02-04', '2019-02-06', 'M1', 'B')  # one-minute bars, bid ('B') prices
In [8]:
raw_a.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2804 entries, 2019-02-04 00:00:00+00:00 to 2019-02-05 23:59:00+00:00
Data columns (total 6 columns):
c           2804 non-null float64
complete    2804 non-null bool
h           2804 non-null float64
l           2804 non-null float64
o           2804 non-null float64
volume      2804 non-null int64
dtypes: bool(1), float64(4), int64(1)
memory usage: 134.2 KB
In [9]:
sel = list('c')  # column selection list, i.e. ['c']
In [10]:
spread = (raw_a['c'] - raw_b['c']).mean()
spread  # average spread
Out[10]:
0.00013838088445078468
In [11]:
data = ((raw_a[sel] + raw_b[sel]) / 2)  # mid prices as the average of ask and bid
In [12]:
ptc = spread / data['c'].mean()
ptc  # mean spread relative to mean mid price
Out[12]:
0.00012104865307308787
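To put this number into perspective, the average spread can be compared with the size of a typical one-minute move in the mid price; the following lines are an illustrative sketch, not part of the original analysis:

avg_move = data['c'].diff().abs().mean()  # average absolute one-minute change of the mid price
avg_move / spread  # how many average spreads a typical one-minute move covers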
In [13]:
data.head()
Out[13]:
c
time
2019-02-04 00:00:00+00:00 1.145710
2019-02-04 00:01:00+00:00 1.145770
2019-02-04 00:02:00+00:00 1.145650
2019-02-04 00:03:00+00:00 1.145555
2019-02-04 00:04:00+00:00 1.145360
In [14]:
data['c'].plot();

Efficient Markets

In [15]:
lags = 7
In [16]:
cols = []
for lag in range(1, lags + 1):
    col = 'lag_{}'.format(lag)
    data[col] = data['c'].shift(lag)  # lagged prices
    cols.append(col)
In [17]:
data.dropna(inplace=True)
In [18]:
reg = np.linalg.lstsq(data[cols], data['c'], rcond=-1)[0]  # OLS regression of the price on its lagged values
In [19]:
np.set_printoptions(precision=4)
In [20]:
reg
Out[20]:
array([ 9.8156e-01, -6.1465e-04, -1.4863e-02,  5.5695e-02, -1.4918e-02,
       -2.9290e-02,  2.2428e-02])
In [21]:
pd.DataFrame(reg, index=cols).plot(kind='bar');
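The coefficient of roughly 0.98 on lag_1, with all other coefficients close to zero, is what the random walk hypothesis suggests: the best predictor of the next price is essentially the current price. As a quick illustrative check (a sketch added here, not in the original notebook), the in-sample fit of the regression can be computed as follows:

pred = np.dot(data[cols], reg)  # fitted prices from the lag regression
r2 = 1 - ((data['c'] - pred) ** 2).sum() / ((data['c'] - data['c'].mean()) ** 2).sum()
r2  # in-sample R-squared (illustrative only)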

Patterns Defined

Investopedia writes:

Chart patterns look at the big picture and help to identify trading signals — or signs of future price movements.

The theory behind chart patterns is based on this assumption — that certain patterns consistently reappear and tend to produce the same outcomes.

The process of identifying chart patterns based on these criteria can be subjective in nature, which is why charting is often seen as more of an art than a science.

In [22]:
data['r'] = np.log(data['c'] / data['c'].shift(1))
In [23]:
cols = []
for lag in range(1, lags + 1):
    col = 'lag_{}'.format(lag)
    data[col] = data['r'].shift(lag)  # lagged returns
    cols.append(col)
In [24]:
data.dropna(inplace=True)
In [25]:
data[cols] = np.where(data[cols] > 0, 1, -1)  # reduce the lagged returns to their sign
data[cols] = data[cols].astype(int)
In [26]:
data.head(5)
Out[26]:
c lag_1 lag_2 lag_3 lag_4 lag_5 lag_6 lag_7 r
time
2019-02-04 00:15:00+00:00 1.145790 1 1 1 -1 -1 1 1 1.309149e-05
2019-02-04 00:17:00+00:00 1.145790 1 1 1 1 -1 -1 1 2.220446e-16
2019-02-04 00:18:00+00:00 1.145870 1 1 1 1 1 -1 -1 6.981838e-05
2019-02-04 00:19:00+00:00 1.145810 1 1 1 1 1 1 -1 -5.236333e-05
2019-02-04 00:20:00+00:00 1.145655 -1 1 1 1 1 1 1 -1.352846e-04
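Since every lagged return is reduced to its sign, each row now encodes one of 2 ** 7 = 128 possible up/down patterns. A small illustrative check (not part of the original notebook):

2 ** lags  # number of possible binary patterns with 7 lags
data[cols].drop_duplicates().shape[0]  # patterns actually observed in the sample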

Frequency Approach

Simple

In [27]:
data['d'] = np.sign(data['r']).astype(int)  # market direction: +1 (up), 0 (flat), -1 (down)
In [28]:
data.groupby(cols[:2])['d'].count()
Out[28]:
lag_1  lag_2
-1     -1       779
        1       709
 1     -1       709
        1       592
Name: d, dtype: int64
In [29]:
data.groupby(cols[:2] + ['d'])['r'].count()
Out[29]:
lag_1  lag_2  d 
-1     -1     -1    365
               0     39
               1    375
        1     -1    339
               0     36
               1    334
 1     -1     -1    341
               0     41
               1    327
        1     -1    293
               0     35
               1    264
Name: r, dtype: int64
In [30]:
(data.groupby(cols[:2] + ['d'])['r'].count() / len(data) * 100).round(2)
Out[30]:
lag_1  lag_2  d 
-1     -1     -1    13.09
               0     1.40
               1    13.45
        1     -1    12.15
               0     1.29
               1    11.98
 1     -1     -1    12.23
               0     1.47
               1    11.72
        1     -1    10.51
               0     1.25
               1     9.47
Name: r, dtype: float64
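The frequencies above are relative to the whole sample. Normalizing the counts within each two-lag pattern instead gives the conditional probabilities of a down, flat, or up move; this is a sketch added here, not from the original notebook:

counts = data.groupby(cols[:2] + ['d'])['r'].count()
probs = counts / counts.groupby(level=[0, 1]).transform('sum')  # condition on (lag_1, lag_2)
probs.round(3)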

Advanced

In [31]:
cols[:3] + ['d']
Out[31]:
['lag_1', 'lag_2', 'lag_3', 'd']
In [32]:
grouped = data[cols[:3] + ['d']].groupby(cols[:3] + ['d'])
In [33]:
res = grouped['d'].size().unstack()  # pattern counts with the direction d as columns
In [34]:
res
Out[34]:
d                   -1   0    1
lag_1 lag_2 lag_3
-1    -1    -1     184  25  195
-1    -1     1     181  14  180
-1     1    -1     186  21  175
-1     1     1     153  15  159
 1    -1    -1     176  21  178
 1    -1     1     165  20  149
 1     1    -1     148  18  161
 1     1     1     145  17  103
In [35]:
res['prob_up'] = (res[1] / (res[1] + res[-1])).round(3)
res['prob_down'] = 1 - res['prob_up']
In [36]:
res
Out[36]:
d                   -1   0    1  prob_up  prob_down
lag_1 lag_2 lag_3
-1    -1    -1     184  25  195    0.515      0.485
-1    -1     1     181  14  180    0.499      0.501
-1     1    -1     186  21  175    0.485      0.515
-1     1     1     153  15  159    0.510      0.490
 1    -1    -1     176  21  178    0.503      0.497
 1    -1     1     165  20  149    0.475      0.525
 1     1    -1     148  18  161    0.521      0.479
 1     1     1     145  17  103    0.415      0.585
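These frequencies can be turned into a naive prediction rule: predict +1 whenever the observed probability of an up move for a three-lag pattern exceeds 0.5 and -1 otherwise. A minimal sketch, with the column names p_freq and s_freq introduced here for illustration:

pattern_pred = res['prob_up'].apply(lambda p: 1 if p > 0.5 else -1)  # one prediction per pattern
data['p_freq'] = [pattern_pred.loc[tuple(v)] for v in data[cols[:3]].values]
data['s_freq'] = data['r'] * data['p_freq']  # in-sample strategy returns of the frequency rule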

Classification

In [37]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

Logistic Regression

In [38]:
lr = LogisticRegression(solver='lbfgs', multi_class='auto')
In [39]:
lr.fit(data[cols], data['d'])
Out[39]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='auto',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
In [40]:
y_lr = lr.predict(data[cols])
In [41]:
accuracy_score(y_lr, data['d'])
Out[41]:
0.4836859089279312

Gaussian Naive Bayes

In [42]:
nb = GaussianNB()
In [43]:
nb.fit(data[cols], data['d'])
Out[43]:
GaussianNB(priors=None, var_smoothing=1e-09)
In [44]:
y_nb = nb.predict(data[cols])
In [45]:
accuracy_score(y_nb, data['d'])
Out[45]:
0.4808174973108641

Support Vector Machine

In [46]:
kernels = ['linear', 'rbf', 'poly']
In [47]:
models = {}
for kernel in kernels:
    svm = SVC(C=5, kernel=kernel, gamma='auto')
    svm.fit(data[cols], data['d'])
    y_svm = svm.predict(data[cols])
    acc = accuracy_score(y_svm, data['d'])
    print('kernel: {:8s} | accuracy: {:6.3f}'.format(kernel, acc))
    models[kernel] = svm
kernel: linear   | accuracy:  0.485
kernel: rbf      | accuracy:  0.551
kernel: poly     | accuracy:  0.513
In [48]:
models
Out[48]:
{'linear': SVC(C=5, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 'rbf': SVC(C=5, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 'poly': SVC(C=5, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False)}

Deep Neural Network

In [49]:
dnn = MLPClassifier(hidden_layer_sizes=3 * [96], activation='relu',
                    max_iter=2500, verbose=False)
In [50]:
%time dnn.fit(data[cols], data['d'])
CPU times: user 9.95 s, sys: 38.9 ms, total: 9.99 s
Wall time: 1.68 s
Out[50]:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=[96, 96, 96], learning_rate='constant',
       learning_rate_init=0.001, max_iter=2500, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)
In [51]:
y_dnn = dnn.predict(data[cols])
In [52]:
accuracy_score(y_dnn, data['d'])
Out[52]:
0.5532448906418072
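All accuracy figures so far are in-sample fits. As a minimal sketch of an out-of-sample check (added here, not part of the original notebook), the data set could be split sequentially into a training and a test part:

split = int(len(data) * 0.7)  # sequential split to respect the time order
train, test = data.iloc[:split], data.iloc[split:]
model = MLPClassifier(hidden_layer_sizes=3 * [96], activation='relu', max_iter=2500)
model.fit(train[cols], train['d'])
accuracy_score(model.predict(test[cols]), test['d'])  # out-of-sample accuracy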

Vectorized Backtesting

NO TRANSACTION COSTS | ONLY IN-SAMPLE PERFORMANCE

In [53]:
data['p'] = models['rbf'].predict(data[cols])
data['s_svm'] = data['r'] * data['p']
In [54]:
(data['p'].diff() != 0).sum()
Out[54]:
1268
In [55]:
data['p'] = dnn.predict(data[cols])
data['s_dnn'] = data['r'] * data['p']
In [56]:
(data['p'].diff() != 0).sum()
Out[56]:
1444
In [57]:
data[['s_svm', 's_dnn', 'r']].cumsum().apply(np.exp).plot();
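
The caveat above matters: Out [54] and Out [56] show that both strategies change position more than a thousand times over two trading days, and every change costs roughly the proportional spread ptc computed earlier. A minimal sketch of how such costs could be deducted from the DNN strategy (the column name s_dnn_tc is introduced here for illustration; data['p'] still holds the DNN predictions from In [55]):

trades = data['p'].diff().fillna(0) != 0  # bars on which the position changes
data['s_dnn_tc'] = data['s_dnn'] - trades * ptc  # subtract proportional costs per position change
data[['s_dnn', 's_dnn_tc', 'r']].cumsum().apply(np.exp).plot();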