Prediction Idol Competition 

Data for the competition was obtained from the Edgar database, which included tens of daily variables and over 200 fundamental variables. The database goes back to 1998 and covers over 30,000 stocks. In addition to the available parameters, students implemented 12 technical indicators such as MACD, RSI, and ROC, and created extra database fields for each of these indicators for each security and date. We restricted ourselves to the NASDAQ and NYSE stock exchanges; overall we have data covering 10,000 stocks over the last ten years.

 Competition Rules

Your job is to define the strongest possible predictors of growth of a stock over all the past stock histories available in our database. In other words, your predictors must have the best average "return" on the data available in the database. Of course, if history is a predictor of the future, your predictors may be useful for future data as well, but we are not making any such claims. This is strictly about the best performance to date. The rules you come up with will of course suffer from overfitting the data. They are offered "as is", holding for the data available so far without any guarantees that they will hold in the future. This is just the beginning!

Predictors will be CONJUNCTIONS of conditions expressed as (Attribute Operator Value), where Attribute is one of the technical or fundamental variables, Operator is an equality (=), greater-than (>), or less-than (<) comparison, and Value is an element of the attribute's domain. Attributes should not include TIME (neither as quarter nor as week, day, etc.). Attributes should also not include names of securities or their derivatives (e.g. all stocks beginning with “AK”).
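The rule format above can be illustrated with a minimal Python sketch that assembles a conjunction of (Attribute Operator Value) conditions into a query of the same shape as the winning predictors listed in the results. The helper name build_predictor and the contents of FORBIDDEN_ATTRIBUTES are assumptions made for illustration; the actual names of time and security columns in the competition database are not given here.

```python
from typing import List, Tuple

# Hypothetical names: Condition, FORBIDDEN_ATTRIBUTES and build_predictor are
# illustrative only and are not part of the competition database.
Condition = Tuple[str, str, float]  # (attribute, operator, value)

ALLOWED_OPERATORS = {"=", ">", "<"}
# Assumed examples of time-like or name-like columns that the rules exclude.
FORBIDDEN_ATTRIBUTES = {"DATE", "QUARTER", "WEEK", "DAY", "TICKER", "NAME"}


def build_predictor(conditions: List[Condition], table: str, ret_col: str) -> str:
    """Return a SQL query whose WHERE clause is the conjunction of the conditions."""
    if not conditions:
        raise ValueError("a predictor needs at least one condition")
    clauses = []
    for attribute, operator, value in conditions:
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"operator not allowed: {operator}")
        if attribute.upper() in FORBIDDEN_ATTRIBUTES:
            raise ValueError(f"attribute not allowed: {attribute}")
        clauses.append(f"{attribute} {operator} {value}")
    where = "\n  AND ".join(clauses)
    return f"SELECT AVG({ret_col}), COUNT(1)\nFROM {table}\nWHERE {where};"


query = build_predictor(
    [("MACD", "<", -0.5), ("STDDEV", ">", 0.54)],
    table="edgar.competition_5d",
    ret_col="ret5d",
)
print(query)
```

Because only conjunctions are allowed, every extra condition can only shrink the set of (security, date) pairs on which the predictor triggers.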


There are 4 categories of competition:

Sprint

Predict stock prices over a horizon of exactly 5 trading days. Your predictor should trigger at least 1,000 times across all possible (security, date) pairs (around 10 million such data points in our database). This means that your predictor will trigger on average at least twice a week (the data covers around 500 weeks).

Uphill Sprint

Predict stock prices over a horizon of exactly 5 trading days. Your predictor should trigger at least 100,000 times across all possible (security, date) pairs (around 10 million such data points in our database). This means that your predictor will trigger on average at least 40 times a day (the data covers around 2,500 trading days).

Long Distance

Predict stock prices over a horizon of exactly 20 trading days. Your predictor should trigger at least 1,000 times across all possible (security, date) pairs.


Uphill Long Distance

Predict stock prices over a horizon of exactly 20 trading days. Your predictor should trigger at least 100,000 times across all possible (security, date) pairs (around 10 million such data points in our database).
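As a quick check of the four categories, the following Python sketch encodes each category's horizon and minimum support and converts a support count into an average trigger rate. The class and function names are hypothetical; the figure of about 500 weeks is the approximate coverage quoted for the Sprint category.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    name: str
    horizon_days: int   # prediction horizon in trading days
    min_support: int    # minimum number of (security, date) triggers


CATEGORIES = [
    Category("Sprint", 5, 1_000),
    Category("Uphill Sprint", 5, 100_000),
    Category("Long Distance", 20, 1_000),
    Category("Uphill Long Distance", 20, 100_000),
]

WEEKS_COVERED = 500  # approximate coverage quoted in the rules


def qualifies(category: Category, support: int) -> bool:
    """A predictor qualifies only if it triggers often enough."""
    return support >= category.min_support


def triggers_per_week(support: int, weeks: int = WEEKS_COVERED) -> float:
    """Average number of triggers per week over the covered period."""
    return support / weeks


for category in CATEGORIES:
    rate = triggers_per_week(category.min_support)
    print(f"{category.name}: at least {rate:g} triggers per week on average")
```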


The list of additional technical indicators is available here: Indicator List. You can use any machine learning methods, from Weka or otherwise. You may also use a hybrid method and tinker with the predictions manually using our database interface, since every predictor is a simple SQL query against the Edgar competition database.
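To show what "every predictor is a simple SQL query" means in practice, here is a small self-contained Python sketch that uses an in-memory SQLite table as a stand-in for the competition database. The rows are made up for illustration, the table carries only three of the real columns, and the evaluate helper is hypothetical; the real database and its interface may differ.

```python
import sqlite3

# In-memory stand-in for the competition table; all rows are made-up toy data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE competition_5d (security TEXT, MACD REAL, STDDEV REAL, ret5d REAL)"
)
rows = [
    ("AAA", -0.8, 0.60, 4.0),
    ("BBB", -0.6, 0.70, 2.0),
    ("CCC", 0.3, 0.90, -1.0),
    ("DDD", -0.9, 0.10, 5.0),
]
conn.executemany("INSERT INTO competition_5d VALUES (?, ?, ?, ?)", rows)


def evaluate(where_clause: str):
    """Return (average 5-day return, support) for a conjunction of conditions."""
    sql = f"SELECT AVG(ret5d), COUNT(1) FROM competition_5d WHERE {where_clause}"
    avg_return, support = conn.execute(sql).fetchone()
    return avg_return, support


avg_return, support = evaluate("MACD < -0.5 AND STDDEV > 0.54")
print(avg_return, support)
```

Tinkering by hand then amounts to editing the WHERE clause and watching how the average return and the support move against each other.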

RESULTS


Summary: 12 students entered 4 predictors each, and we held a live competition in class in which each of the predictors was run in real time against the database. It felt a bit like a  
Below we list medals, names, average returns (5-day or 20-day, depending on the competition), and support: the number of (security, date) pairs on which the predictor triggered.
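The ranking implied by these lists can be sketched as follows: entries that meet the category's minimum support are ordered by average return. This is an illustrative reading of the rules rather than the script used in class; the Entry type and award_medals function are hypothetical, and the sample values are the Sprint results reported on this page.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    student: str
    avg_return: float
    support: int


def award_medals(entries: List[Entry], min_support: int) -> List[Tuple[str, Entry]]:
    """Rank eligible entries by average return and pair the top three with medals."""
    eligible = [e for e in entries if e.support >= min_support]
    ranked = sorted(eligible, key=lambda e: e.avg_return, reverse=True)
    return list(zip(["Gold", "Silver", "Bronze"], ranked))


sprint = [
    Entry("Michael", 16.1467334367557, 1006),
    Entry("Bobby", 16.5884251135487, 1004),
    Entry("Zhiyuan", 16.4994591534424, 1040),
]
medals = award_medals(sprint, min_support=1000)
for medal, entry in medals:
    print(medal, entry.student, entry.avg_return, entry.support)
```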

Sprint

Gold:    Bobby     16.5884251135487, 1004
Silver:  Zhiyuan   16.4994591534424, 1040
Bronze:  Michael   16.1467334367557, 1006

Winning Predictor: "Hodgepodge"

SELECT AVG(ret5d), COUNT(1)
FROM edgar.competition_5d
WHERE EMA1_PRICE > 1.26
  AND STOCHRSI < 0.98
  AND MACD < -0.5
  AND NETWORKINGCAPITAL > -3000000000
  AND INVENTORIESNET > 0
  AND EMA_MACD < 0
  AND STDDEV > 0.54
  AND CANDLE_EVNSTR < 1
  AND BGRBANDS_UPPER > 2.6
  AND EMA2_PRICE > 1.903
  AND PPO_HIST > -7.8
  AND DEGREEOFFINANCIALLEVERAGE > -100;

 



Uphill Sprint

Gold:    Bobby      2.57625426927653, 100072
Silver:  Srividya   2.34234795556897, 100424
Bronze:  Zhiyuan    2.31555101285443, 105665

Winning Predictor: "Technicalities"

SELECT AVG(ret5d), COUNT(1)
FROM edgar.competition_5d
WHERE BGRBANDS_UPPER > 1.579
  AND STDDEV > 0.129
  AND CANDLE_PIERC < 1
  AND ROC > -98
  AND CANDLE_MRNSTR < 1
  AND CANDLE_EVNSTR < 1
  AND EMA1_PRICE > 0.9981478
  AND AROONDOWN > 5
  AND EMA2_PRICE > 1.3029064
  AND AROONUP < 100;

Long Distance

Gold:    Michael   38.5623798035737, 1029
Silver:  Zhiyuan   34.5523125083674, 1155
Bronze:  Qiang     32.7588295045704, 1028

Winning Predictor:

SELECT COUNT(*), AVG(ret20d)
FROM edgar.competition_20d p
WHERE StochK < 23.2932
  AND RSI > 16.6412
  AND RSI < 35.7178
  AND WilliamsR < -88.961
  AND CCI > -297.677
  AND CCI < -108.664
  AND StochRSI < 0.4028
  AND StdDev > 0.19351
  AND BgrBands_Lower > 0.55275
  AND BgrBands_Lower < 1.3425
  AND BgrBands_Upper > 1.8106
  AND MACD < -0.31276
  AND EMA_MACD < -0.28368
  AND PVO > -13.2484
  AND PVO < 40.218
  AND ROC < -42.7578
  AND PPO < -18.2597
  AND PPO_EMA > -48.9628
  AND PPO_EMA < -15.7338
  AND PPO_HIST > -2.8042
  AND PPO_HIST < 8.7226;

 



Uphill Long Distance

Gold:    Bobby      7.8265423375417, 100002
Silver:  Qiang      6.22577383970266, 100542
Bronze:  Srividya   6.13336246454681, 100047


Winning Predictor: "Volitile Employees"

SELECT AVG(ret20d), COUNT(1)
FROM edgar.competition_20d
WHERE NUMBEROFEMPLOYEES > 0
  AND STDDEV > 0.079597
  AND PERIODLENGTH > 2;

CONCLUSIONS and OBSERVATIONS

Given the past, which rules performed the best? Financial writers often make statements about bullish or bearish interpretations of various indicators such as MACD or Bollinger Bands. These statements typically start with the word "usually". With the data we have, we can substantiate or refute such statements. For example, is MACD > 0 a bullish indicator? We can look at the past data and see whether securities that satisfied MACD > 0 had a positive or negative return afterwards. In other words, we can test such claims against the historical record.
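A check of this kind can be sketched in a few lines of Python: split the observations by whether the condition holds and compare the average later return in each group. The observations below are made-up toy values, not figures from the competition database, and average_return_when is a hypothetical helper.

```python
from statistics import mean
from typing import Callable, List, Optional, Tuple

# Made-up (MACD value, later return) pairs used only to illustrate the check.
observations: List[Tuple[float, float]] = [
    (0.4, 1.2),
    (1.1, -0.5),
    (-0.3, 0.8),
    (0.7, 2.0),
    (-1.2, -1.5),
    (0.2, 0.3),
]


def average_return_when(
    condition: Callable[[float], bool], data: List[Tuple[float, float]]
) -> Tuple[Optional[float], int]:
    """Return (average later return, support) over rows where the condition holds."""
    selected = [ret for value, ret in data if condition(value)]
    if not selected:
        return None, 0
    return mean(selected), len(selected)


bullish_avg, bullish_n = average_return_when(lambda macd: macd > 0, observations)
other_avg, other_n = average_return_when(lambda macd: macd <= 0, observations)
print("MACD > 0 :", bullish_avg, bullish_n)
print("MACD <= 0:", other_avg, other_n)
```

Reporting the support next to each average matters here for the same reason it does in the competition: an average over a handful of rows says little.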

We will monitor the rules generated in class against incoming data. We will also address the overfitting concerns using newly proposed notions of rule sensitivity and limited cross-validation, which we believe are appropriate for a temporal data feed. Stay tuned.