COS513-Finance

Raw data link on GDELT site

Google Drive Project Folder

Packages

Note: some of our machines have libc version 2.12, so we need to use numpy<=1.7.1

scipy gensim nltk pandas iso3166 matplotlib sklearn

Ideas

  • PRIORITY #1: Get training to run on ionic.princeton.edu
  • PRIORITY #2: Remove the glob argument from sample for simplicity.
  • Scale the date-summary array in glm.py
  • Add another feature (number of events per day, N). Also check that scaling down by N doesn't make the feature floats too small; handle the N = 0 case
  • Cross-validation (see paper)
  • Adaboost (see paper)
  • Extend W2V corpus to have more words, add it as a feature
  • Currently, the clustering columns (topic-columns) are not scaled/normalized (they were before, but only on a single-day scope, which is inaccurate). Consider using the random sample from clustering.py to generate a pre-processing pseudo-normalization that scales the columns by a sample mean and sample sd (note the MLE bias correction there), both prior to generating the k-means clusters and before doing a classification (a sketch appears after this list).
  • Smarter cluster sampling - not just 150 lines from each day... have a Python script do this?
  • Try other commodities
  • Try SVM classifier
  • Other linear classifiers: http://scikit-learn.org/stable/modules/linear_model.html - GLM, RANSAC, Bayesian
  • Try linear regression on the return proportions (p[t+1]-p[t])/p[t] in glm.py (see the regression sketch after this list)
  • Use the HMM for up/down classification, and add its output as another feature
  • Smarter clustering: GMM, IGMM, HDP, SNGP. Two approaches here:
    • Apply the dynamic clustering algorithm as a nonparametric model to our random sample. This is most similar to our current pipeline. It would produce an "intrinsic" number of clusters, but this would still be a static clustering pipeline in that the model will eventually grow stale.
    • Take on a fully Bayesian, fully dynamic approach. Add priors to all hyperparameters (and non-hyperparameters - this covers both the linear model and the clusters), then start from a 0- or 1-cluster prior. Run through our training data, making daily updates to the model (this should result in incremental slope changes).
      • Reasoning: if we have a dynamic number of clusters, then we need a variable number of input features to our linear model. To support this, we need a Bayesian regression that learns every day, since it can adapt to new slopes as it gains more clusters. This is a completely different way to train, but it is fully dynamic: there is no meaningful transition from a static training set to a test set (and with priors over the regularization constants there is no need for validation at all). Instead, every day we just do a Bayesian update on the number of clusters, the slopes associated with the features for a given cluster, etc. (a toy sketch of such a daily update appears after this list).
  • Get more data, for years <= 2013. Need to convert to YYYYMMDD.export.CSV format. On historical data, we'll need to check the DATE_REPORTED column to recover the same info. Make sure preprocessing.py can handle these differences.
  • Introduce periodicity features (this is day i of period P - how do we learn P?); see the periodicity sketch after this list.
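
A minimal sketch of the pseudo-normalization idea above, assuming the random sample and the full date-summary matrix are NumPy arrays and that the topic-columns start at a hypothetical TOPIC_COL offset (the names here are illustrative, not taken from the actual pipeline):

```python
import numpy as np

TOPIC_COL = 4  # hypothetical offset of the first topic-column

def fit_pseudo_normalizer(sample):
    """Estimate per-column mean and sd from the random clustering sample.

    ddof=1 gives the unbiased variance estimate, i.e. the bias
    correction to the MLE mentioned above.
    """
    mu = sample[:, TOPIC_COL:].mean(axis=0)
    sd = sample[:, TOPIC_COL:].std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # constant columns: leave unscaled rather than divide by zero
    return mu, sd

def apply_pseudo_normalizer(X, mu, sd):
    """Scale the topic-columns of X with the sample statistics."""
    X = X.astype(float)
    X[:, TOPIC_COL:] = (X[:, TOPIC_COL:] - mu) / sd
    return X
```

Fitting once on the sample and reusing the same (mu, sd) both before k-means and before classification keeps the two stages on a common scale, unlike the old per-day scaling.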
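
For the return-proportion regression, a sketch assuming prices is a 1-D array of daily prices aligned row-for-row with the date-summary feature matrix X; plain scikit-learn LinearRegression stands in for whatever glm.py actually fits:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def daily_returns(prices):
    """Return proportions (p[t+1] - p[t]) / p[t] for t = 0 .. T-2."""
    p = np.asarray(prices, dtype=float)
    return (p[1:] - p[:-1]) / p[:-1]

def fit_return_model(X, prices):
    """Regress the next-day return on day t's features."""
    y = daily_returns(prices)   # length T-1
    model = LinearRegression()
    model.fit(X[:-1], y)        # day-t features predict the return into day t+1
    return model
```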
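
The fully dynamic idea could look roughly like the toy conjugate update below: a Gaussian prior over the slopes, a fixed assumed noise variance, and a grow() step that appends a fresh prior dimension whenever a new cluster is born. This is only an illustration of the daily-update mechanics, not the model we would actually use:

```python
import numpy as np

class DailyBayesLinReg:
    """Gaussian-prior linear model, updated one day at a time."""

    def __init__(self, noise_var=1.0, prior_var=10.0):
        self.noise_var = noise_var     # assumed, fixed observation noise
        self.prior_var = prior_var     # variance of the N(0, prior_var) slope prior
        self.mean = np.zeros(0)        # posterior mean of the slopes
        self.prec = np.zeros((0, 0))   # posterior precision (inverse covariance)

    def grow(self, n_new):
        """Append n_new features, e.g. columns for newly-born clusters."""
        d = self.mean.size + n_new
        mean = np.zeros(d)
        mean[:self.mean.size] = self.mean
        prec = np.eye(d) / self.prior_var
        prec[:self.prec.shape[0], :self.prec.shape[1]] = self.prec
        self.mean, self.prec = mean, prec

    def update(self, x, y):
        """One day's conjugate update with feature row x and target y."""
        x = np.asarray(x, dtype=float)
        if x.size > self.mean.size:
            self.grow(x.size - self.mean.size)
        new_prec = self.prec + np.outer(x, x) / self.noise_var
        rhs = np.dot(self.prec, self.mean) + x * y / self.noise_var
        self.mean = np.linalg.solve(new_prec, rhs)
        self.prec = new_prec
```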
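
Finally, one cheap encoding of "this is day i of period P", assuming a candidate period is supplied from outside (e.g. by sweeping a few values such as 7, 30, or 365 and keeping whichever helps held-out accuracy); the sine/cosine encoding is an assumption, not something already in the pipeline:

```python
import numpy as np

def periodicity_features(day_indices, period):
    """Smooth 2-D encoding of where each day falls inside a period of length `period`."""
    phase = 2.0 * np.pi * (np.asarray(day_indices, dtype=float) % period) / period
    return np.column_stack([np.sin(phase), np.cos(phase)])
```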
