There’s a new unsupervised learning technique in town, and it’s called the AutoEncoder. AutoEncoders are a strange mix of supervised and unsupervised learning - allow me to explain. They are trained on data of only one specific kind, such as handwritten images of the number ‘4’ or regular airport travellers going about their business. Their goal is to reduce the dimensionality of the data and then reconstruct it with minimal reconstruction error. So why is this useful for anomaly detection? Well, if you feed the number ‘5’ into a model that’s only been trained on the number ‘4’ - the reconstruction error will be HIGH. The same principle could work to stop criminals and traffickers - by detecting patterns that we may not be aware of yet. This is one new application of unsupervised learning that will be very useful in the future.
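To make that concrete, here’s a minimal sketch in Keras of what I mean - assuming the data is 28x28 digit images flattened to 784 values, with `x_train_fours` and `x_new` as placeholder arrays you’d prepare yourself:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# A minimal autoencoder: compress 784-pixel digit images down to 32
# dimensions, then reconstruct them.
inputs = keras.Input(shape=(784,))
encoded = layers.Dense(32, activation="relu")(inputs)       # dimensionality reduction
decoded = layers.Dense(784, activation="sigmoid")(encoded)  # reconstruction
autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# Train ONLY on the "normal" class, e.g. images of the digit 4.
# x_train_fours is a placeholder: shape (n, 784), scaled to [0, 1].
autoencoder.fit(x_train_fours, x_train_fours, epochs=20, batch_size=128)

# Score new samples: anomalies (like a 5) reconstruct poorly.
reconstructions = autoencoder.predict(x_new)
errors = np.mean((x_new - reconstructions) ** 2, axis=1)
anomalies = errors > np.percentile(errors, 95)  # flag the worst 5% as anomalies
```

The threshold at the end is the part you’d tune - where you draw the line between “normal” reconstruction error and “anomalous” depends on the data.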
Statistics is about what you can quantify. What can you say about your information, and how confident are you in that statement? This makes the interpretability of your information crucial. The most interpretable model is regression, and it’s where ML and statistics cross the most. In regression, we use gradient descent to minimize a ‘cost function’ so that the line through the points has the lowest error. This is easily interpretable because we usually understand what those data points represent, and the slope tells a sensible story.
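Here’s a toy sketch of that idea in NumPy - the data points are made up, and the update steps are just the derivatives of the MSE cost with respect to the slope and intercept:

```python
import numpy as np

# Toy data: y is roughly 2x + 1 plus noise (made-up numbers for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

m, b = 0.0, 0.0   # slope and intercept
lr = 0.01         # learning rate

for _ in range(5000):
    pred = m * x + b
    error = pred - y
    # Gradients of the MSE cost with respect to m and b
    grad_m = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    m -= lr * grad_m
    b -= lr * grad_b

print(f"slope={m:.2f}, intercept={b:.2f}")  # should land near 2 and 1
```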
For machine learning, we don’t care how interpretable a model is, as long as it works. A good metric for measuring model quality is the ROC-AUC score. This tells you how good a model is at telling the difference between two classes. If the ROC-AUC score is 50%, your model is no better than a coin flip at guessing classes. If your model has a score of 90%, it is performing very well. In short, ROC-AUC measures how much better your model is than random guessing.
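scikit-learn makes this a one-liner - the labels and scores below are made up for illustration:

```python
from sklearn.metrics import roc_auc_score

# y_true: the actual classes; y_scores: the model's predicted probabilities
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_scores = [0.1, 0.65, 0.8, 0.9, 0.6, 0.3, 0.7, 0.2]

auc = roc_auc_score(y_true, y_scores)
print(f"ROC-AUC: {auc:.2f}")  # ~0.94 here; 0.5 = coin flip, 1.0 = perfect
```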
Machine learning can pick up patterns in data that are hard to find, and its models can improve without human intervention.
Statistics can unveil insight into data. You may see a difference between two datasets - is that difference significant? (A quick way to check is sketched below.)
Machine learning deals primarily with making predictions, whereas statistics is used for inference about two or more variables.
Machine learning is all about the results; statistics is about the relationships between variables.
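To make the significance question concrete, here’s a minimal sketch using SciPy’s two-sample t-test - the numbers are made up:

```python
import numpy as np
from scipy import stats

# Two made-up samples: do their means differ significantly?
group_a = np.array([5.1, 4.9, 5.3, 5.0, 5.2, 4.8])
group_b = np.array([5.6, 5.4, 5.8, 5.5, 5.7, 5.3])

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (commonly < 0.05) suggests the difference is
# unlikely to be due to chance alone.
```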
The first part of making a website work with Flask is pretty simple - you write your code, you run a command in the terminal, and it works. The part where it’s uploaded to the internet can get complicated fast. Flask tutorials don’t expect you to be deploying anything with TensorFlow, so I had to dig deep to find the one thing they don’t mention: the Aptfile. This is what tells Heroku which system packages are required for the website to work. Another thing that might get overlooked is the Procfile. The Procfile is a simple text file that tells Heroku you’re running a web app. Everything else was basically following the standard upload-file pattern and only allowing .wav files among the file extensions. https://intense-oasis-99800.herokuapp.com/
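For reference, here’s roughly what those two files can look like - assuming the Flask object is named `app` inside `app.py`, and with `libsndfile1` standing in for whatever system package your audio code actually needs:

```text
# Procfile - tells Heroku how to start the web app
web: gunicorn app:app

# Aptfile - system packages for the heroku-buildpack-apt to install
libsndfile1
```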
Machine learning involves a class of algorithms that automate model building. This is based on the idea that machines can recognize patterns and make predictions with little human intervention. Statistics, on the other hand, is used to gain insight from data through interpretable results. Machine learning is more like a soup of data - you don’t really care about the insight you get from the information, and oftentimes it’s near impossible to make an easily interpretable ML model. ML models are good at making predictions, however. But as with every sort of ML model: trash going in, trash going out.
For my gender-predicting project, I had to decide which ML model to use. Random Forest - which builds decision trees independently and averages the results at the end - or Gradient Boosted Trees, which boosts the weak predictors as it goes along and totals the results at the end? Or maybe KNN or Logistic Regression? To find out which was best, I used an ML pipeline that tries each model with different tuning parameters and then reports the results. This worked, but none of my results gave me an accuracy near 80% or an F1 score that made me confident in the model.
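Here’s a rough sketch of what that model search looked like with scikit-learn - `X` and `y` are placeholders for the feature matrix and labels, and the parameter grids are just examples:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# X, y are placeholders for the feature matrix and gender labels.
candidates = [
    (RandomForestClassifier(), {"n_estimators": [100, 300], "max_depth": [None, 10]}),
    (GradientBoostingClassifier(), {"n_estimators": [100, 300], "learning_rate": [0.05, 0.1]}),
    (KNeighborsClassifier(), {"n_neighbors": [5, 15]}),
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
]

# Try every model with every parameter combination, 5-fold cross-validated.
for model, params in candidates:
    search = GridSearchCV(model, params, scoring="f1", cv=5)
    search.fit(X, y)
    print(type(model).__name__, search.best_score_, search.best_params_)
```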
So in the end, I used a voting classifier, which weights the results of each model and ‘votes’ based on those results. This gave me an accuracy of 85% and an F1 score of 91%.
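A minimal sketch of that voting setup in scikit-learn - the weights and the train/test variable names here are illustrative, not my exact configuration:

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Soft voting averages each model's predicted probabilities;
# the weights let stronger models count for more (example weights shown).
voter = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300)),
        ("gb", GradientBoostingClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
    weights=[2, 2, 1],
)

# X_train, y_train, X_test, y_test are placeholder splits of the data.
voter.fit(X_train, y_train)
print(voter.score(X_test, y_test))
```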
Overall I was happy with the results. There was a class-imbalance issue because more men were arrested than women - but I felt that if I did any upsampling or downsampling, it would skew the results the other way. The bias inherent in these models isn’t as big of an issue here, because men get arrested more than women anyway - so the same distribution will occur in practice.
Convolutional Neural Networks are useful for classifying objects that can be defined by their topology. The results aren’t very interpretable, but they are powerful. CNNs use convolution to detect lines and edges. One issue with CNNs is that they require a tensor of a fixed size. This makes them good at dealing with images but not with variable-length time series. Audio does have topological properties, though. Sound produces pressure waves at various frequencies, and with a Fourier transform these can be visualized geometrically. Ok, but now we have another problem - audio samples come in different lengths. To make them digestible for a CNN, you can either take the same amount of time from each sample, or average the amplitude of each frequency over time - so every imported sample becomes a single fixed-size array of frequency values. This is what I did to categorize drum samples. Because drum samples are short, there is less room for error when averaging over the whole clip. I got an overall weighted average of 98%. For my next project it will be cool to try the same thing but with LSTMs.
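Here’s a sketch of that averaging step using librosa - the file path is a placeholder, and the FFT size is just a typical choice:

```python
import numpy as np
import librosa

# Load a drum sample (placeholder path) at a fixed sample rate.
y, sr = librosa.load("kick_01.wav", sr=22050)

# Short-time Fourier transform: rows are frequency bins, columns are time frames.
spectrum = np.abs(librosa.stft(y, n_fft=2048))

# Average each frequency bin over time, so every clip - regardless of
# its length - becomes one fixed-size vector of frequency magnitudes.
features = spectrum.mean(axis=1)   # shape: (1025,) for n_fft=2048

print(features.shape)
```

Stack one of these vectors per sample and you have a fixed-size input the network can train on, no matter how long each original clip was.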