Showing posts with label: machine_learning. Show all posts.

I am still working on my face autoencoder in my spare time, although I have much less spare time lately. My non-variational autoencoder works great - it can very accurately reconstruct any face in my dataset of 400,000 faces, but it doesn't work at all for interpolation or anything like that. So I have also been trying to train a variational autoencoder, but it has a lot more difficulty learning.

For a face which is roughly centered and looking in the general direction of the camera it can do a somewhat decent job, but if the picture is off in any way - there is another face off to the side, there is something blocking the face, the face is at a strange angle, etc it does a pretty bad job. And since I want to try to use this for interpolation training it on these bad faces doesn't really help anything.

One of the biggest datasets I am using is this one from ETHZ. The dataset was created to train a network to predict the age of the person, and while the images are all of good quality it does include many images that have some of the issues I mentioned above, as well as pictures that are not faces at all - like drawings or cartoons. Other datasets I am using consist entirely of properly cropped faces as I described above, but this dataset is almost 200k images, so omitting it completely significantly reduces the size of my training data.

The other day I decided I needed to improve the quality of my training dataset if I ever want to get this variational autoencoder properly trained, and to do that I need to filter out the bad images from the ETHZ IMDB dataset. They had already created the dataset using face detectors, but I want to remove faces that have certain attributes:

  • Multiple faces or parts of faces in the image
  • Images with something blocking part of the face
  • Images where the faces are not generally facing forward, such as profiles

I started trying to curate them manually, but after going through 500 images of the 200k I realized that would not be feasible. It would be easy to train a neural network to classify the faces, but that would require training data, but that still means manually classifying the faces. So, what I did is I took another dataset of faces that were all good and added about 700 bad faces from the IMDB dataset for a total size of about 7000 images and made a new dataset. Then I took a pre-trained discriminator I had previously used as part of a GAN to try to generate faces and retrained it to classify the faces as good or bad. 

I ran this for about 10 epochs, until it was achieving very good accuracy, and then I used it to evaluate the IMDB dataset. Any image which it gave a less than 0.03 probability of being good I moved into the bad training dataset, and any images which it gave a 0.99 probability of being good I moved to the good training dataset. Then I continued training it and so on and so on.

This is called weak supervision or semi-supervised learning, and it works a lot better than I thought it would. After training for a few hours, the images which are moved all seem to be correctly classified, and after each iteration the size of the training dataset grows to allow the network to continue learning. Since I only move images which have very high or very low probabilities, the risk of a misclassification should be relatively low, and I expect to be able to completely sort the IMDB dataset by the end of tomorrow, maybe even sooner. What would have taken weeks or longer to do manually has been reduced to days thanks to transfer learning and weak supervision!

Labels: coding, data_science, machine_learning, pytorch, autoencoders
No comments

PyTorch Update

Thursday 18 April 2019

After another couple of weeks using PyTorch my initial enthusiasm has somewhat faded. I still like it a lot, but I have encountered many disadvantages. For one I can now see the advantage of TensorFlows static graphs - it makes the API easier to use. Since the graph is completely defined and then compiled you can just tell each layer how many units it should have and it will infer the number of inputs from whatever it's input is. In PyTorch you need to manually specify the inputs and outputs, which isn't a big deal, but makes it more difficult to tune networks since to change the number of units in a layer you need to change the inputs to the next layer, the batch normalization, etc. whereas with TensorFlow you can just change one number and everything is magically adjusted.

I also think that the TensorFlow API is better than PyTorch. There are some things which are very easy to do in TensorFlow which become incredibly complicated with PyTorch, like adding different regularization amounts to different layers. In TensorFlow there is a parameter to the layer that controls the regularization, in PyTorch you apparently need to loop through all of the parameters and know which ones to add what amount of regularization to.

I suppose one could easily get around these limitations with custom functions and such, and it shouldn't be surprising that TensorFlow seems more mature given that it has the weight of Google behind it, is considered the "industry standard", and has been around for longer. But I now see that TensorFlow has some advantages over PyTorch.

Labels: python, machine_learning, tensorflow, pytorch
1 comments

PyTorch

Monday 08 April 2019

When I first started with neural networks I learned them with TensorFlow and it seemed like TensorFlow was pretty much the industry standard. I did however keep hearing about PyTorch which was supposedly better than TensorFlow in many ways, but I never really got around to learning it. Last week I had to do one of my assignments in PyTorch so I finally got around to it, and I am already impressed.

The biggest problem I always had with TensorFlow was that the graphs are static. The entire graph must be defined and compiled before it is run and it can't be altered at runtime. You feed data into the graph and it returns output. This results in the rather awkward tf.Session() which must be created before you can do anything, and which contains all of the parameters for the model.

PyTorch has dynamic graphs which are compiled at runtime. This means that you can change things as you go, including altering the graph while it is running, and you don't need to have all the dimensions of all of the data specified in advance like you do in TensorFlow. You can also do things like change the numbers of neurons in a layer dynamically and drop entire layers at runtime which you can't do with TensorFlow.

Debugging PyTorch is a lot easier since you can just make a change and test it - you don't need to recreate the graph and instantiate a session to test it out. You can just run an optimization step whenever you want. Coming from TensorFlow that is just a breath of fresh air.

TensorFlow still has many advantages, including the fact that it is still an industry standard, is easier to deploy and is better supported. But PyTorch is definitely a worth competitor, is far more flexible, and solves many of the problems with TensorFlow.

Labels: python, machine_learning, tensorflow, pytorch
No comments

CatBoost

Thursday 10 January 2019

Usually when you think of a gradient boosted decision tree you think of XGBoost or LightGBM. I'd heard of CatBoost but I'd never tried it and it didn't seem too popular. I was looking at a Kaggle competition which had a lot of categorical data and I had squeezed just about every drop of performance I could out of LGBM so I decided to give CatBoost a try. I was extremely impressed.

Out of the box, with all default parameters, CatBoost scored better than the LGBM I had spent about a week tuning. CatBoost trained significantly slower than LGBM, but it will run on a GPU and doing so makes it train just slightly slower than the LGBM. Unlike XGBoost it can handle categorical data, which is nice because in this case we have far too many categories to do one-hot encoding. I've read the documentation several times but I am still unclear as to how exactly it encodes the categorical data, but whatever it does works very well.

I am just beginning to try to tune the hyperparameters so it is unclear how much (if any) extra performance I'll be able to squeeze out of it, but I am very, very impressed with CatBoost and I highly recommend it for any datasets which contain categorical data. Thank you Yandex! 

Labels: coding, data_science, machine_learning, kaggle, catboost
2 comments

Exercise Log

Tuesday 27 November 2018

I exercise quite a lot and I have not been able to find an app to keep track of it which satisfies all of my criteria. Most fitness trackers are geared towards cardio and I also do a lot of strength training. After spending a year trying to make due with combinations of various fitness trackers and other apps I decided to just write my own, which could do everything I wanted and could show all of the reports I wanted.

I did that and after using it for a few weeks put it online at workout-log.com. It's not fancy and it is quite likely very buggy at this point, but it is open to anyone who wants to use it. 

It's written with Django and jQuery and uses ChartJS for the charts. 

Labels: python, django, data_science, machine_learning
1 comments