Getting Started with Python for Data Science
This is a reference post covering the steps to set up a Python data science environment.
- Setting up Python
- Setting up virtualenv
- Packages to Install
- Standard imports
- Setting up jupyter notebook with virtualenv
- Jupyter Notebooks in the cloud
Setting up Python
Here I’ll cover the basics of setting up a data science development environment on Ubuntu/Debian. First you’ll want to install Python and pip.
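The install itself happens in the shell (on current Ubuntu/Debian releases the packages are typically named python3 and python3-pip, though that is an assumption about your release). Once it finishes, a quick check from Python can confirm that both tools are on the PATH. This is a minimal sketch; the executable names python3 and pip3 and the helper find_tools are illustrative, not part of the original post.

```python
import shutil


def find_tools(names=("python3", "pip3")):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report where each tool was found (executable names are assumptions).
for name, path in find_tools().items():
    print(f"{name}: {path or 'not found'}")
```

If either line reports "not found", the install step did not complete or the executable goes by a different name on your system.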
Setting up virtualenv
Now you have a basic Python setup, but we want to keep our Python packages separate from the system packages, so we’ll use a virtual environment. Virtual environments isolate each project’s dependencies and let you pin the exact requirements your code needs for deployment.
You’ll now see (venv) next to your terminal prompt, meaning you’re working within your virtual environment. Let’s get some packages installed.
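Besides the (venv) prefix in the prompt, you can confirm from Python itself that the interpreter belongs to a virtual environment. A minimal sketch, assuming nothing beyond the standard library; the helper name in_virtualenv is hypothetical.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # releases make sys.prefix differ from sys.base_prefix instead.
    return hasattr(sys, "real_prefix") or sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtualenv active:", in_virtualenv())
print("interpreter:", sys.executable)
```

The interpreter path printed on the second line should point inside your environment's directory when the environment is active.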
Packages to Install
Standard imports
There are some standard rules for how we import packages and it helps to see these ahead of time.
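The original import block is not reproduced here, so the following is only a sketch of the widely used conventions, assuming numpy, pandas, matplotlib, and scikit-learn are the packages in question. The try/except wrapper is an addition so the snippet doubles as a check that the packages are installed in your environment.

```python
missing = None
try:
    import numpy as np                # arrays and linear algebra
    import pandas as pd               # tabular data
    import matplotlib.pyplot as plt   # plotting

    # scikit-learn: import the individual functions and classes you need
    # (these two are illustrative choices, not the post's own list).
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
except ImportError as exc:
    # Record which package failed so it can be installed in the virtualenv.
    missing = exc.name
    print(f"Missing package: {missing}; install it inside the virtualenv first.")
```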
Note that most packages here have a defined standard abbreviation which we will follow and that for scikit-learn functions we will typically import each function individually.
Setting up jupyter notebook with virtualenv
Now we have an environment, but we also want Jupyter notebooks, which are a great tool for exploring data. How can we connect our Jupyter notebook to our virtualenv packages?
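One common approach is to register the virtualenv's interpreter as a Jupyter kernel through the ipykernel package. The sketch below builds that command from Python. It assumes ipykernel is installed in the active environment, and the helper register_kernel with its dry_run flag is hypothetical, added so you can inspect the command before running it.

```python
import subprocess
import sys


def register_kernel(name="my_project_venv", dry_run=True):
    """Build (and optionally run) the command that registers this interpreter as a kernel."""
    # sys.executable is the interpreter of the active virtualenv, so the kernel
    # will see the packages installed there. Requires the ipykernel package.
    cmd = [sys.executable, "-m", "ipykernel", "install", "--user", "--name", name]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


# Print the command without running it.
print(" ".join(register_kernel()))
```

Calling register_kernel(dry_run=False) from inside the activated environment runs the command for real.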
Once this has been done we can start our Jupyter notebook server with jupyter notebook and connect. From there you should see my_project_venv as an available kernel.
Jupyter Notebooks in the cloud
Often my work is done on more powerful servers rather than on my local machine. There are a few tips and tricks I use for working with these.
Enabling password protection
More than likely you’ll want to enable password protection, especially if you follow the step below of keeping your notebook running. To do this you just need to generate a password hash and insert it into your Jupyter config. First we will generate the password hash in IPython:
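Where the passwd helper lives depends on the Jupyter version, so treat the import paths below as assumptions to verify against your install: jupyter_server.auth on newer setups, notebook.auth on the classic Notebook before version 7. The wrapper make_password_hash is hypothetical and only exists to try both locations.

```python
def make_password_hash():
    """Prompt for a password (with confirmation) and return the hashed string."""
    try:
        # Assumption: newer installs, where the helper lives in jupyter_server.
        from jupyter_server.auth import passwd
    except ImportError:
        # Assumption: classic Notebook (before version 7).
        from notebook.auth import passwd
    return passwd()


# In IPython or a notebook cell:
# make_password_hash()
```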
When you run the passwd function you’ll be prompted to create a password and it will create a hash; copy this hash. Now we need to add this hash to the Jupyter config. Generate a config file with jupyter notebook --generate-config. Now add the following to the bottom of the config at ~/.jupyter/jupyter_notebook_config.py:
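The original config lines are not reproduced here, so this is a sketch of what such a fragment usually looks like with the classic Notebook's NotebookApp options. The password value is a placeholder, and newer Jupyter versions use different option names, so check the comments in your generated file before copying.

```python
# Appended to ~/.jupyter/jupyter_notebook_config.py.
# Jupyter provides get_config() when it loads this file, so it is not imported here.
c = get_config()

# Placeholder: paste the full hash string produced by passwd() between the quotes.
c.NotebookApp.password = "<paste your hash here>"
# Do not try to open a browser on the remote server.
c.NotebookApp.open_browser = False
# Port the server listens on; 8888 matches the address used when connecting.
c.NotebookApp.port = 8888
```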
This is modified from jupyter configuration on AWS
Connecting to a remote jupyter notebook
My preferred method of using ssh is with the ~/.ssh/config file, which lets me configure a server once and have all of my settings there whenever I want to connect. I simply add the following to my ~/.ssh/config file for the server I want to connect to:
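The original entry is not reproduced here. As a sketch, the Python helper below renders an entry of the usual shape, with a LocalForward line so that port 8888 on your machine reaches the notebook on the server. The alias dataBox comes from the post; the address 203.0.113.10 (a documentation-only address), the user ubuntu, and the helper ssh_host_entry are all hypothetical.

```python
def ssh_host_entry(alias, hostname, user, port=8888):
    """Render an ~/.ssh/config entry that forwards a local port to the remote notebook."""
    return "\n".join([
        f"Host {alias}",
        f"    HostName {hostname}",
        f"    User {user}",
        # Forward local `port` to the same port on the remote loopback interface.
        f"    LocalForward {port} 127.0.0.1:{port}",
    ])


# Placeholder values; replace them with your server's address and login.
print(ssh_host_entry("dataBox", "203.0.113.10", "ubuntu"))
```

Paste the printed text into ~/.ssh/config, and add an IdentityFile line if you log in with a key file.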
Now from my machine I can just execute ssh dataBox and I will be connected to my data machine. Once I start my Jupyter notebook on the remote machine, I can visit 127.0.0.1:8888 in a web browser and work on the notebook running on the remote machine.
Keeping Notebooks running
First you’ll want to install tmux. It is a very powerful terminal multiplexer and I highly recommend looking into it further, but for our purposes we’ll use it to keep our Jupyter notebook session running.
Once tmux is installed you can start your Jupyter notebook from within tmux, and if your ssh session gets disconnected your notebook will still be running. All you have to do is reopen your ssh connection and your notebook will be connected again. I also recommend using this config.
Rendering on a windowless server
If you run into problems plotting on a remote server and see an error about $DISPLAY not being defined, you need to configure matplotlib’s backend. Add the following to the top of your notebook, in this order:
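The original lines are not reproduced here. A common fix, sketched below on the assumption that the non-interactive Agg backend is what you want, is to select the backend before pyplot is imported. The try/except wrapper is an addition so the snippet also runs where matplotlib is not installed.

```python
try:
    import matplotlib
except ImportError:
    matplotlib = None  # matplotlib is not installed in this environment
else:
    # 1. Select a backend that renders to memory or files and needs no display.
    matplotlib.use("Agg")
    # 2. Only then import pyplot, which binds to the selected backend.
    import matplotlib.pyplot as plt
```

The order matters because pyplot picks its backend when it is first imported, and older matplotlib versions could fail at that point on a machine without a display.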