Data Science in the Cloud: Networking Tips and Tricks
I have worked on teams that primarily used local servers, remote servers, GCP, and AWS. Working across all of these platforms, I’ve learned a few tricks that can boost your productivity. Here I’ll outline four of my most useful tricks for mastering the seemingly mythical art of networking and working in the cloud.
- Long-running scripts on remote machines
- Running Jupyter notebooks remotely
- Accessing load balancers in a private network through a bastion
- Review: Proxy vs Port Forwarding
Long-running scripts on remote machines
One of the most common problems when moving from a local machine to a remote server is keeping a long-running job from exiting unexpectedly, whether because of a connection error or because something on your local machine terminated the SSH connection. There are a number of solutions to this, but I think the best is a tool called tmux. Let’s walk through the basics of installation, setup and workflow.
Installation and setup
tmux is a terminal multiplexer. It is available through most major package managers, so on Ubuntu you can use sudo apt-get install tmux. There are many good introductions to tmux such as this one, but I’ll try to cover the basics here.
As for configuration, the tmux defaults are cumbersome, but you can customize almost all of them. I like to use this config to get started. That repository has lots of goodies, but the easiest way to get going is to copy its .tmux.conf into your home directory, which gives you some much more intuitive shortcuts.
After installation you can start a new tmux session from the command line with simply tmux. This drops you into a new shell running inside tmux. If your SSH connection is terminated, you can reconnect to this session and it will be exactly as you left it; any programs you had running will continue to run! Now for the basics. It is often useful to split a single terminal window into multiple panes so that you don’t have to ssh in from multiple local terminals.
In tmux, commands are often described as prefix + cmd, which means pressing the prefix key sequence and then some command key. The default prefix sequence is the cumbersome ctrl + b. If you’ve followed my advice above you’ll also have the less awkward ctrl + a available as a prefix. So to split a terminal horizontally you can enter prefix + - and to split it vertically prefix + |. Note that these are not the default mappings. I don’t even want to mention the defaults, since they’re not intuitive… Did I mention you should just use that config?
When you’re done in a tmux session you can detach from it and return to the host’s terminal with prefix + d. Then you can list all running sessions (which you can also rename) with tmux ls. To reattach to a session that you haven’t renamed, you can use tmux attach -t 0. This is extremely helpful when running long scripts remotely. One caveat is that tmux does not handle scrolling very well by default, so if I want to be able to look back through logs I typically send them to a file as well as the terminal using tee, which can be done simply with python my_long_script.py | tee file.log.
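If you’d rather not start the session interactively, the same idea can be scripted. Here is a minimal sketch, assuming bash and tmux are installed. run_in_tmux is a hypothetical helper name (not part of tmux) and the session name training is just an example. It starts a detached session, runs the command inside it, and uses tee to log both stdout and stderr to a file named after the session.

```shell
# Hypothetical helper (not part of tmux): run a command in a detached
# tmux session and log its output to <session>.log.
run_in_tmux() {
    local session="$1"
    shift
    # new-session -d starts the session without attaching to it;
    # 2>&1 sends stderr through tee as well as stdout.
    tmux new-session -d -s "$session" "$* 2>&1 | tee $session.log"
}

# Example (assumes my_long_script.py exists in the current directory):
#   run_in_tmux training python my_long_script.py
#   tmux attach -t training    # watch progress; prefix + d detaches again
```

Note that the session closes once the command finishes, so the log file is the record to go back to. Arguments containing spaces would need extra quoting in this simple version.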
Running Jupyter notebooks remotely
This is an extremely common use case and a logical next step. From above, we know how to run a script on a remote machine without keeping the connection open. But what if I want an interactive Jupyter notebook running on my remote machine, one I can connect to and run commands in at will? And of course those commands need to continue to run and complete if I lose my connection. This is entirely possible, with one caveat: Jupyter notebooks do not handle output to lost connections very well. If your connection is dropped or terminated, you will lose any output printed to the screen in the meantime. For this reason I typically use the method above to run long-running scripts that process the data, and then load the results into a Jupyter notebook for visualization and exploration.
As above, we’ll run our notebook from within a tmux session so that it stays alive, but how do we connect? Port forwarding! Port forwarding is a more advanced SSH technique that allows you to “tunnel” or “forward” ports between your local machine and your remote machine. It works in two directions: Local Forwarding forwards ports from your local machine to the remote machine, while Remote Forwarding forwards ports from the remote machine to your local machine. So to view a Jupyter notebook on a remote instance we want Local Forwarding, which sends traffic from local port 8888 (where we point our browser) to the remote machine.
To set this up, you need to forward the port when you first connect, which can be done with:
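Going by the breakdown that follows (local port 8888, target localhost:8888, host my-jupyter-box), the command takes the shape below. It is wrapped in a hypothetical function, jupyter_tunnel, only so it can be reused; the ssh line inside can just as well be typed on its own.

```shell
# Hypothetical wrapper; my-jupyter-box is the example host name from the text.
jupyter_tunnel() {
    # -L <local port>:<target host>:<target port>, then the machine to ssh into.
    # "localhost" here is resolved on the remote machine, not on your laptop.
    ssh -L 8888:localhost:8888 my-jupyter-box
}
```

This opens a normal interactive SSH session, and the tunnel stays up for as long as that session does.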
Then you can just open a browser and visit localhost:8888 and you’ll see your Jupyter notebook. But let’s also break down this command, because this will be helpful for more advanced situations, such as creating this tunnel through a gateway.
More Advanced Setups
The first 8888 is the port on your local machine. The second part, localhost:8888, is the target of the port forward as seen from the remote machine. Finally, my-jupyter-box is the address you’re ssh-ing to. Now let’s say you have a gateway you need to ssh to, and you want to create a tunnel through the gateway box to another box, jupyter-box. Let’s also say that you’re running a local Jupyter notebook and don’t want to block access to it, so we’ll use a different local port. For this scenario you could use:
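Based on the description that follows, a sketch of that command is below. gateway and jupyter-box are the example host names from the text, and gateway_tunnel is a hypothetical wrapper. One assumption worth checking: the notebook server on jupyter-box has to accept connections arriving from gateway, and Jupyter listens only on localhost by default.

```shell
# Hypothetical wrapper; host names are the examples used in the text.
gateway_tunnel() {
    # Local port 8889 -> (via gateway) -> port 8888 on jupyter-box.
    # jupyter-box is resolved from gateway, not from your local machine.
    ssh -L 8889:jupyter-box:8888 gateway
}
```

If your OpenSSH is version 7.3 or newer, an alternative is to jump through the gateway and forward to the notebook's own localhost with ssh -J gateway -L 8889:localhost:8888 jupyter-box, which avoids exposing the notebook beyond localhost.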
With this command we ssh to gateway and forward any local traffic on port 8889 through gateway to jupyter-box:8888. Now you can open your browser and visit localhost:8889 and you should see your notebook!
One last extra nugget I recommend is using an ssh config, which is just a text file at ~/.ssh/config. I won’t go through all the possibilities, but this config file is a very powerful way to set up shortcuts for these complicated ssh commands so you don’t have to remember them. For example, I always have an entry that looks something like:
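A minimal sketch of such an entry, matching the forwards described next. The HostName and User values are placeholders made up for illustration, and dev_box_config is a hypothetical shell function that simply prints the entry so it can be appended to the config file.

```shell
# Hypothetical helper: print an ssh config entry for a development box.
dev_box_config() {
    cat <<'EOF'
Host dev-box
    # Placeholder address and user name; substitute your own.
    HostName dev-box.example.com
    User my-user
    # Local forwarding for Jupyter: local 8888 -> 8888 on the dev box.
    LocalForward 8888 localhost:8888
    # Remote forwarding for rmate: 52698 on the dev box -> local 52698.
    RemoteForward 52698 localhost:52698
EOF
}

# Usage: dev_box_config >> ~/.ssh/config
```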
With this saved in my ssh config file I can just type ssh dev-box and I will be connected to my development box with local forwarding of 8888 for Jupyter notebooks and remote forwarding of 52698 for rmate remote editing of text files, another favorite tool of mine.
BONUS: Using a virtual environment with jupyter notebooks
This isn’t really network related but it is a great trick. Virtual environments are a great asset to the Python data scientist and essential to my workflow. I even use them with my Jupyter notebooks and you can too.
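One way to set this up is sketched below, assuming python3 with the venv module is available and the machine can install packages from PyPI. register_venv_kernel is a hypothetical helper name; the kernel name my_venv matches the one mentioned next.

```shell
# Hypothetical helper: create a virtual environment and register it
# as a Jupyter kernel of the same name.
register_venv_kernel() {
    local name="${1:-my_venv}"
    # Create the environment in ./<name>.
    python3 -m venv "$name" &&
    # ipykernel is the package that lets Jupyter run code in this environment.
    "$name/bin/python" -m pip install ipykernel &&
    # Register the environment's interpreter as a kernel for the current user.
    "$name/bin/python" -m ipykernel install --user --name "$name"
}

# Usage: register_venv_kernel my_venv
```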
Now in your Jupyter notebook you should have a kernel called my_venv connected to your virtual environment. Neat! Note that these kernels are stored in ~/Library/Jupyter/kernels on OSX and ~/.local/share/jupyter/kernels on Ubuntu.
Accessing load balancers in a private network through a bastion
This one doesn’t come up all that often, but when it does it can be really handy. Let’s say I’m working with a production environment with great security, and we have some internal load balancers, and maybe some monitoring, running in a private VPC or behind a firewall. All I really want to do is pull up a load balancer or host in a web browser, but it’s hidden in this private VPC. What am I to do? We can forward our traffic through a bastion or gateway machine. The command looks like this:
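A sketch consistent with the flags and port discussed next. The host alias bastion is the example name used later in the text, and open_socks_proxy is a hypothetical wrapper around the single ssh command.

```shell
# Hypothetical wrapper; "bastion" is the example host alias from the text.
open_socks_proxy() {
    # -N: do not run a remote command, just keep the connection open.
    # -D 8157: listen on local port 8157 as a SOCKS proxy and send
    #          whatever arrives there out through the bastion.
    ssh -N -D 8157 bastion
}
```

The command stays in the foreground until you stop it with ctrl + c, which also closes the proxy.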
This command is very simple but powerful. First, the -N switch tells ssh not to run any commands remotely, so all this command does is forward ports. Second, -D opens a SOCKS proxy on local port 8157, and any traffic sent to that port is forwarded through the remote host. Now all I have to do is proxy my traffic through port 8157 and it will be routed via the remote machine. Note that this proxy can be configured within your browser or with an extension like FoxyProxy for Firefox. I always keep a FoxyProxy config set up to route traffic through 8157 in case I need to use this trick.
Review: Proxy vs Port Forwarding
Note that the end result of the -D flag is different from that of port forwarding. If my internal host is called my-host and it is serving traffic on port 4000, I will enter my-host:4000 in the address bar with my proxy configured. Note that all my traffic to this host, on any port, goes through the proxy on 8157. As you can tell, these capabilities overlap: I could use a proxy to access a Jupyter notebook, or I could access my-host:4000 without a proxy by running ssh -L 80:my-host:4000 bastion and then entering localhost in the address bar of a browser (binding a local port below 1024 such as 80 usually requires root, so an unprivileged port like 8080 with localhost:8080 is often easier). I think port forwarding is best suited to single ports you need access to, such as a Jupyter notebook, while the proxy trick is great for looking around inside a private network. With it I can access any endpoints I might need to see to diagnose a problem without closing and opening connections.
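The proxy is not limited to browsers. As a rough sketch, assuming the SOCKS proxy on port 8157 is already running and curl is installed, the same internal endpoint can be fetched from the command line. fetch_via_proxy is a hypothetical helper name, and my-host:4000 is the example endpoint from above.

```shell
# Hypothetical helper: fetch the internal endpoint through the SOCKS proxy.
fetch_via_proxy() {
    # --socks5-hostname also resolves the host name on the far side of the
    # proxy, which matters for names that only exist inside the private network.
    curl --socks5-hostname localhost:8157 http://my-host:4000/
}
```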