Sunday, May 21, 2017

Machine learning 11 - Visualize high dimensional datasets

When we work with machine learning datasets, we often have more than the easy 2 dimensions. This makes it difficult to visualize the data and get a sense of how different dimensions relate to each other, or whether there is a hidden structure inside. Today, I will show you the ways I usually use to visualize higher dimensional datasets. You can find all the scripts at Qingkai's Github.
I group the ways to visualize high dimensional data into 2 categories:
  1. Using algorithms to reduce the dimension
  2. Clever ways to plot
Let's first see how to use algorithms to visualize the data. In this blog, we will use the IRIS dataset to show how the different methods work.

Let's first load the data:

from sklearn import datasets
# import the IRIS data
iris = datasets.load_iris()
iris_data = iris.data
Y = iris.target

print('There are %d features'%(iris_data.shape[1]))
print('There are %d classes'%(len(set(Y))))
There are 4 features
There are 3 classes

1 Reduce dimension using algorithms

1.1 Visualize high dimensional data with PCA

Principal Component Analysis (PCA) is the classical way to reduce dimensions. In our case, we have 4 features, which means 4 dimensions, and that is difficult to visualize. With PCA, we can plot the first two components and get a sense of the patterns hidden in the data.
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
%matplotlib inline

# on recent matplotlib (3.6+), this style is named 'seaborn-v0_8-poster'
plt.style.use('seaborn-poster')
# Let's do a simple PCA and plot the first two components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(iris_data)

# plot the first two components
plt.figure(figsize = (10, 8))
plt.scatter(X_pca[:, 0], X_pca[:,1], c = Y, s = 80, linewidths=0)
plt.xlabel('First component')
plt.ylabel('Second component')
The figure above shows the first two components of the PCA. I colored the dots by the 3 classes so that we can see the hidden structure.
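To judge how much of the original structure survives in a two-component plot, one option is to look at the explained variance ratio. The snippet below is a minimal sketch using scikit-learn's `explained_variance_ratio_` attribute; the variable names `ratios` and `total` are only illustrative and not part of the original script.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

# fit a 2-component PCA on the 4 IRIS features
pca = PCA(n_components=2)
X_pca = pca.fit_transform(iris.data)

# fraction of the total variance captured by each kept component
ratios = pca.explained_variance_ratio_
total = ratios.sum()
print('Variance explained by each component:', ratios)
print('Total variance kept in 2D: %.3f' % total)
```

If the total is close to 1, the 2D scatter plot is a fairly faithful summary of the 4D data; if it is low, the plot may hide a lot of structure.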

1.2 Visualize high dimensional data with t-SNE

t-distributed stochastic neighbor embedding (t-SNE) is a nonlinear dimensionality reduction technique that is particularly well-suited for embedding high-dimensional data into a space of two or three dimensions, which can then be visualized in a scatter plot.
from sklearn.manifold import TSNE
X_tsne = TSNE(learning_rate=100).fit_transform(iris_data)
# plot the two embedded dimensions
plt.figure(figsize = (10, 8))
plt.scatter(X_tsne[:, 0], X_tsne[:,1], c = Y, s = 80, linewidths=0)
plt.xlabel('First dimension')
plt.ylabel('Second dimension')
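t-SNE is stochastic and its result depends on the `perplexity` setting, so it can help to fix the random seed and compare a couple of values. The following is only a sketch: the perplexity values 5 and 30 and the seed 0 are arbitrary choices for illustration, not recommendations from the original post.

```python
from sklearn import datasets
from sklearn.manifold import TSNE

iris = datasets.load_iris()

# run t-SNE with two (arbitrarily chosen) perplexity values;
# random_state makes each run reproducible
embeddings = {}
for perplexity in (5, 30):
    tsne = TSNE(n_components=2, perplexity=perplexity,
                learning_rate=100, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(iris.data)

for perplexity, X_emb in embeddings.items():
    print('perplexity=%d -> embedding shape %s' % (perplexity, X_emb.shape))
```

Each entry of `embeddings` can be passed to `plt.scatter` in the same way as `X_tsne` to compare the pictures side by side.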

2 Clever ways to plot

There are also many clever ways to plot the data so that we can get a sense of it. Pandas is the package I usually use to visualize high dimensional data. Here are some examples from pandas visualization:
import pandas as pd
import numpy as np
# let's first put the data into a dataframe
df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= [x[:-5] for x in iris['feature_names']] + ['target'])
df.head()
   sepal length  sepal width  petal length  petal width  target
0           5.1          3.5           1.4          0.2     0.0
1           4.9          3.0           1.4          0.2     0.0
2           4.7          3.2           1.3          0.2     0.0
3           4.6          3.1           1.5          0.2     0.0
4           5.0          3.6           1.4          0.2     0.0

2.1 Scatter plot matrices

We know that a scatter plot is a great tool to visualize the relationship between two variables. If we draw a scatter plot for every pair of variables and arrange the plots into a grid, we get a scatter plot matrix. From these plots, we can easily see whether a pair of variables is related.
from pandas.plotting import scatter_matrix
scatter_matrix(df[df.columns[[0, 1, 2, 3]]], diagonal = 'density')
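A numerical companion to the scatter plot matrix is the pairwise correlation matrix, which puts one number on each panel. This is a small sketch using `DataFrame.corr()` (Pearson correlation by default); the `features` dataframe is built here only so that the snippet runs on its own.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)

# pairwise Pearson correlation between the 4 features
corr = features.corr()
print(corr.round(2))
```

Pairs with a correlation close to 1 or -1 should show up as tight, nearly linear clouds in the corresponding scatter panel.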

2.2 Parallel Coordinates

Parallel coordinates are a common way of visualizing high-dimensional geometry and analyzing multivariate data. They allow one to see clusters in the data and to estimate other statistics visually. With parallel coordinates, points are represented as connected line segments. Each vertical line represents one attribute. One set of connected line segments represents one data point. Points that tend to cluster will appear closer together.
from pandas.plotting import parallel_coordinates
parallel_coordinates(df, 'target')
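Because every attribute shares the same vertical scale in a parallel coordinates plot, features with a large range can visually dominate the others. One possible variation, sketched below under the assumption that min-max scaling is acceptable for your data, rescales each feature to [0, 1] before plotting. The names `df_iris` and `df_scaled` and the `viridis` colormap are illustrative choices.

```python
import pandas as pd
from pandas.plotting import parallel_coordinates
from sklearn import datasets

iris = datasets.load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# rescale every feature to [0, 1] so no single axis dominates
df_scaled = (df_iris - df_iris.min()) / (df_iris.max() - df_iris.min())
df_scaled['target'] = iris.target

# one line per sample, colored by class
ax = parallel_coordinates(df_scaled, 'target', colormap='viridis')
```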

2.3 Andrews Curves

Andrews curves are another way to visualize structure in high-dimensional data. They are basically a smoothed version of parallel coordinates.
from pandas.plotting import andrews_curves
andrews_curves(df, 'target')
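To see what is being drawn, it may help to compute one curve by hand. Each sample (x1, x2, x3, x4, ...) is mapped to the function f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + ... for t between -pi and pi. The helper `andrews_curve` below is a hypothetical name for a minimal NumPy sketch of that formula, not the pandas implementation itself.

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate the Andrews curve of one sample x at the points t."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    # constant term: x1 / sqrt(2)
    result = np.full_like(t, x[0] / np.sqrt(2.0))
    # remaining terms alternate sin and cos with increasing frequency
    for i, coeff in enumerate(x[1:]):
        harmonic = i // 2 + 1
        if i % 2 == 0:
            result += coeff * np.sin(harmonic * t)
        else:
            result += coeff * np.cos(harmonic * t)
    return result

t = np.linspace(-np.pi, np.pi, 200)
sample = [5.1, 3.5, 1.4, 0.2]   # first IRIS sample
curve = andrews_curve(sample, t)
```

Samples that are close in feature space give curves that stay close for all t, which is why the classes show up as bundles of similar curves.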

2.4 Radviz

Radviz is another way of visualizing multivariate data. It is based on a simple spring tension minimization algorithm. Basically, you set up a bunch of points on a plane. In our case they are equally spaced on a unit circle. Each point represents a single attribute. You then pretend that each sample in the data set is attached to each of these points by a spring, the stiffness of which is proportional to the numerical value of that attribute (the values are normalized to the unit interval). The point in the plane where our sample settles (where the forces acting on our sample are at equilibrium) is where a dot representing our sample will be drawn. The dot is colored according to the class the sample belongs to.
from pandas.plotting import radviz
radviz(df, 'target')
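The spring picture described above has a closed-form answer: the equilibrium position is the weighted average of the anchor points, with the normalized attribute values as weights. The function `radviz_positions` below is a hypothetical helper sketching that calculation with NumPy. It assumes that no sample has all of its normalized attributes equal to zero, which would make the weights sum to zero.

```python
import numpy as np
from sklearn import datasets

def radviz_positions(X):
    """Return the 2D Radviz position of every row of X."""
    X = np.asarray(X, dtype=float)
    # normalize each attribute to the unit interval
    W = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    # one anchor per attribute, equally spaced on the unit circle
    m = X.shape[1]
    angles = 2.0 * np.pi * np.arange(m) / m
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    # spring equilibrium = weighted average of the anchors
    return (W @ anchors) / W.sum(axis=1, keepdims=True)

iris = datasets.load_iris()
positions = radviz_positions(iris.data)
print(positions[:3])
```

A sample that is large in only one attribute is pulled toward that attribute's anchor, while a sample with balanced attributes ends up near the center of the circle.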
These are the ways I usually visualize high dimensional data to get a quick sense of it before I build further models. If you have better ways, please let me know.

Sunday, May 14, 2017

No time to blog

I was moving to a new place over the last two weeks, mostly during the weekends, so you can see why I didn't update my blog. Maybe I should learn the skills shown in the following picture:


Anyway, I hope I am finally back now with more time to blog interesting things, stay tuned.


Sunday, April 23, 2017

Machine learning 10 - Funny pictures

The following are funny pictures related to machine learning or data science that I found online. They are a great way to learn some concepts. It is also nice to use some of them in your talks so that students can learn something in a very relaxed environment. Hope you enjoy them too!

Torture your data

Big data is like teenage sex

No trust of large uncertainty

Correlation does not imply causation

Type I and Type II error

Outlier!

Machine learning protests

Different views of machine learning

History of life

Your product vs. Apple and Google

Good code vs. bad code

What does overfitting look like?

Developers are born brave

It never rains in the Bay Area

Deep learning is easy

Interview a data scientist

Change your habit

Label the axis

Information, Knowledge, and Creativity

The deep learning Saga

Acknowledgements

All the figures are from the internet, and I thank all the authors. I found a lot of the pictures at the following links.

Sunday, April 16, 2017

Wife's painting: Tar sands construction site

Trump issued the permit for the Keystone Oil Pipeline on March 24th. I remember my wife went to protest against the same project in 2013, when Obama was president. My wife was really upset about the approval, and she painted the following as a way to express her feelings. You can check out all my wife's paintings.
jpg
This painting is about oil sands/tar sands mining. I got the inspiration after watching the documentary "Before the Flood". I think everyone should watch this movie, especially Americans, because it not only discusses global warming and climate change but also talks about what is happening right now in the U.S. The movie is so touching for me because I think it gets at the essence of the issue. I have never doubted global warming and climate change, but I am curious why so many people still don't believe in them. The U.S. has the most advanced technology, yet it still has lots of people who don't believe in science, and a person who doesn't believe in science can become the country's president. This fact is more unbelievable to me than global warming and climate change. The movie gives its answer to people like me. What's more, it tries to give some suggestions about what we can do as individuals to help protect the environment. I think that is so important because it gives people some hope, especially at a time when we are so disappointed by the government's policies on environmental issues.
In the painting, the most frightening thing for me is that you think it is surreal and then you realize it is real. This painting shows the oil sands/tar sands construction site in Alberta, Canada. The terrible fact is that the way humans destroy our environment is beyond our imagination. Recently, Trump approved a permit for the Keystone Oil Pipeline Project, which will carry oil from Canada to the Gulf Coast. And the Keystone Oil Pipeline begins right here: the oil sands/tar sands mining site in Alberta, Canada. We should know that producing oil from oil sands/tar sands is one of the most destructive ways to get oil. First, they clear all the trees from the land, then they scrape away the shallow layer of topsoil. During the process, they also need to use a lot of fresh water to separate the oil from the sand, causing serious water pollution. A report by the University of Alberta said thousands of birds died in 2010 because they landed on the toxic waste ponds. So, this painting is a real scene, showing how the oil sands/tar sands industry is destroying our lovely nature. And the green in the distance reminds us of this area's ecological environment before oil sands/tar sands mining.
You can see the real tar sands site below:

Saturday, April 8, 2017

Wife's painting: Life on mars

The following painting is another one from my wife; it is surrealist. You can see her description below and what she wants to express in this painting. I hope it also makes us think about what we can do ourselves to protect our planet. You can check out all my wife's paintings.
jpg
Considering all the environmental issues we're facing now, such as climate change and global warming, some people already think about moving humans to Mars rather than dealing with the problems on Earth. Recently, the rocket company SpaceX released its multiplanetary travel plan, hoping to move thousands of people to Mars by 2040. It's a cool idea, but I don't think it's a good idea, especially when I see people so excited and crazy about moving to Mars. As we know, Mars is no heaven; its weather is much harsher and more extreme than Earth's. There is nothing on Mars, absolutely nothing. Its temperature ranges from 27C during the day to -143C at night. It's a real hell rather than a heaven. Also, if humans don't change their greedy and selfish nature, we will still have to face the same problems someday, no matter where we are.
This painting tries to show the scene when people finally move to Mars and build their cities. But it's not an exciting picture. The city is a lifeless concrete forest. It looks like a field of tombstones rather than a real city. These are the graves of all the plants and animals that have become extinct on Earth. There is a man standing in the foreground and looking at the distant Earth. He misses it, misses its colorful and vigorous life; he wants to come back, back to its glorious past. But who is he? We don't know. And I really doubt whether humans can move to Mars before we totally destroy our environment and our mother Earth. So the suit is empty, because we don't know whether we human beings can really survive to see that day. Furthermore, in this painting, animals and plants are kept in glass jars and hung from the sky; they are gifts from heaven. When our environment can no longer support life, we hope we can at least find a way to preserve them so that someday they can revive in our future homeland.

Friday, April 7, 2017

Entrepreneur training 7: Business communications

Business communications

This week, we discussed presentations, emails and meetings in business. I think it is really useful, and a lot of the skills apply beyond business settings.
Presentations
  1. Set the context. This is slide zero (what is the question, why I am here, the theme, the context setting of the presentation)
  2. Set the expectation. State the desired outcome.
Running effective meetings
  • Start with why we are here and ask how to make it a great meeting
  • Agree on objective & the method
  • Leader's job
    • Purpose of the meeting
    • Set the high bar of the meeting
    • Make sure people feel heard
    • Ask & invite opinions
  • Who is making a decision & how
  • Watch out for people who spend too much time steering the meeting in other directions.
What's the CEO's role? The leader's job is to drive consensus.
When you look for jobs in the future, the most important thing is whether the company's culture aligns with you. The second factor is how many hours you need to commute.
Pitching to investors
  • Objectives are clearer
  • Meeting flow is expected (1 hour)
    • Ice breaker
    • Pitch deck (no longer than 20 min, 1-slide, 3-slide, 5-slide, 10-slide versions)
    • Q&A (20-30 min)
    • Wrap up & next steps (if any)
  • How do you know if the meeting went well
Here is a nice video by David Rose - 10 things to know before you pitch a VC
Common Mistakes
  • Over-explaining the obvious
    • And not explaining what really needs to be explained
  • Rushing through slide zero (only two slides are really important: slide zero and the financial projection)
  • Skipping proper transitions
  • Not having a team slide with photos & relevance
Remember, the purpose of the first date is to get the 2nd date. The purpose of the first VC meeting is to intrigue them enough to get the 2nd meeting.
The process from the first meeting with a VC to getting the money usually takes 5 - 8 months.
Process: from the wink to the first meeting is about 2 - 6 weeks; from the first meeting to the talk with the expert is about 2 - 8 weeks. During this time they also want to see whether you are doing well. From the term sheet to the legal due diligence is 2 - 4 weeks.
  • wink - (1 paragraph email)
  • request a meeting (send 2 page executive summary, don’t send the slides)
  • First meeting: try to wow them. If the meeting goes well, they will stop and say "let me see if someone is available"; they will never say no outright
  • 2nd meeting with more people (they try to get the consensus between different )
  • They say "now you should talk with our expert" (this may follow due diligence)
  • If you are doing really well, they will invite you to a partner meeting (the meeting location is very important: a coffee shop, the first floor of the company, the top floor conference room, etc.)
  • Then you will get a term sheet of 4-8 pages stating what they want to invest. It is a non-binding offer, and you have 72 hours to sign it (this is a brief moment of joy you can have before you go back to your emails)
  • Now the legal due diligence starts. This is mostly lawyer to lawyer, and you pay for both sides.
  • Closing
  • After about 48 hours to 3 days, you get the money from the VCs
When you have a multi-horse race, or a rebound (when the VCs were rejected by a similar startup), the process will be compressed.
Following up with investors
  • send additional info based on discussion
  • see whether they are pulling or only you are pushing
A VC usually gets 1000 applications -> 100 closely investigated -> 20 invited to present -> 3 or 4 offers -> 1 - 2 investments per year
Emails
  • Using signatures
  • Salutation
  • Email title (subject line)
  • Suitable length
  • Sections & organization of email body
  • First line, why I am sending you this email
Pitching to Investors
Tell a story
  • saw a problem
    • Unique insights led to a thesis
    • market research validated thesis
    • assembled a team
    • know our customers & their problem & our competition
  • We know how to promote this, we have our assumptions & financial projections
  • We need XX money to get to YY stage
We talked a lot in the past about getting money from VCs (venture capital). In order to better understand them, we need to figure out the following 3 questions:
  • How do they get their money?
  • Who are they accountable to?
  • How do they make money?
Approaching investors
  • Send a short email asking if they are interested in this space [ideally through a referral]
  • Send executive summary to gauge interest and request a meeting
  • First F2F meeting to intrigue them ... Get a 2nd meeting!
Structure of Exec Summary
  1. What problem exists (unmet need) & who has it
  2. Current alternatives & shortcomings
  3. Your solution
  4. Your unfair advantage (magic sauce)
  5. Positioning vis-a-vis other competitors
  6. Market segment targeted & market size
  7. Business model
  8. G2M strategy
  9. Team
  10. Timeline of progress & milestones
  11. Financial projections

Acknowledgements:

All the materials are from the entrepreneurship class at UC Berkeley taught by Naeem Zafar.