Q13: What is the difference between long and wide format data?
Long and wide formats are two different ways of presenting the same tabular data. The long format stores the data as key-value pairs: each subject (Iron man or Spider man) has its data spread across multiple rows, one per variable. In the wide format, each subject has all of its variables in a single row, one column per variable. It is easier to see with an example:
long format
| Name | Variable | Value |
|------|----------|-------|
| Iron man | Color | red |
| Iron man | Material | iron |
| Iron man | Power | nuclear |
| Iron man | Height | 5'8" |
| Spider man | Color | red |
| Spider man | Material | cloth |
| Spider man | Power | food |
| Spider man | Height | 5'10" |
wide format
| Name | Color | Material | Power | Height |
|------|-------|----------|-------|--------|
| Iron man | red | iron | nuclear | 5'8" |
| Spider man | red | cloth | food | 5'10" |
The wide format can become awkward to visualize when there are many variables, but it is very easy to convert between the two formats using Python or R.
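For example, here is a minimal pandas sketch of the round trip, using the toy table above (the column names simply follow the example):

```python
import pandas as pd

# Wide format: one row per subject, one column per variable.
wide = pd.DataFrame({
    "Name":     ["Iron man", "Spider man"],
    "Color":    ["red", "red"],
    "Material": ["iron", "cloth"],
    "Power":    ["nuclear", "food"],
    "Height":   ["5'8\"", "5'10\""],
})

# Wide -> long: melt stacks the variable columns into key-value pairs.
long = wide.melt(id_vars="Name", var_name="Variable", value_name="Value")

# Long -> wide: pivot spreads the key-value pairs back into columns.
wide_again = long.pivot(index="Name", columns="Variable", values="Value").reset_index()

print(long)
```

In R, tidyr's pivot_longer and pivot_wider do the same job.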
Q14: What method do you use to determine whether the statistics published in an article (e.g. newspaper) are either wrong or presented to support the author's point of view, rather than correct, comprehensive factual information on a specific subject?
I think most statistics used in newspapers or on TV are chosen to support the author's point, as illustrated in the book How to Lie with Statistics. It is difficult to tell whether a statistic is actually wrong, since ideally we would need to reproduce the results ourselves. In many cases we cannot redo the work, so I don't have a reliable way to determine whether the statistics are wrong.
Q15: Explain Edward Tufte's concept of "chart junk".
Tufte defines chartjunk as follows:
The interior decoration of graphics generates a lot of ink that does not tell the viewer anything new. The purpose of decoration varies — to make the graphic appear more scientific and precise, to enliven the display, to give the designer an opportunity to exercise artistic skills. Regardless of its cause, it is all non-data-ink or redundant data-ink, and it is often chartjunk.
What he means is that such elements are unnecessary for conveying the main information, or even distract the viewer; the Wikipedia article on chartjunk shows a good example.
Q16: How would you screen for outliers and what should you do if you find one?
An outlier is an observation that lies far away from the other observations. It usually arises in one of two ways: (1) measurement error, or (2) a heavy-tailed distribution. In the first case we want to discard the outlier; in the latter case it carries real information and deserves special attention.
The way I usually screen for outliers is a box plot: if a data point lies more than 1.5*IQR beyond the quartiles, I treat it as an outlier. See more details here.
There is more discussion of whether to drop outliers here.
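A minimal NumPy sketch of that 1.5*IQR rule (the sample data are made up for illustration):

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Return the points outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot rule."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return x[(x < lower) | (x > upper)]

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 17, 19, 107]
print(iqr_outliers(data))  # [102. 107.]
```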
Q17: How would you use either the extreme value theory, Monte Carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
I am not familiar with extreme value theory, so I will talk about using Monte Carlo simulation to estimate the chance of a very rare event.
A very rare event is an event that occurs with a very small probability, so sampling it directly with Monte Carlo is very inefficient. For example, if the event occurs with probability 10^-9, then on average we need a sample of size n = 10^9 to observe just a single occurrence, and far more if we want a reliable estimate of the mean and variance with a sufficiently narrow confidence interval.
Importance sampling (IS) is a powerful tool for reducing the variance of an estimator by increasing the occurrence of the rare event. The basic idea is to change the probability law of the system under study so that the events that matter for the simulation are sampled more frequently. Of course, sampling from this new distribution gives a biased estimator if no correction is applied, so the simulation output must be translated back to the original measure, which is done by multiplying by a so-called likelihood ratio. See the details in chapter 2 of Rare Event Simulation using Monte Carlo Methods by Gerardo Rubino and Bruno Tuffin.
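To make the idea concrete, here is a minimal sketch (my own, not from the book) that estimates the rare-event probability P(Z > 4) for a standard normal Z by shifting the sampling distribution to N(4, 1) and correcting with the likelihood ratio; the threshold and sample size are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
t = 4.0  # estimate p = P(Z > 4) for Z ~ N(0, 1); true value is about 3.17e-5

# Naive Monte Carlo: only a handful of samples are expected to hit the event.
z = rng.standard_normal(n)
p_naive = np.mean(z > t)

# Importance sampling: draw from N(t, 1) so the event is common, then correct
# each sample with the likelihood ratio phi(x) / phi(x - t), whose log is
# -t*x + t**2/2 for normal densities.
x = rng.normal(loc=t, scale=1.0, size=n)
weights = np.exp(-t * x + t**2 / 2)
p_is = np.mean((x > t) * weights)

print(f"naive MC:            {p_naive:.3e}")
print(f"importance sampling: {p_is:.3e}")
```

With the same n, the naive estimate is extremely noisy, while the importance-sampling estimate lands close to the true value with a much smaller relative variance.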
Q18: What is a recommendation engine? How does it work?
A recommendation engine is a system that predicts a user's rating or preference based on that user's activities and other users' activities. An easy example is Amazon: when you browse books, you always see 'Recommended books for you' and 'Other people may like this'. That is Amazon's recommendation engine working from your browsing history and other people's browsing histories.
Collaborative filtering
This approach is based on collecting and analyzing a large amount of information on users' behaviors, activities, or preferences, and predicting what a user will like based on their similarity to other users (from Wikipedia). The assumption is that similar people tend to behave similarly: if we know their past behavior, we can predict their future behavior. For example, I love machine learning and always browse interesting machine learning books online. Another person also loves machine learning and has his own set of books he usually browses. If the system recommends a machine learning book I love to that person, there is a high chance he will love it as well.
Therefore, to use this approach, the first thing to do is collect a lot of data from different users and use it as the basis for predicting future behavior. The advantage of this method is that we do not even need to know the content of the products; all we need is a large dataset of user activity. On the other hand, this is also its disadvantage, since we must first collect that large dataset before the approach works. When you start a new business, you may have to wait a few years before you can start recommending products to your users.
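A minimal sketch of user-based collaborative filtering (the rating matrix and the cosine-similarity weighting are toy assumptions, not a production design):

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: books); 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity between two users' rating vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def predict(user, item):
    """Predict a rating as a similarity-weighted average of other users' ratings."""
    num = den = 0.0
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue  # skip the user themselves and users who haven't rated the item
        s = cosine_sim(ratings[user], ratings[other])
        num += s * ratings[other, item]
        den += abs(s)
    return num / den if den else 0.0

# User 0 behaves like user 1 (who disliked item 2), so the prediction comes out low.
print(predict(user=0, item=2))  # ~1.7
```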
Content-based filtering
In contrast, the content-based approach relies on understanding your products and building a profile of each user, i.e., information about what that user likes. The algorithm then recommends products similar to those the user liked in the past. For example, I love the Iron Man movies and watched Iron Man I, II, and III. If the algorithm recommends the movie 'I, Robot', there is a high chance I will love it (both are among my favorites). But if it recommends a romance movie, I would say there is still room to train the algorithm to be better.
The content-based approach does not need to collect a large dataset as a first step. Instead, you can start recommending products to a user from the very beginning, as soon as he starts choosing your products. But in exchange, you now have to learn the characteristics of your products; that is the trade-off!
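For contrast, a minimal content-based sketch: each item gets a feature vector, a user profile is averaged from liked items, and unseen items are ranked by similarity to that profile. The movies and genre features are made-up examples:

```python
import numpy as np

# Hypothetical item features: [action, sci-fi, romance]
items = {
    "Iron Man":   np.array([1.0, 1.0, 0.0]),
    "I, Robot":   np.array([1.0, 1.0, 0.0]),
    "Love Story": np.array([0.0, 0.0, 1.0]),
}

liked = ["Iron Man"]  # items this user enjoyed in the past

# The user profile is the average feature vector of the liked items.
profile = np.mean([items[name] for name in liked], axis=0)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# Rank unseen items by similarity to the user profile.
for name, feats in items.items():
    if name not in liked:
        print(name, round(cosine_sim(profile, feats), 2))
# "I, Robot" scores 1.0, "Love Story" scores 0.0 -> recommend "I, Robot"
```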
Hybrid model
Hybrid is often the best of both worlds. We can combine the two methods into a hybrid recommender system: by making content-based and collaborative predictions separately and then combining them; by adding content-based capabilities to a collaborative approach (or vice versa); or by unifying the two approaches into one model.
Q19: Explain what a false positive and a false negative are. Why is it important to differentiate these from each other?
This question seems to duplicate questions 4 and 10, so I will skip it.
Q20: Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How do you efficiently represent 5 dimensions in a chart (or in a video)?
Since I mostly use Python in my daily life and research, all my visualization tools are Python packages.
Here are the ones I use most:
- matplotlib, the most powerful and flexible package for producing high-quality figures. Many of the figures I generated for my papers came from it.
- pandas, not only an analysis tool; its plotting capability is also really great.
- Seaborn, which I use mostly for its quick, nice plot styles. Since recent matplotlib versions added built-in styles, I will probably use it less.
- Bokeh, a very nice interactive plotting package. I usually use it to generate standalone interactive HTML files that are easy to pass around.
- Basemap, which is what I usually use for plotting maps.
- cartopy, a nice package for quick map plotting.
- folium, my favorite interactive map plotting package, built on top of Leaflet.
Here are some more nice packages that I occasionally use:
* geoplotlib, a toolbox for visualizing geographical data and making maps.
* mpld3, another interactive plotting package, built on top of D3.
* pygal, another interactive plotting package; it can output figures as SVGs.
* ggplot, plotting based on R's ggplot2.
* datashader, which can create meaningful representations of large amounts of data.
* missingno, a package for dealing with missing or messy data.
I have never used Tableau, R, or SAS, so I cannot say anything about them.
As for showing 5 dimensions: instead of plotting them directly, I usually first check whether I can explain most of the data in lower dimensions, say 2 or 3, using Principal Component Analysis.
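As a small illustration, here is a minimal sketch using scikit-learn's PCA to project the 4-dimensional Iris dataset down to two components before plotting (the dataset choice is just an example):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Project a 4-dimensional dataset down to 2 principal components for plotting.
X, y = load_iris(return_X_y=True)
X2 = PCA(n_components=2).fit_transform(X)

plt.scatter(X2[:, 0], X2[:, 1], c=y, cmap="viridis", s=20)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Iris data projected onto the first two principal components")
plt.show()
```

If the first two components capture most of the variance, a plain 2-D scatter plot already tells most of the story; any remaining dimensions can be mapped to color, marker size, or animation frames.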