Hamish Woodrow

Why do we use n-1 for Samples?

2018-04-05T22:00:00+00:00

This is one of those concepts that slipped by me when I was younger and while I was at engineering school. Because I only ever applied it to one kind of problem (dealing only with samples), I always used a single formula to find the variance and its square root, the standard deviation.
The difference I mean is this: when we calculate the variance of a whole population we divide by “n”, and when we estimate it from a sample we divide by “n-1”.

As you can see in the above two formulas, on the left we have the population parameter, the variance calculated by dividing by n: \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2. On the right we have the variance of a sample of that population, where dividing by n-1 gives what is called an unbiased estimate: s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2. The other slight variation is that we use the true mean \mu for the population (it is a parameter of the system) and the sample mean \bar{x} in the right-hand equation.

N.B. a sample is simply a subset of the population; for example, measuring the number of cars passing a junction for a week is a sample. The use of n-1 in the calculation of the sample variance and sample standard deviation is referred to as Bessel’s correction.

Intuitively

When we know every single value in the population we can calculate the variance exactly, and this forms a parameter of the system. When we take a sample, we are only getting part of the information: a small bit of knowledge about the total spread of the full population. Simply because we have fewer data points, we are likely to see less spread away from the mean. By only taking a subset of the points we are very unlikely to capture the true variance of the system.

For example, say we have, as illustrated below, a population of 20 points and we sample 8. The majority of the points are grouped around the mean, so when we take our sample we are more likely to select points close to the mean and less likely to get points that deviate further from it. The plot below shows a couple of different realisations of random sampling. What we can see in each one is that the sample points are more closely grouped around their own sample mean. The sample mean will always sit in the middle of your sample, but it could be completely different from the true mean; you may even have scenarios where the population mean does not fall within your sample set at all. Therefore, by taking a sample we tend to capture less of the variance and are likely to underestimate the variance of the system.

Therefore, if we divide by n to get the variance of the sample we get a biased estimate, since it is likely to underestimate the true variance of the population. To correct for this we use n-1 (a smaller denominator), which increases the calculated variance so that, on average, the sample variance matches the true variance of the system.
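The bias can be checked with a small simulation. This is only a sketch under assumptions of my own: the population (10,000 normally distributed values), the sample size of 8 and the number of trials are all made up for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical population: 10,000 draws from a normal distribution.
population = [random.gauss(50, 10) for _ in range(10_000)]
true_variance = statistics.pvariance(population)  # divides by N

n = 8            # sample size
trials = 20_000  # number of repeated samples

biased, unbiased = [], []
for _ in range(trials):
    sample = random.choices(population, k=n)  # sampling with replacement
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)  # squared deviations from the sample mean
    biased.append(ss / n)          # divide by n
    unbiased.append(ss / (n - 1))  # divide by n-1 (Bessel's correction)

avg_biased = statistics.fmean(biased)
avg_unbiased = statistics.fmean(unbiased)

print(f"true variance:      {true_variance:.1f}")
print(f"average using n:    {avg_biased:.1f}")
print(f"average using n-1:  {avg_unbiased:.1f}")
```

Averaged over many samples, the n version should settle at about (n-1)/n of the true variance (7/8 here), while the n-1 version should settle close to the true variance itself.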

I have posted a nice little video, which is available through Khan Academy, and a small tool which gives you an intuitive visualisation of why using n-1 gives us an unbiased estimate of the population variance.

More Mathematically

When we have

Common Vocabulary used in statistics

2018-03-20T22:00:00+00:00

This is a resource of common terms I come across whilst reading papers and which were unclear to me before I read the paper.

Degrees of Freedom

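The number of values used in a calculation which are free to vary. For example, once the mean of n values is fixed, only n-1 of them are free to vary, because the last one is determined by the others.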

Difference between a statistic & a parameter

A parameter is a definitive value; it is the truth. What is the average age of my class? If I define my class as the population and ask all 20 of them, then the average I get is a parameter. A parameter is irrefutable.
Now it is rare that we are able to ask everyone in a population; generally a population is much larger than it is either physically or economically possible to ask. This is where a statistic comes in. A statistic is an estimate of a parameter of our system based on a sample. What is important is that, since a statistic is based on a sample, it is disputable: it depends on my sample. This is where the whole realm of sampling techniques comes into play, to try to ensure that our sample is representative. In addition, with a sample’s statistics we can develop the notion of a confidence interval, which we will come onto later.
Getting representative data and constructing your analysis in the right way is exceptionally complicated. For example, a lot of the changes I am working on just now for a company relate to designing new parts of the site. We have a current group of paying customers which we use to test modifications for the new site. The problem we encounter is that the average age of our current customers is in the 40s, and we are looking to design a site with mass appeal, to attract more customers. So our design changes are validated statistically by our current set of customers, who may not be representative. Perhaps we may have to adapt our strategy and wait longer to have a sufficiently diverse population to make decisions on, but then too long a test could be costly. Getting a good statistic is hard at times!

Endogenous Variable

This is a term we come across often in linear regression, in particular from the point of view of econometrics. It has parallels to a dependent variable and, in essence, is a variable which can be fully described by a model (for example a set of simultaneous equations). For example the price of a house may be related to its location, size, number of bedrooms etc. Therefore, the price of the house could be described as an endogenous variable. In reality it is not completely determined by the system of equations, as there will always be variables which the model does not take into account, for example personal feeling towards the house, the type of neighbours etc.
The etymology of the word, from Greek, is apt in describing what it means: ‘endo’=’within’ and ‘genous’=’producing’. Therefore, we can think of it as a variable produced from within the system.

Exogenous variables

In contrast to the above, these are variables which are not captured or explained by other variables in the system (‘exo’=’outside’). They can be thought of as similar to independent variables. In the above example of the price of the house, a variable outside the system is, for example, the selling ability of the estate agent. Their personal ability is not affected by the house price; it is independent of the system and is fixed on entering the system. That said, the exact separation of exogenous and endogenous variables can be somewhat complicated, as determining the boundary of your system can be quite subjective.

References

[1] http://www.statisticshowto.com/endogenous-variable/

Simple Methods of understanding data (range, variance, standard deviation)

2018-03-13T22:00:00+00:00

The Problem with the mean

So I have just finished up a post describing the relative merits of the arithmetic mean and its beautiful simplicity. However, the arithmetic mean, along with any other representation of the central tendency, has limits in explaining data to us. Imagine two sets of numbers: [-10,0,10,20,30] ==> mean = 10 & [8,9,10,11,12] ==> mean = 10. These two data sets have the same mean, but the spread (dispersion) of the data is drastically different. That is why we need to explore additional methods for describing our data. This is where variance comes in: the mean and median are there to give us an indication of the central tendency, but in order to summarise the data properly we need further summary ‘statistics’.

I think this is an important point which I find beautiful about stats but which I only came to understand later on. We are on a mission to capture the essence of our data by summarizing it. In the same way that a blurb gives an insight into a book, summary statistics give you an idea of what your data is like without having to look at each piece of it. Fundamental stats is something our brains do intuitively (even if we do it badly by adding our own bias); it is the way we think. When we say the “bus usually arrives between 4.30 & 4.40”, we are describing spread, we are describing the variance of the data.

How to describe spread

The simplest way to look at the spread of the data is to look at the range. This is just the max minus the min. In general it has little practical use when looking at a data set, as it does not take into account the frequency of the occurrences and is therefore highly influenced by outliers. Still, it is always good to understand the limits of your data, and we do use it in scenarios such as looking at the high/low of a stock price. It also provides a good quick way to see if you have outliers in your data which you need to look at.
A more standard metric of spread is the variance and, by transformation, the standard deviation.

Variance

The variance is the average of the squared deviations of every data point from the mean. The square is there so that positive and negative deviations do not cancel each other out; we care about the size of each deviation, not its sign. It is interesting to link variance to how you evaluate predictive statistical models, where we look at the errors (the deviations of the predictions from the actual values). That is the same as what we are doing with variance: we take a very simple model, predict every data point by the mean of the points, and evaluate the deviations. Variance is denoted by sigma squared (\sigma^2).

Taking the example above:
variance_1 = [(-10-10)^2 + (0-10)^2 + (10-10)^2 + (20-10)^2 + (30-10)^2]/5 = 200
variance_2 = [(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2]/5 = 2
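The same calculation can be done with Python's standard library. A minimal sketch: statistics.pvariance divides by n, which is what the hand calculation does (statistics.variance would divide by n-1 instead).

```python
import statistics

data_1 = [-10, 0, 10, 20, 30]
data_2 = [8, 9, 10, 11, 12]

# pvariance treats the data as a whole population and divides by n
variance_1 = statistics.pvariance(data_1)
variance_2 = statistics.pvariance(data_2)

print("means:", statistics.mean(data_1), statistics.mean(data_2))
print("variances:", variance_1, variance_2)
```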

And there is a case in point of using statistics to explain your data: with the mean the two sets looked the same, and now with the variance we know that the data in example 1 is much more dispersed and, as a result, we should expect values further from the mean if we take a sample.

Things to think about with variance

First off, the value of the variance is in itself quite a strange quantity: its magnitude gives an indication of the spread, but the actual value is not very descriptive. For example, when estimating bus arrival times the variance is expressed in min^2, which is not very useful to me. This is where the standard deviation comes in.

The second thing to think about is the impact of outliers on the variance. Because we square the differences, points which are very far from the mean dominate the value of the variance. Try it with the second example: if for some reason you had an erroneous point, like someone typing 120 instead of 12, the variance jumps from 2 to roughly 1955. It is really important to understand the likely spread of your data and to look at methods of removing anomalous points. This is a problem I faced when dealing with large quantities of data entered by hand; there was a lot of pollution in the data, and finding ways to deal with it is important. Anomaly detection is another topic: you can use statistical models, but also think about using physical models where possible. For example, in engineering infrastructure, physical principles will likely set bounds for your model.
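To see how strongly a single bad point pulls on the variance, here is a quick check of the typo example. It is a sketch using statistics.pvariance, which divides by n as in the hand calculations earlier in this post.

```python
import statistics

clean = [8, 9, 10, 11, 12]
typo = [8, 9, 10, 11, 120]  # 12 mistyped as 120

var_clean = statistics.pvariance(clean)  # divides by n
var_typo = statistics.pvariance(typo)

print("variance without the typo:", var_clean)
print("variance with the typo:   ", round(var_typo, 2))
```

The single mistyped value also drags the mean from 10 up to 31.6, so every other point now looks far from the mean as well.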

Standard deviation

This is the little brother of variance, and most likely the one used more often at the day-to-day level. It is the square root of the variance: \sigma = standard deviation = sqrt(variance)

The standard deviation is nice because it returns the variance to units and a magnitude that we can understand and compare to our data. Also, because of its construction (mainly the introduction of a square), it has a lot of nice properties in statistics.

Considerations with Standard Deviation and variance - skewed data

Something to consider, and the same goes for both the standard deviation and the variance, is that both use the mean to calculate the spread of the data. That is great if you have normally distributed data (which does happen quite a lot), as the mean is a good representation of the central tendency of the data. However, if the data is heavily skewed (positively or negatively), then the deviation will be hugely impacted by the outliers at one end of the skew. This makes these statistics less useful. When trying to understand the spread of skewed data it is often better to use the median and inter-quartile range (IQR), as you will be better able to capture the spread around the median on both sides.
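A small comparison makes the point. The data set below is invented for illustration (a few very large values at one end), and the quartiles use the default method of statistics.quantiles; other quartile conventions give slightly different cut points.

```python
import statistics

# Hypothetical right-skewed data: mostly small values, a few very large ones.
skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40, 90]

mean = statistics.mean(skewed)
stdev = statistics.pstdev(skewed)  # square root of the variance (dividing by n)

median = statistics.median(skewed)
q1, q2, q3 = statistics.quantiles(skewed, n=4)  # the three quartile cut points
iqr = q3 - q1

print(f"mean = {mean:.2f}, standard deviation = {stdev:.2f}")
print(f"median = {median}, IQR = {iqr}")
```

The mean and standard deviation are dominated by the two large values, whereas the median and IQR describe where the bulk of the points actually sit.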

History of Variance and Standard deviation

Again, like many of these recognisable terms, their exact origin has many people to attribute it to. In terms of coining the term variance and giving its formal definition, it was Ronald Fisher (1918), in his paper on population genetics, which was submitted to the Royal Society of Edinburgh, in my home town. The general use of a quantity similar to variance dates back further than this. Gauss again comes up, in his analysis of the position of stars. He had many measurements for the position of a single star but was trying to give its most likely position. He assumed that a star had a fixed position and that the measured points would therefore simply be deviations away from this fixed zero position. These errors were described by a probability distribution around the true position of the star. The parameter determining the spread of that distribution around the central tendency was expressed at the time as a parameter called precision. Precision is in fact related to squared deviations, which in turn are related to the variance.

References

[1] Khan academy video, https://youtu.be/E4HAYd0QnRc

[2] Image of skewed data, http://schoolbag.info/physics/physics_math/71.html

[3] Earliest Known Uses of Some of the Words of Mathematics, http://jeff560.tripod.com/mathword.html

[4] Why is variance calculated by squaring the deviations?, https://www.researchgate.net/post/Why_is_variance_calculated_by_squaring_the_deviations

Arithmetic mean - The de facto average - Its past and present

2018-02-26T22:00:00+00:00

Intro

So this may seem a ridiculously basic blog to write; the arithmetic mean is, after all, something that we learnt either intuitively or at least at a very young age. But this is part of a series of articles I will write on basic terms and on statistics in general. I have been going back over my maths knowledge and finding many fundamental gaps: I know how to implement something, but why I am implementing it is a question I have not been able to answer. I have also taken a somewhat canonical viewpoint to my learning, understanding the history of where the formulas or theories we use come from. I have also had a gripe about how I learnt statistics in formal education. What is beautiful about statistics and probability is that the reasons they were developed - at least for most of the fundamentals - have such direct applicability to day-to-day life, and they were researched and formalized due to very tangible needs.

The Definition

The arithmetic mean is one of a set of three commonly used averaging methods (harmonic and geometric being the others). It is simply the sum of all the observations divided by the number of observations. In more precise terms it is a description of the central tendency of a set of data. Fundamentally every point is given the same weighting (this is both a positive and a negative, which we exploit in distance metrics). For positive numbers, the arithmetic mean is always greater than or equal to the geometric mean, which in turn is greater than or equal to the harmonic mean.
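The three means can be compared directly with Python's statistics module (geometric_mean needs Python 3.8 or later). The values are an arbitrary positive example of my own.

```python
import statistics

values = [2, 4, 8]  # hypothetical positive observations

arithmetic = statistics.mean(values)           # (2 + 4 + 8) / 3
geometric = statistics.geometric_mean(values)  # (2 * 4 * 8) ** (1 / 3)
harmonic = statistics.harmonic_mean(values)    # 3 / (1/2 + 1/4 + 1/8)

print(f"arithmetic = {arithmetic:.3f}")
print(f"geometric  = {geometric:.3f}")
print(f"harmonic   = {harmonic:.3f}")
```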

Origin

There was no written law saying that we had to average data by the arithmetic mean; it is not some form of fundamental axiom, but it is as natural for us to use it as adding numbers together. Therein lies the reason it was formalized: it is so intuitive. It is our method of making sense of all the data signals we have: what time the bus arrives in ‘general’, how many emails I get on an average day… Humans are bad at dealing with lots of data, but summarize all this data into a single value and we are good. The need for an average was put more formally when people began recording data and, like much of early statistics, it benefitted from the analysis of space.

It started a long time ago, the first written account being with the Babylonians. They were tracking the positions of the moon, sun and planets and trying to summarize all their different recordings. It is however more traditionally attributed to Hipparchus (190 BC), a Greek astronomer. It was later formally defined by Thomas Simpson, who presented his definition to the Royal Society.

Gauss, a guy I will obviously be coming back to in later articles on the origins of statistics, said this: “It has been customary certainly to regard as an axiom the hypothesis that if any quantity has been determined by several direct observations, made under the same circumstances and with equal care, the arithmetical mean of the observed values affords the most probable value, if not rigorously, yet very nearly at least, so that is always most safe to adhere to it.”

I like this quote as it comes back to a recurring theme in early research in statistics and, like most of mathematics, asks: how do I understand the world around me? This is the core of what I see in data science and statistical modeling: how do I take all these observations and derive value from them so I can make a better decision or a better prediction? At the simplest level we do this in our head: the bus arrived at 2.15, 2.20, 2.12, so what time will it probably arrive today? Telling me a hundred bus times is in itself useless, but combining them into a single number is what lets me decide when to leave the house. The objective stays the same as we extend it; maybe we start to add traffic data, weather data etc., but the idea remains the same as the simple objective that Gauss outlines, “[what value affords the most probable value]”.

What does the arithmetic ‘mean’

Here I mean: why is it called the arithmetic mean? It comes from the notion of an arithmetic series, a series in which each term differs from its neighbours by a fixed amount. In such a series every term is the mean of the two terms adjacent to it (for 2, 5, 8 the middle term is (2+8)/2 = 5), and we get that mean by adding up the adjacent terms, hence arithmetic.

Why is it the biggest mean

I really liked an explanation on Quora of why the arithmetic mean is the bigger of the means (a link is provided below, I will paraphrase it here). If we restrict ourselves to two values we can look at it through the lens of geometry. In the circle below we have a and b, which together make up the diameter. We also have a line labelled h (half of a chord, perpendicular to the diameter) at the point which divides the diameter into a and b. The arithmetic average of a & b will always be the radius (a+b = diameter = 2*radius ==> radius = (a+b)/2). You can move the chord h and change the sizes of a and b, but they will always sum to the same value.
For the geometric mean, however, you multiply a & b and take the square root: geo_mean = (ab)^0.5. Now let’s look at the chord h. To calculate the value of h we should recognize that theta, the angle adjacent to a, is equal to the angle opposite b. From this we see that tan(theta) = h/a & tan(theta) = b/h; bringing these together we see that h^2 = ab ==> h = (ab)^0.5 == geometric mean. Therefore, the geometric mean is represented by the chord h and the arithmetic mean is the constant H (the radius). The geometric mean has a maximum value of H, hence the inequality geometric mean <= arithmetic mean. It also shows nicely how the relative sizes of a and b affect the geometric mean while the arithmetic mean is indifferent to them. This shows a limitation as well: in the case that a is 0 and b is 6, the average would still be 3, which is not really a representative value for the central tendency of the group. Imagine if this was hourly salary: if one person worked for free and the other earned $6/hr, it may not be representative to say they earn 3 dollars on average.
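The geometric argument can be checked numerically. In this sketch the diameter is fixed at 6 (so the arithmetic mean is always 3) and the split point is moved along it; the function name is made up for illustration.

```python
import math

DIAMETER = 6.0
RADIUS = DIAMETER / 2  # arithmetic mean of a and b, whatever the split

def half_chord(a, diameter=DIAMETER):
    """Height h above the point that splits the diameter into a and b."""
    b = diameter - a
    return math.sqrt(a * b)  # h^2 = a*b, so h is the geometric mean

for a in [0.0, 1.0, 2.0, 3.0, 4.5, 6.0]:
    h = half_chord(a)
    print(f"a={a:.1f} b={DIAMETER - a:.1f} arithmetic={RADIUS:.2f} geometric={h:.2f}")
```

The geometric mean only reaches the radius when a = b, and drops to 0 when either value is 0, while the arithmetic mean stays at 3 throughout.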

References

[1] https://www.quora.com/Why-is-the-arithmetic-mean-always-greater-than-or-equal-to-the-geometric-mean

[2] http://www.famous-mathematicians.com/hipparchus/

Anomaly Detection Library in Python

2018-01-25T22:00:00+00:00

After coming across a number of problems where I needed a simple anomaly detection algorithm, I decided to start building out a library containing useful functions for both identifying anomalous data points and visualising them. I have a couple of blogs explaining the techniques used, which I reference below. I have added links to the GitHub repo as well as a Jupyter notebook with an example implementation.

Link to Github Repo

Link to Jupyter Notebook

Current Project Status

This is a time series anomaly detection algorithm implementation. It is currently focussed on catching multiple anomalies in your time series data, dependent on the confidence level you wish to set.

The anom_detect.py can be downloaded and imported, alternatively you can follow the Jupyter notebook to look at an example implementation (links provided above). The result of implementing this method is the generation of plots (shown below) and tables displaying the detected anomalies in your data.

As a general suggestion for anomaly detection, you should get to know your data. This library is a simple implementation which looks at whether the deviation of a point from the trend of the data is explained by the variation of the dataset or not. The selection of the significance level also depends on your ability to process anomalous points.

Implementation

The algorithm computes a moving average based on a certain window size. The moving average method used implements a simple low pass filter (using discrete linear convolution). This sounds complicated but it is not so bad (I will upload a blog to explain); it is nicer than rolling average methods, which don’t deal with the boundaries of your data very well (early time data is not properly averaged).
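As a rough illustration of a moving average computed by discrete convolution, here is a sketch using NumPy. It is not the code from anom_detect.py; the function name, the flat kernel and the example series are assumptions of mine.

```python
import numpy as np

def moving_average(values, window):
    """Smooth a 1-D series with a flat (box) kernel via discrete convolution."""
    kernel = np.ones(window) / window
    # mode="same" keeps the output the same length as the input (for
    # window <= len(values)); beyond the ends the input is treated as zero.
    return np.convolve(values, kernel, mode="same")

# Hypothetical series with one obvious spike.
series = np.array([10.0, 10.0, 10.0, 10.0, 40.0, 10.0, 10.0, 10.0, 10.0])
trend = moving_average(series, window=3)
residuals = series - trend  # deviation of each point from the trend

print(np.round(trend, 2))
print(np.round(residuals, 2))
```

With mode="same" every point gets a trend value, but the first and last points are averaged against implicit zeros, so the edges of the trend deserve some caution.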
Then, using the moving average as the trend of the data, each point’s deviation from the moving average is calculated. The generalized extreme Studentized deviate (ESD) test - an extension of the Grubbs test to multiple anomalies - is then used to evaluate the actual data points and identify whether they are likely to be anomalous, dependent on a user-set significance level (alpha).
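For reference, the generalized ESD test can be sketched in a few lines with NumPy and SciPy. This is my own minimal version of the standard test, not the library's implementation; the function name and the example residuals are made up, and max_outliers must be at most n-2.

```python
import numpy as np
from scipy import stats

def generalized_esd(values, max_outliers, alpha=0.05):
    """Return the indices flagged as outliers by the generalized ESD test."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    remaining = list(range(n))  # indices not yet removed
    removed = []                # indices in the order they were removed
    n_outliers = 0

    for i in range(1, max_outliers + 1):
        current = x[remaining]
        deviations = np.abs(current - current.mean())
        std = current.std(ddof=1)
        if std == 0:
            break  # all remaining points are identical
        pos = int(np.argmax(deviations))
        r_i = deviations[pos] / std  # test statistic for this round

        # Critical value lambda_i based on the t distribution.
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lam = (n - i) * t / np.sqrt((n - i - 1 + t**2) * (n - i + 1))

        removed.append(remaining.pop(pos))
        if r_i > lam:
            n_outliers = i  # keep the largest i for which r_i exceeds lambda_i

    return removed[:n_outliers]

# Hypothetical residuals (data minus moving average) with one large deviation.
residuals = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15, -0.05, 0.2, -0.15, 10.0]
flagged = generalized_esd(residuals, max_outliers=3)
print(flagged)
```

The test removes the most extreme point in each round and compares its statistic against a critical value; the number of anomalies is the largest round in which the statistic exceeded that value.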

Pandas tricks - Aide Memoire

2017-12-20T22:00:00+00:00

This is a small resource I have used to record functions that I commonly use and I also commonly forget in Pandas.

Dividing Values in the same row by one Another (pandas.DataFrame.apply)

For example, say I am given the first two columns of the table below and I want to calculate the amount spent per day (third column) by dividing the total_amount by the number of days.

total_amount	days	amount_per_day
10	4	2.50
8	4	2.00
12	3	4.00

This is achieved using the apply method. You pass the apply method the function you wish to manipulate each row with. Lambda functions are particularly useful if the operation is simple; alternatively you can pass a function defined elsewhere as a parameter. In both cases a column can be referred to by using the column’s name as a Series index, which returns the value of that column for the given row.

I have shown two examples below. The first uses a lambda function to calculate the amount spent per day for a transaction by dividing the total amount by the number of days. This is nice and short, so a lambda function works very well. As the second example shows, it is also possible to pass apply a named function, which keeps the row operations separate.

In both operations we want to treat each row individually when applying the function, therefore axis=1 is set. The default is axis=0, which applies the function to each column.

df['amount_per_day'] = df.apply(lambda x: round(x['total_amount'] / x['days'], 2), axis=1)

Second example, same result just achieved by passing apply a function:

def average_expense(x):
    # x is a single row (a Series); columns are accessed by name
    av_exp = x['total_amount']/x['days']
    return av_exp

df['amount_per_day'] = df.apply(average_expense, axis=1)
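For a calculation as simple as this one, plain column arithmetic gives the same result without apply. This is an alternative I am adding as a sketch, using the example table from above.

```python
import pandas as pd

df = pd.DataFrame({"total_amount": [10, 8, 12], "days": [4, 4, 3]})

# Arithmetic between two columns is applied element-wise, row by row.
df["amount_per_day"] = (df["total_amount"] / df["days"]).round(2)

print(df)
```

apply with axis=1 is still the more flexible option when the per-row logic does not reduce to simple column operations.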

Link to docs

Filtering out non-numeric values from columns

If you have done much recovery of data from old Excel sheets you will find that they are often filled with mixed types. Something I come across commonly is people putting text into predominantly numerical columns; for example, in the order_id column of a spreadsheet a user may have put someone’s name, ‘Joe Bloggs’ / ‘Error’ / ‘stock purchase’. Unfortunately types are not well controlled compared to databases, but Excel files, especially in older companies, contain lots of useful data. This means that when you try to join tables or just try to insert your Excel data into SQL you will get lots of errors. This was frustrating me in Pandas and I used the code snippet below to solve my problems.
It requires defining a regular expression (in my case it checks that the entry is made up only of digits) and then using the str.match method to return a bool for each entry depending on whether it matches the regular expression. The column is converted to strings first so that numeric cells are checked too, rather than returning NaN.

index	order_id	amount
1	4000	2.50
2	4001	2.00
3	missing	4.00

import re

num_filt = re.compile('^[0-9]*[0-9]$') # define regex filter: one or more digits and nothing else

df_test['order_id'].astype(str).str.match(num_filt) # returns a boolean Series: True where the regex is satisfied

Therefore, to return only the rows which contain a valid order_id, you can filter the data table based on the above bool.

df_test[df_test['order_id'].astype(str).str.match(num_filt)] # the boolean Series is used to filter the table rows

index	order_id	amount
1	4000	2.50
2	4001	2.00
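Another option, which I am adding here as a sketch rather than what I originally used, is pandas.to_numeric with errors="coerce". Note that it is slightly more permissive than the digits-only regex, since it also accepts values such as "4.5".

```python
import pandas as pd

df_test = pd.DataFrame(
    {"order_id": [4000, 4001, "missing"], "amount": [2.50, 2.00, 4.00]},
    index=[1, 2, 3],
)

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric_ids = pd.to_numeric(df_test["order_id"], errors="coerce")

# Keep only the rows where the conversion succeeded.
df_valid = df_test[numeric_ids.notna()]

print(df_valid)
```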

Converting data post groupby into bools for count

df.astype(bool).astype(int)
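One situation where this is handy, shown here with made-up data and column names, is turning a table of counts from a groupby into 0/1 flags so that you can count how many groups had at least one entry.

```python
import pandas as pd

# Hypothetical purchases: one row per item bought.
df = pd.DataFrame({
    "customer": ["a", "a", "a", "b", "b"],
    "category": ["food", "food", "books", "food", "food"],
})

# Count purchases per customer and category; missing pairs become 0.
counts = df.groupby(["customer", "category"]).size().unstack(fill_value=0)

# Any non-zero count becomes True and then 1; zero stays 0.
flags = counts.astype(bool).astype(int)

# Number of distinct categories each customer bought from.
categories_per_customer = flags.sum(axis=1)

print(flags)
print(categories_per_customer)
```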

Structuring Data Science Part 2

2017-11-06T22:00:00+00:00

Snapshot

A secondary method for structuring online data science resources. A video presenting the project and initial results is below, I will update with the full blog soon.

Overview of Methods Used

I used topic modeling to relate technical resources to one another, extracting the fundamental concepts in a text and creating links to other resources by their concept similarities. This allows a user to read a journal article and directly see videos or blogs covering those same concepts.

TFIDF+SVD pipeline created to generate topic similarities based on concept vocabulary created.
Network graph created for topic ontologies and PageRank used in recommendation engine.
MongoDB database created for hosting the different data collections.
Multiple web spiders developed to mine blog pages and video transcripts with a corpus of 80k blog pages, 30k papers and 8k video transcripts curated.
Deployed through a Flask-generated website.
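To give a flavour of the TF-IDF + SVD step, here is a minimal sketch using scikit-learn. It is not the project's pipeline: the four toy documents, the number of components and the variable names are all assumptions for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus: two statistics texts and two pandas texts.
docs = [
    "variance and standard deviation measure spread",
    "sample variance uses bessel correction for spread",
    "pandas dataframe apply lambda row",
    "pandas groupby dataframe column count",
]

# Turn each document into a TF-IDF weighted bag of words.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Project the sparse TF-IDF vectors onto a small number of latent concepts.
svd = TruncatedSVD(n_components=2, random_state=0)
concepts = svd.fit_transform(tfidf)

# Pairwise similarity between documents in concept space.
similarity = cosine_similarity(concepts)
print(similarity.round(2))
```

Documents sharing vocabulary should end up close together in concept space, which is the basis for linking one resource to another.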

Managing my finances - How to link Spending and Location?

2017-11-01T22:00:00+00:00

Snapshot

I cover an ongoing project with the objective of creating a better budgeting and personal expenditure platform. I have created a program which takes all my credit and debit card statements and, by combining them with my phone’s location history, labels each transaction with a location and a type (eating-out, shopping…). This allows me to aggregate my expenses much more clearly and understand how I spend money. It also allows me to integrate all my bank accounts into one dashboard for analysis, giving me one platform for an overview of my finances. Below I give an overview of the initial implementation I have chosen to transform a simple bank statement into a more insightful set of numbers.

A link to the Git repo is here

Objectives and Reasons

I have a problem which comes from spending many years working in different locations: I spend in different places, in different currencies and with banks in different countries. In addition, personal budgeting has never been a forte of mine. I recently signed up to a bank called Monzo; it offers a prepaid card, and its app is really pretty nifty. One of its many features is that when you look at your expenses it categorizes them by type (shopping, eating-out etc.) and also geo-locates each expense. I find this pretty useful, as what I like to look at is a breakdown of my expenses and where I am spending the money.
Now, since Monzo is just a prepaid card, it cannot be integrated into my day-to-day banking. But I thought, why not try to achieve something similar with my own statements? Take my statements from online and then map to each expense its type as well as the location where I spent the money. This gives me the ability to better aggregate my expenses by location and by type, to carry out historical analysis of my expenses and, in general, to see my expenditure trends. A big issue is that I didn’t want to sit and manually label thousands of data points, so I wanted to achieve this with at least a semi-unsupervised technique.

Implementation

To map type and location, I used the credit card statement to derive the company’s name and then used my phone’s location history (downloaded straight from Google) to map approximate locations to these expenses. Then, based on an approximate location and the name of the company where I spent the money, I queried Google’s Places API to find establishments matching the criteria. It sounds simple, but it turned out to be a bit more involved to get good accuracy. There were two primary problems: firstly, descriptions in credit card statements are ambiguous (they rarely just state the company’s name); secondly, an expense is logged by day with no time associated with it, so it is unclear which of the on average 100 locations Google logged me at per day matches the expense (yes, over 6 months I had about 120 000 distinct coordinates, quite scary).
There are of course further problems, like what to do for online companies and what to do for companies with numerous establishments; there are quite a few Peet’s in San Francisco. Before I delve into the details I would just like to have a quick gripe: all this data, the exact location and type of business, is already in the systems of the banks, and it is quite infuriating not to have it on my statement. This seems like a lot of effort just to record and track my expenses.

Getting the company’s name

When you download the CSV file from your bank it has a general description of the expense which relates to your transaction, but it is not in plain English. The below examples are quite common:

Now you and I are able to decipher these because we are the ones who spent the money. But finding a systematic method for a computer to parse out which part of the description is the company is not quite as easy, and it relied on several layers of modeling and parsing techniques. In addition, in order for the system to be robust it needs to be able to disambiguate Amazon EU = Amazon UK = AMZEU = AMZiggolo. All these expenses are from Amazon and should be classified as online-shopping. I set myself the objective of minimizing the number of terms I would need to predefine. There are just too many variations and possible expenses, so it becomes important to be able to automatically extract similarities and to identify the part related to the company. I have a further constraint in that Google allows me about 2500 API calls per day, so I want to make sure each API call I make has the highest likelihood of getting a response; it would be impractical to query all variants of a name.
In order to achieve this I tried a purely heuristic model with certain rules based on the text. I did the following:

For each description, tokenize the text and score each token for its likelihood of being part of the company’s name:
Inspect the length of the token; if it is longer than 5 letters it is more likely to be a word.
Check if the token is in the dictionary (+1 point). It is at least a word.
Check if there are any embedded words in tokens longer than three letters. A token is more likely to represent a word if it contains a word itself, and if it is an actual word then it is more likely to form part of the company’s name.
Previous word frequency. A common expense, for example Starbucks, may appear on the statement as “Starbucks z0Jan17”, but the word Starbucks will have a higher frequency if I have often gone to Starbucks. The noise around the company’s name will change but the name will be the constant.
Previous phonetics. This is less important for company name identification, but more important in the disambiguation. I use fuzzy with depsan2 to create a phonetic representation for each word and see how often that phonetic representation occurs. Using phonetics is quite useful: for example AMZ = Amazon, since their 3-character phonetic representations are the same, and when looking for acronyms we often look for phonetic similarity.
Word structure: how many vowels and consonants. A token with 2 or more vowels and more consonants is more likely to represent a word.
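
To make the scoring rules above concrete, here is a minimal sketch in Python. It is not the code from the project: the function names, the one-point weight for every rule (only the dictionary rule is stated as +1 above), the tiny dictionary and my reading of "more consonants" as "at least as many consonants as vowels" are all assumptions for illustration. The phonetic-frequency rule is left out to keep it short.

```python
import re
from collections import Counter

VOWELS = set("aeiou")

def score_token(token, dictionary, history):
    """Score one token; every rule adds one point (an assumed weighting)."""
    word = token.lower()
    letters = [c for c in word if c.isalpha()]
    vowels = sum(c in VOWELS for c in letters)
    consonants = len(letters) - vowels
    score = 0
    if len(letters) > 5:                      # long tokens look like words
        score += 1
    if word in dictionary:                    # the token is itself a word
        score += 1
    if len(word) > 3 and any(                 # a shorter word is embedded in it
        len(w) >= 3 and w != word and w in word for w in dictionary
    ):
        score += 1
    if history.get(word, 0) > 0:              # seen on earlier statements
        score += 1
    if vowels >= 2 and consonants >= vowels:  # word-like letter structure
        score += 1
    return score

def guess_company(description, dictionary, history):
    """Return the highest-scoring token of a statement description."""
    tokens = re.findall(r"[A-Za-z0-9]+", description)
    return max(tokens, key=lambda t: score_token(t, dictionary, history), default="")

# Hypothetical inputs: a toy dictionary and counts of tokens seen before.
dictionary = {"star", "bucks", "jan"}
history = Counter({"starbucks": 3})
print(guess_company("Starbucks z0Jan17", dictionary, history))
```

With these toy inputs the token "Starbucks" outscores the noise token "z0Jan17", which is the behaviour the heuristics are meant to produce.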

Disambiguation

After step one I have the predicted company name, but the same company could have multiple representations, e.g. AMZ, Amazon, Amazon EU. To relate a name and its variants together I used a number of techniques, and if all 3 conditions were satisfied the company was declared a subset of the others:

Phonetic representation: I look at the phonetic representation of the first word and see if it matches any other in the company list.
Check if all the letters of one company’s name are in the other’s, for example {A,M,Z} is a subset of {A,M,A,Z,O,N}.
Check if the first letter is the same. Simple but pretty important.
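
The three conditions can be sketched as follows. This is an illustration under assumptions, not the project code: instead of the fuzzy package it uses a small hand-written, simplified Soundex-style key (the name phonetic_key is made up here), truncated to 3 characters.

```python
# Simplified Soundex-style codes: a stand-in for the phonetic library (assumption).
CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in group:
        CODES[ch] = digit

def phonetic_key(word, length=3):
    """First letter plus consonant codes, truncated to `length` characters."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:   # skip vowels and repeated codes
            key += code
        if ch not in "HW":          # H and W do not break a run of equal codes
            prev = code
    return key[:length]

def same_company(a, b):
    """True only if all three conditions from the text hold."""
    if not a.strip() or not b.strip():
        return False
    # 1) phonetic key of the first word matches
    phonetic = phonetic_key(a.split()[0]) == phonetic_key(b.split()[0])
    # 2) all the letters of one name appear in the other
    la = {c for c in a.lower() if c.isalpha()}
    lb = {c for c in b.lower() if c.isalpha()}
    letters = la <= lb or lb <= la
    # 3) same first letter
    first = a.strip()[0].lower() == b.strip()[0].lower()
    return phonetic and letters and first

print(same_company("AMZ", "Amazon"))    # the AMZ / Amazon case from the text
print(same_company("Amazon", "Apple"))
```

Requiring all three conditions is deliberately strict, so some genuine variants will still be missed and need another rule or a manual mapping.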

Location of the expenses

In essence, for the Google query I need to feed in a location and a company name. The problem is that my Google locations file contains maybe a hundred visited locations on a given day. So what I do (and this is crude, I wish to implement possibly a k-means approach) is pick one distinct location from the database of visited locations for the day of the expense (this should hopefully give me at least the city I am in). I send this along with the predicted company’s name to Google. Google responds with a JSON containing a number of results matching my query, each of which is a distinct company with GPS coordinates. I then calculate an array with each possible company and its distance to each of the locations I visited on the day of the expense. If Google has done a good job of tracking me, the candidate with the smallest distance will likely be the location I visited.
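
The final matching step can be sketched like this. It is a minimal illustration rather than the project code: it assumes the candidates have already been pulled out of the Google response as (name, latitude, longitude) tuples, uses the haversine great-circle distance, and the coordinates below are made up.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two GPS points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def closest_candidate(candidates, visited):
    """Pick the candidate whose smallest distance to any visited point is lowest."""
    return min(
        candidates,
        key=lambda c: min(haversine_km(c[1], c[2], lat, lon) for lat, lon in visited),
    )

# Hypothetical data: two branches of the same cafe and two tracked positions.
visited = [(37.7749, -122.4194), (37.7800, -122.4100)]
candidates = [("Cafe A", 37.7755, -122.4190), ("Cafe B", 37.8044, -122.2712)]
print(closest_candidate(candidates, visited)[0])
```

Taking the minimum over every visited point means a single good location fix near the shop is enough, even if most of the day was spent elsewhere.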

Further Improvement

This is an initial version of the code and there is much progress still to be made, particularly in the company name recognition. I plan to move to a Naive Bayes approach once I have developed a labelled corpus of text to parse. Naive Bayes will be my first go-to, since I am unlikely to develop an extensive corpus of labelled examples (as I will be making it myself). Below I have listed a few additional improvements which will come soon:

Dashboard
K-means clustering of locations, to decide GPS coordinates of location (replacing distinct). This is to overcome the problem of multiple cities/countries in a single 24hr period.
Reduce API queries by remembering previously visited locations and querying name and location.
Augment defined company types with existing databases.

Tools/Packages used

Python and standard numerical packages
Fuzzy (for phonetics)
Sqlite3 (local database system)

References

Using fuzzy for matching with sounds [http://www.informit.com/articles/article.aspx?p=1848528].

Structuring Data Science Part 1 (the Doc2Vec route)

2017-09-25T22:00:00+00:00

I have found an issue in my learning of data science: navigating through all the material is somewhat of a nightmare, especially when starting out. We are at a time where there is a prodigious volume of learning material out there, be it blogs, videos or open source papers. The problem I find is that there is almost too much material today. It is hard to know where to start, which subjects are connected to one another, where to go next and what is fundamental to understand about each subject. My project started as something quite simple: I was reading a paper in data science and, coming from an engineering background, I did not necessarily understand all the concepts required to fully understand the paper. Beyond this, I believe in canonical learning - going to the root - and doing this in data science is somewhat difficult.

The reason - and I will give my 2 cents regarding this - is that data science is an amalgamation of so many different domains that it is by its nature ill-defined and evolving. As such the subject is exceptionally difficult to interpret and access from the outside.

Fundamentally, I saw this huge reservoir of resources online but no good or structured way to navigate it: out of the hundreds of blogs, which should I read? One issue is that there is such a breadth of subjects that you can quickly become familiar with 100s of “Data Science” terms (Deep Learning, ANN, SVMs…) without having any degree of understanding of those concepts. Google is great when you have a specific question, but navigating a domain is more difficult: you don’t want the best match but the underlying web. Wikipedia is also a fantastic resource; however, it has too many options, it is too unconstrained and it is only text based.

My plan initially was to take a paper, write a program to extract its concepts and link it to videos which closely match those concepts. In addition, I wanted to construct a document hierarchy based on the concept similarity of documents.

As you will find out, my first attempt was maybe not the most fruitful in terms of end objectives, even if it taught me a great deal about the various techniques out there. I also saw a limitation in any set of methods which tries to be completely unsupervised. So if you just want to know the way that finally worked and allowed me to reach my objective, go to part 2.

What did I achieve

I hit one of my two objectives: I achieved document similarity and sensible document recommendations. What I did not achieve was good categorization of documents.

The Processing

Prior to all my stages of data science analysis, I had a massive if underwhelming data scraping phase, where I took around 2000 papers from arXiv which appeared under various search terms related to data science (data science, machine learning). In addition, using Selenium (there is a better way with youtube-dl, but I did not realize it at the time) I scraped YouTube for video transcripts, in all taking around 400 hrs of transcripts. After this came the exciting stuff… In essence, I used Doc2Vec to create a document vector for each paper or video transcript. Once I had these vectors I used recursive K-means clustering to divide them into sets, each time splitting each subset into smaller subsets, creating a hierarchy 5 levels deep. This left me with approximately 3000 subsets which were all linked to master nodes. The problem I faced is that this unsupervised technique of splitting my documents created a sort of concept hierarchy, but each of these concepts remained unnamed. To name the clusters I used a separate methodology: for each cluster I used TF-IDF combined with NMF. This yielded sets of words which defined concepts, and I used the top 5 words from the highest-ranked concept to define the name of the cluster.
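
The recursive splitting can be illustrated with a small, dependency-free sketch. It is not the code I ran: the function names are made up, the k-means is a bare-bones Lloyd's algorithm with a deterministic start (in practice a library implementation would be used), and the 2-dimensional vectors stand in for real Doc2Vec vectors.

```python
def dist2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iterations=20):
    """Bare-bones Lloyd's algorithm; returns a cluster index per point."""
    # Deterministic start: evenly spaced points become the first centroids.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels

def build_hierarchy(ids, vectors, k=2, depth=5):
    """Split a set of documents into k clusters, then split each cluster again."""
    node = {"docs": list(ids), "children": []}
    if depth == 0 or len(ids) <= k:
        return node
    labels = kmeans([vectors[i] for i in ids], k)
    groups = [[i for i, label in zip(ids, labels) if label == j] for j in range(k)]
    groups = [g for g in groups if g]
    if len(groups) > 1:  # stop when the split no longer separates anything
        node["children"] = [build_hierarchy(g, vectors, k, depth - 1) for g in groups]
    return node

# Toy stand-ins for document vectors (made-up values).
vectors = {"a": [0, 0], "b": [0, 1], "c": [10, 10], "d": [10, 11]}
tree = build_hierarchy(["a", "b", "c", "d"], vectors, k=2, depth=2)
print([child["docs"] for child in tree["children"]])
```

Each node keeps the full list of documents beneath it, which is what a naming step such as TF-IDF with NMF would then be run on.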

The reflections

By integrating two techniques, this methodology gave me an unsupervised way to cluster documents and also derive concepts from them. For me that is a pretty cool thing: given a corpus of documents which have some sort of underlying links between them, you can develop a hierarchy. It rests in essence on the assumption that at the top level documents are only similar at the most basic level (statistics, mathematics…), and that at deeper levels, as the cosine similarity between documents increases, the documents should share finer-grained subject matter.

Unfortunately the reality was that, although the document clustering worked, the documents were just a mess of concepts: there were too many links to create coherent clusters with this methodology. In addition, although in my biased (tired) moments I believed there to be coherent cluster names, the accuracy was not high enough. The clusters often didn’t make sense, or contained lower level subjects (inspect the network tree above). This sort of method, although nice and clean, was not working.

At the top is a D3 visualisation showing the hierarchy created. Below I also show some of the cases which work, as well as the inadequacy of the models when it comes to video recommendations. The percentage is the similarity score, but it is clear that 1) for videos, I had an insufficient corpus to derive good recommendations for each paper and 2) the model I set up did not do a good enough job at finding relationships between text content.

Also, while putting together this model I tried using LDA to derive topics from the overall corpus. There are again some sensible concepts, but the problem with this technique originated from my tokenization and parsing methodologies. To get this working, I need a much better method of excluding unimportant words and n-grams, in particular names. Introducing POS tagging and NER would be useful, but in the next strategy I will also look at applying a knowledge base to improve the quality of the concepts.

You can play with the results of LDA in the visualisation at the bottom of the page.

The forward plan

What I realised is that this method will ultimately achieve limited accuracy. Documents are not a hierarchy but an ontology; I actually believe documents should be treated like Twitter posts with multiple hashtags. An individual document sits in a web of concepts and is connected through multiple edges to these concepts. The weight of each edge depends on how closely the document resembles that concept. This is the way I needed to approach it: I needed to develop document ontologies, not single hierarchies. The power of this is that it also addresses my second objective. I believe that humans themselves are poor at defining and tagging documents, because it depends on where you come from. If you come from a more statistical background you will tag a document differently, and post it in a different journal or repository, than a computer scientist would, and these functional domain barriers create separation of content which is inherently similar. Therefore, describing an ontology through the topics covered in a document will allow for domains to be broken down and only their edges to be observed.
Well, this leads to my second iteration… to follow in a second blog post as I explore new methods to bring structure to the web of data science resources.
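
To illustrate the idea of a document attached to several concepts by weighted edges, here is a small sketch. It is only one possible reading of the plan, not an implementation from the project: it assumes documents and concepts live in the same vector space, uses cosine similarity as the edge weight, and the names, the threshold and the vectors are all made up.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def concept_edges(doc_vector, concepts, threshold=0.3):
    """Return {concept: weight} for every concept the document resembles enough."""
    weights = {name: cosine(doc_vector, vec) for name, vec in concepts.items()}
    return {name: w for name, w in weights.items() if w >= threshold}

# Hypothetical concept vectors and one document vector.
concepts = {
    "statistics": [1.0, 0.0, 0.0],
    "deep learning": [0.0, 1.0, 0.0],
    "databases": [0.0, 0.0, 1.0],
}
document = [3.0, 4.0, 0.0]
print(concept_edges(document, concepts))
```

Unlike a position in a tree, the result is a set of tags with strengths, so the same document can sit under both "statistics" and "deep learning".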

Tools used

-Sklearn was used for tokenization and the overall NLP pipeline.
-Selenium was used to scrape YouTube transcripts.
-PDF parsing and conversion to CSV was achieved using pdfminer.
-Visualisations were achieved using D3 and pyLDAvis.

All code available on my github: Click Here

House Prices in Edinburgh

2017-08-06T13:00:00+00:00