Alternative Finance Data for Emerging Markets: Natural Language Processing (Part III)

The following was written by CTO Malcolm Kapuza, who dives into some example code showing how alternative finance data in the Capital Finder is categorized, using machine learning and natural language processing.

As I mentioned in the first part of this three part series, the Capital Finder has many useful applications. In the second part, I discussed in detail where we source our different data points and how we turn this data into one coherent and intuitive view of developing world alternative finance. In this final part of the series, I walk readers through the steps — and even some of the actual code — that makes this all possible.

By the end of this blog, you will learn all of the high level steps for creating your own data pipeline, including some of the more detailed steps of using Machine Learning to build a Natural Language Classifier.

In order to try out the code examples yourself, you will need your own dataset to process. You can find datasets online to practice with; for example, the UCI Machine Learning Repository, or the NLTK python package, which also provides easily accessible data for you to work with.

Gathering alternative finance data

The first step in any data pipeline is gathering data. At AlliedCrowds, we have created a proprietary web scraper that makes use of the python asyncio library to make concurrent HTTP requests. This allows us to scale to hundreds of thousands of web pages per day on a standard machine.

Asynchronous functionality is important if you need to scale a process that is input/output bound. With web crawling, the process is bound by the network speed and your target website’s response time. The average time that it takes to load an uncached website is approximately 5 seconds. Needless to say 5 seconds per request is unacceptable, given that we will be making millions of such requests (e.g. 5 millions seconds is ~58 days).

Rather than making one request to one server and waiting for a response, we make as many requests as our CPU can process and our RAM can hold in memory. This means that we are no longer bound by slow websites. In other words, we have solved our input/output bind.

One caveat: with such a powerful webscraper, it is easy to overwhelm the server of the website that you are crawling. In order to be respectful, it is best to optimize your request queue so that you target multiple websites at once. For example, if you are crawling 60 pages per second, these will be on 60 different sites. This also helps speed things up, because websites tend to slow down as they become overwhelmed by traffic.

Asynchronous Crawling

Storing data

Once we scrape a webpage, we store that data (the HTML) in a cloud-based file store for safekeeping and faster accessing in the future. This saves us from having to recrawl every site when we would like to perform a new analysis.

For the rest of this tutorial, a local file system will suffice.

Cleaning data

Cleaning data is the least glamorous aspect of an effective data pipeline, and is often the most overlooked. However, it is a crucial step in the process.

The HTML tells us a surprising lot about the website in question. For example, HTML holds information about:

  • What language the site is written in
  • Where the site is based
  • What images appear on the page
  • What the website is about
  • What the title of the particular page is
  • What the title of the particular article is
  • What the subtitle of that article is
  • What words/phrases are bold
  • What other topics this topic links to

You may be surprised by how much depth HTML adds to your understanding of a webpage. HTML forms the basis for how websites are Search Engine Optimized, and therefore categorized, throughout the web. We rank these different HTML aspects according to our view on their importance. For the sake of this tutorial, however, we’ll simply pool all of the visible text and treat it as equal.

This brings us to our first bit of code. In our case we use the BeautifulSoup library to pull all of the visible text out of our raw HTML:

from bs4 import BeautifulSoup

from bs4.element import Comment

    def tag_visible(element):

        if in ['style', 'script', 'head', 'title', 'meta', '[document]']:

            return False

        if isinstance(element, Comment):

            return False

        return True

    def text_from_html(soup):

        texts = soup.findAll(text=True)

        visible_texts = filter(tag_visible, texts)

        return u" ".join(t.strip() for t in visible_texts)

The code is pretty self explanatory. The text_from_html method makes use of Beautiful Soup’s built in findAll method to find all text and then uses the tag_visible method to filter out any content that is not supposed to be read by the website visitor.

In this case, the reason that we care only about the visible text is that we want to ensure that our Natural Language Classifier is reading the page in the same way that a normal website user would be.

As mentioned above, in this tutorial, we do not considered the difference in importance between HTML tags. As an exercise, however,  think about how you might modify the code to account for the difference between a title tag and a paragraph tag, for instance.

Creating a Training, Test, and Validation Datasets

Before we can create a Natural Language Classifier we need a training dataset. A training dataset is used for learning. It is already classified, vetted and validated data. Think of this data as a group of examples that our classifier can draw inferences from. Training sets are difficult to obtain and there is no “shortcut” or easy way out. The bottom line is that you will need to manually (and accurately) classify a subset your data that can be used to inform your classifier. This process is easier than manually classifying the entire dataset, but it is no trivial task.

One of our most straightforward use cases for NLP is categorizing our capital providers by funding type, and we have found it to be dramatically more effective than our paid analysts by comparing the rate of false positive vs. false negatives against known data.

Analysts Machine Learning Algorithm
False Positives 50% 10%
False Negatives 25% 5%

Table 1: Above compares false positive and false negatives in AlliedCrowds country categorization data. We have hundreds of thousands of country and capital provider pairings in our database and this task proved to be too onerous for our analysts. Our results improved dramatically when we switched to a Natural Language Classifier. 

One common question is how large your training set should be. The answer really depends on the task at hand. A simple enough task could have a training with one data point per group. More complex problems could require thousands of data points to differentiate between groups. As a rule, the more complete the training data, the better the the classifier performs. Therefore, you may trial increasing training dataset sizes until you are happy with results.

We gathered our training set via a painstaking process of viewing and reviewing a subset of 6000 of our capital providers until we were certain that it was 100% accurate.

Once you have your training set, however, the tools to process the texts are very accessible, straightforward, and actually fun to use! The training set is really the last obstacle.

One of the issues that can arise from fitting a classifier to a training set is overfitting. This means that the classifier performs well at picking up the nuances of our training set, but is not good at generalizing those to the larger population. Thankfully there is a very simple solution to this and that is to split your data in half, creating a test dataset.

A test dataset is independent of the training dataset, and so needs to be randomly selected. Ideally, you will randomly select it each time you retrain and test your classifier. If your model fits the training dataset and also fits the test dataset, then you can be confident that there is minimal overfitting and that it is properly generalized to the population. If not, it means that your model is overfitting the training data and you may need to either tweak some parameters or increase the size and quality of your training data.

As a final note, I will briefly touch on validation data. Validation data is used in order to tweak hyperparameters (which are out of the scope of this blog). Hyperparameters, as the name suggests are a sort of meta parameter that can help fine tune your model. One of the issues that happens through the training/testing cycle is overfitting hyperparameters. The validation dataset ensures that you are not overfitting these hyperparameters in the same way the testing data ensures that you are not overfitting your model parameters.

Natural Language Processing (NLP)

Once we have our training and test datasets, we are ready to train and test our Natural Language Classifier. To guide you through this tutorial, we will use the example mentioned above of classifying capital providers by provider type. We will simplify things by comparing microfinance institutions (MFIs) with venture capital (VC) firms.

In human terms, we are going to show our classifier a bunch of website text from MFIs’ websites and a bunch of text from VCs’ websites and allow it to “figure out” the difference. A full explanation of what is going on under the hood is out of the scope of this tutorial, but the classifier essentially uses a series of guesses and checks to determine the main differences.

Once our classifier has been trained, we can test it against our test data to see how well it learned. If we are satisfied with the classifier’s test performance, we can begin using it in production. We will also show you some simple ways to debug the classifier if things do not go as expected.

We will be using tools provided by the gensim library as well as the Natural Language Toolkit (NLTK). You do not need to know these libraries to follow along, I will explain each of the methods used.

Before we start, we will create a generator to build the texts. The generator avoids us having to store our large datasets in memory, allowing for retrieval of our texts on demand without clogging up resources.

import gensim

num_rows = 50

def build_texts():

    for subdirectory in os.listdir(f"/Directory/"):

        with open(f"/Directory/{subdirectory}/text.txt") as f:

            doc = " ".join([x.strip() for x in islice(f, num_rows)])

            yield gensim.utils.simple_preprocess(doc, deacc=True, min_len=3)      

print("Text generator initialized...")

In this example, Directory is the directory that your text is stored. We are iterating through each subdirectory in this directory, which should correspond to an object that you are interested in classifying. One thing to note here is that we are taking the first 50 rows of text from each file to make processing a bit faster. In production you will want to use the entire text data.


Although we have our raw text, there is still a bit more preprocessing that we would like to do before training our classifier.

The first step will be to create a set of the most common bigrams. Bigrams are simply a list of two adjacent words. We are interested in identifying common bigrams because these are very powerful features within our text data that help us determine overall meaning. For example, “Microfinance Institution” and “Venture Capital” are both common bigrams that tell us a lot about the meaning of the text. We can draw more conclusions about a text if we see “Venture Capital” as a single feature, than if we can only see “Venture” and “Capital” as separate words.

To do this, we first create a list from our generator and then we let the gensim package take care of the rest. You will find with a lot of NLP work the bulk of the heavy lifting is gathering, cleaning and processing the data and that common packages handle the nuances of the Natural Learning Processing itself:

train_texts = list(build_texts())

bigram_phrases = gensim.models.Phrases(train_texts, common_terms=stops)

bigram = gensim.models.phrases.Phraser(bigram_phrases)

 We can then test some common bigrams that should appear:

assert (bigram['microfinance', 'institution'][0] == "microfinance_institution")

assert (bigram['venture', 'capital'][0] == "venture_capital")

assert (bigram['solar', 'panel'][0] == "solar_panel")

assert (bigram['big', 'data'][0] == "big_data")

assert (bigram['united', 'states'][0] == "united_states")

Now that we have our bigrams trained, we are going to create a dictionary to store the provider name, provider type, and text. In this example, to keep things simple, we store a text file called provider_type.txt with the name of the provider type in the same directory as text.txt.

While creating this dictionary, we are going to do some preprocessing, as well, in order to make the text more actionable for our classifier. I will walk through and explain the significance of each of the preprocessing steps after the jump.

import nltk

from gensim.parsing.preprocessing import STOPWORDS

stops = set(stopwords.words('english')).union(set(STOPWORDS))

documents = []

    for subdirectory in os.listdir(f"/Directory/"):

        with open(f"/Directory/{subdirectory}/text.txt") as f:

            with open(f"/Directory/{subdirectory}/provider_type.txt") as pt:

                provider_type = pt.readline()

            provider = dict(name=subdirectory, provider_type=provider_type)

            text = " ".join([x.strip() for x in islice(f, num_rows)])

            tokens = [word.lower() for word in nltk.word_tokenize(text) if word.isalpha() and word.lower() not in stops]

            tokens = bigram[tokens]

            provider['tokens'] = tokens


First, we use the word_tokenize method from nltk that breaks up the string of text into words and punctuation.

We also use the isalpha method that checks if the word is alphabetic characters only, because this is all that we are interested in.

The next bit of preprocessing is to eliminate stopwords. Stopwords are the list of most common words within a lexicon, like “the”, “he”, “she”, “and”. We often don’t care about these words and they can have a negative impact on our analysis.

Next, we gather the bigrams from within the list of tokens and finally we add the tokens to our provider dictionary and we add that dictionary to our list of documents, which will later be split into our test and training datasets.

Feature Sets

Next, we want to create what are called feature sets. Feature sets are lists of features (in this case word segments) that allow our model to quantify the text.

For our features, we have chosen to use the 100 most common words, excluding any words three characters or fewer, across all documents. We make use of the nltk’s FreqDist class to identify these and have manually identified two additional features of “venture” and “micro”. These will be particularly useful for differentiating between MFIs and VC firms.

custom_word_features = ['venture', 'micro']

num_features = 100

all_words = nltk.FreqDist(word.lower() for provider in documents for word in provider['tokens'])

word_features = [word for (word, freq) in all_words.most_common(num_features) if len(word) > 3]

word_features = word_features + custom_word_features

We create a method called document_features that returns the features for a given document.

def document_features(document):

    document_words = set(document['tokens'])

    features = {}

    for word_feature in word_features:

        contains_feature = False

        for document_word in document_words:

            if word_feature in document_word:

                features[f"contains({document_word})"] = True

                contains_feature = True               

        features[f"contains({word_feature})"] = contains_feature

  return features

The above method may seem confusing, due to the nested for loops, but it’s really quite straightforward. In it, we are determining if the features (word segments) that we have identified are contained within the tokens of the analyzed document. If they are, then we add that feature to the list of features within the document, if not we exclude it as a feature. Once we have a list of features for each document, we can compare these features in order to determine which features are most applicable to each provider type (MFI or VC).

Training our classifier

Finally, we get to train our classifier!


featuresets = [(document_features(d), provider_type) for d in documents if d[‘provider_type’] in provider_types]

num_featuresets = len(featuresets)

train_set, test_set = featuresets[:math.floor(num_featuresets/2)], featuresets[math.floor(num_featuresets/2):]

classifier = nltk.NaiveBayesClassifier.train(train_set)

First, we shuffle our documents, so that we can ensure we divide them randomly between training and testing datasets.

Next, we use our document_features method above to build our featuresets, which is a tuple of document features and provider type.

Then wes divide our test sets evenly and run classifier = nltk.NaiveBayesClassifier.train(train_set). There are many different implementations of Natural Language Classifiers, but for our purposes the NaiveBayesClassifier provided by nltk works just fine and could not be easier to implement.

It is somewhat difficult to grasp that 62 lines of code in this example are dedicated to cleaning, processing and preprocessing data and that only 1 line is used for actually training the classifier, but this is the nature of data science.

Now, time to test our new classifier on our test data:

print("Classifier accuracy percent:", (nltk.classify.accuracy(classifier, test_set))*100)

This tells us the accuracy of our classifier when applied to our test data. An example of output, which is what we get when we run ours is:

Classifier accuracy percent: 91.28367670364501

This let’s us know that our classifier is about 91% accurate. This is good, but we believe we can do better. Let me show you a couple of common ways to debug this process.

One example is to show the most informative features:


This will show us the 150 most important aspects that our model used for determining its classifications. These should pass the “eyeball test”. For example, we would expect anything with “micro” or “venture” to be extremely informative features. An example of output:

Most Informative Features
contains(savings_account) = True           Microf : Ventur =     67.5 : 1.0
contains(account_opening) = True           Microf : Ventur =     50.1 : 1.0
 contains(savings_loans) = True           Microf : Ventur =     42.5 : 1.0
  contains(micro_credit) = True           Microf : Ventur =     38.8 : 1.0
contains(microfinance_bank) = True           Microf : Ventur =     38.0 : 1.0
contains(personal_loans) = True           Microf : Ventur =     35.0 : 1.0
      contains(startups) = True           Ventur : Microf =     31.4 : 1.0
   contains(marketplace) = True           Ventur : Microf =     31.1 : 1.0
contains(savings_products) = True           Microf : Ventur =     30.1 : 1.0
contains(loan_repayments) = True           Microf : Ventur =     27.4 : 1.0
contains(portfolio_companies) = True           Ventur : Microf =     26.8 : 1.0
contains(loan_application) = True           Microf : Ventur =     24.4 : 1.0
 contains(asset_finance) = True           Microf : Ventur =     23.6 : 1.0
contains(micro_businesses) = True           Microf : Ventur =     23.6 : 1.0
contains(savings_accounts) = True           Microf : Ventur =     23.3 : 1.0

This example indeed passes the eyeball test with each feature corresponding to the correct category.

contains(savings_account) = True Microf : Ventur = 67.5 : 1.0 

Tells us that if the text contains “savings account” it is 67.5 times more likely to be a Microfinance Institution than a Venture Capital firm. Pretty cool, but we haven’t figured out why our classifier is only 91% accurate.

An example of a slightly deeper analysis you can perform is the following, which prints the provider name, provider type and first 50 tokens. Through this analysis, we have actually found that the classifier is correct and our testing data is wrong!

for d in documents:

      featureset = (document_features(d), provider_type) 

            if classifier.classify(featureset[0]) == featureset[1]:







We can clearly see in the following output a couple of reasons why some of the capital providers are misclassified. For example, the first two are written in foreign languages. This teaches us that we should only try to classify texts after grouping by languages. The final example shows that the website was not properly scraped, because it requires javascript to load.

iSGS Investment Works
['Venture Capital']
['value', 'team', 'company', 'access', '五嶋一人', 'kazuhito', 'goshima', '代表取締役', '代表パートナー', '代表取締役', '佐藤真希子', 'makiko', 'sato', '取締役', '代表パートナー', '新卒一期生', 'mvp', '取締役', '代表パートナーに就任', '菅原敬', 'kei', 'sugawara', '取締役', '代表パートナー', '現アクセンチュア', 'ジャパン', 'investor', 'が発表した', 'executive', 'team', 'のインターネットセクター', 'best', 'cfo部門', '取締役', 'company', '社名', '株式会社isgs', 'インベストメントワークス', 'isgs', 'investment', 'works', '所在地', '資本金', '取締役_代表取締役', '代表パートナー', '五嶋一人', '取締役', '代表パートナー', '佐藤真希子', '取締役']

['Venture Capital']
['من', 'نحن', 'الوظائف', 'اتصل_بنا_english_العودة', 'إطلاق', 'مبادرة', 'تك', 'سباركس', 'نحن_فخورون', 'بأن', 'نعلن', 'لكم', 'عن', 'إطلاق', 'آخر', 'مبادراتنا', 'لدعم_الشركات', 'الناشئة', 'والرياديين', 'في', 'المنطقة', 'وهي', 'مبادرة', 'تك', 'سباركس', 'تهدف', 'المبادرة', 'لمساعدة', 'الرياديين', 'لبناء', 'مشاريعهم', 'والإرتقاء', 'بها', 'من', 'خلال', 'عرض', 'مقابلات', 'ملهمة', 'مع', 'شخصيات', 'بارزة', 'في', 'عالم', 'الريادة', 'من', 'موجهين', 'وأكثر', 'الهدف', 'الرئيسي', 'لمبادرة']

Al Tamimi Investments
['Venture Capital']
['javascript_required', 'enable_javascript', 'allowed', 'page']

If your classifier is not performing as you expect it, have a look at which examples it gets wrong and see if there are any common trends!


I hope you enjoyed this NLP walkthrough and it helps you or your organization make a bit more sense of Machine Learning, NLP, and Artificial Intelligence concepts.

If you’re interested in working on these kinds of projects please check our careers site, or email me at malcolm [at]!