Data Collection and Entrepreneurs in Emerging Markets


Data. The word alone is enough to stir up some sort of reaction. Regardless, we can’t live without it. It drives our economies, our political systems, our societies. Indeed, the field of data collection and analysis shows no signs of slowing down.

However, with its rise comes a number of obstacles that can affect businesses, governments, and NGOs; these concerns are especially relevant to SMEs that are fundraising or putting together a business plan.

So what are some of the current issues in data collection and data extraction; what do they mean for entrepreneurs and SMEs, particularly those in emerging markets; and what solution does AlliedCrowds provide?

Challenges in Data Collection

Data quality. The larger the dataset, the more susceptible it is to low-quality data. This includes:

Noisy data: data with large amounts of additional, useless information. For example, suppose I am a social entrepreneur seeking data from an NGO on the number of Syrian refugees in Amman, but the data I receive covers the total number of refugees across the Arab world. Whilst this data may be valuable, it is meaningless to me, as I have a specific focus.

Poor representation in data sampling, which increases the likelihood of sampling errors: Let us assume that I want to conduct a study on the average return on investment (ROI) for cryptocurrencies. I reach out to 100,000 crypto traders, but only get responses from 20,000, and this group reports an average ROI of 65%. From this data, I conclude that gains for cryptocurrency securities are particularly strong. What I don’t realize is that the 80,000 traders who did not respond experienced average losses of 40%. My data is therefore unreliable.

Expensive to manage. The cost of tools such as database management systems and data mining software can effectively price out developing countries. This creates information asymmetry: entrepreneurs in these environments are forced to make do with low-quality data, which affects their ability to seek funding, the quality of their business plans, and other factors critical to their success.

The Effect on Entrepreneurs and SMEs

Where does one begin? Due to infrastructural voids, knowing where to start looking for data is a challenge in itself. Even if one does know, getting access to reliable resources that accommodate data needs is an obstacle. For instance, good internet speed is paramount to ensuring that data needs are met efficiently.

However, according to content delivery network Akamai Technologies, the highest average internet bandwidth speeds in Africa are in Kenya (12.2 Mbps), South Africa (6.7 Mbps), Morocco (5.2 Mbps), Nigeria (3.9 Mbps), and Namibia (2.9 Mbps). This significantly affects the ability of entrepreneurs and SMEs to compete with those located in Norway (23.5 Mbps) and Sweden (22.5 Mbps), two of the countries with the highest average internet bandwidth speeds.

Cost factors: Storage costs have certainly come a long way since the 1960s, when the price of 1GB of data was over $1 million. Today that price is around $0.02; at scale, however, storage costs can still be prohibitively expensive. Cloud storage company Nasuni lists the average cost to store 1TB (1,000GB) of file data as $3,351 per year.

For local backup, hardware, and replication to another site, the price is $4,000, and since it is typical for companies to hold three copies of data, the price can be as much as $17,872 per TB ($17.87 per GB). In Nigeria, where GNI per capita is $2,080, or in Kenya, where GNI per capita is $1,440, such costs are prohibitive for most SMEs.

Without access to quality data, the standard of business plans, pitches, and financial statements that entrepreneurs produce suffers. For instance, when carrying out market research into potential target audiences, poor representation in data sampling will skew an entrepreneur’s findings. So will the lack of older or comparable data.

So we see how these difficulties have serious, wide-reaching implications for entrepreneurs and SMEs. Indeed, the inaccuracies that may be present in large datasets can stifle significant investment in research and development, especially when one considers the costs involved. Entrepreneurs therefore need access to an economical resource that provides accurate, easily accessible data that meets their needs.

Where does AlliedCrowds come in?

The Capital Finder tool is an interactive database that is like no other in the development space. It has been built to address an information gap for SMEs, and takes into consideration the challenges outlined above.

We start by using our proprietary web scraper to gather relevant data from hundreds of thousands of websites. This data is then stored in the cloud for safekeeping and accessibility. Next, and importantly, the data is cleaned, and then used to train, test, and validate our models to ensure reliability. This process saves our clients considerable time and resources, allowing them to focus on growing their businesses.

Entrepreneurs are given free access to over 7,000 capital providers that invest in emerging markets, and can generate custom reports on these providers in an easily shareable format. We are constantly working to ensure our data is as accurate, complete, and up-to-date as possible.


Importantly, our database does not require fast internet bandwidth speeds. Many of our clients operate in conditions with limited or unreliable internet access, and so we build our database mobile-first. This has the added benefit of reducing data consumption costs.

Other great features include keyword and phrase searching, which lets users enter terms like “agri-tech” or “software development” and runs a bespoke matching algorithm to identify the capital providers most relevant to their needs.

This allows entrepreneurs to research the investment ecosystems across developing countries, both for the purpose of market research, and to identify potential funders for their business. The Capital Finder database is reliable, easy to access, and not bandwidth-heavy, making it a perfect solution for entrepreneurs in emerging markets.

Have more questions about the Capital Finder? Take a look at our demo, or get in touch.

This article was written by David Alfred-Olufeyimi.

Image: “VIP tour to the innovation hub (the iHub), Nairobi, Kenya” by ITU Pictures is licensed under CC BY 2.0

Nigerian Entrepreneurs: A Fundraising Case Study


The Capital Finder helps organizations support entrepreneurs and SMEs seeking financing by providing access to unique funding opportunities. As a new intern at AlliedCrowds, I wanted to put the Capital Finder to the test. So, I decided to invent a peri-urban Nigerian cassava farming company called Cassava Inc. and see how many funders I could identify using the Capital Finder. In this blog, I will outline my journey as one of many Nigerian entrepreneurs seeking funding for their company.

Overview of SME Funding for Nigerian Entrepreneurs

Nigeria is a vibrant, diverse place full of opportunity, making it well-suited for entrepreneurs. As a Nigerian myself, I wanted to explore more of what my country has to offer in the way of business opportunities. What I found is a growing startup space where fashion and tech are just some of the burgeoning industries contributing to the country’s economic growth and productivity. Agriculture, a mainstay of the Nigerian economy, is one of the more rapidly growing and innovative industries, with companies similar to Cassava Inc. creating technology-driven solutions that are introducing new farming techniques.

Unfortunately, there are problems facing SMEs that are often inhibitive; these problems affect not just Nigerian entrepreneurs, but many founders around the world. Poor access to funding restricts entrepreneurs’ ability to raise capital for small businesses, and underdeveloped infrastructure and government inefficiencies exacerbate this problem. Knowing where to begin searching for capital is difficult, and class and nepotism are tough obstacles to surmount without access to the key players in any industry. With all of these early obstacles, it seems that, despite its promise, finding funding for Cassava Inc. is going to be a challenge.

The Solution: The Capital Finder

Cassava Inc. is a business with $650,000 (₦235.95 million) in annual turnover seeking $500,000 (₦181.5 million) in equity funding for a cassava processing plant in northern Nigeria. Traditional methods of processing cassava produce inconsistent results, and Cassava Inc. is looking to introduce a number of process controls that will lead to improved cassava products that are more competitive in the international market.

On my fund-seeking journey, I initially ran a Google search on “cassava SME investors in Nigeria”. After going through several pages of search results, I found myriad articles on cassava’s importance to Nigerian agriculture and the commitment of various organisations to boosting cassava-related investment, but I could not find a single investor, much less one tailored to a startup like Cassava Inc. I ran numerous additional searches, combing through hundreds of pages using search terms like “agriculture VC Nigeria”, “agriculture angel investors Nigeria”, “farming VC Nigeria”, “funding Nigerian entrepreneurs”, and about fifteen more. In total, I spent two hours looking for funders on Google, resulting in just two viable options. Google is a great resource for getting an overview of an industry, but given the peculiarities of finding relevant funders in the developing world, it turned out to be a poor resource for funding.


Using the Capital Finder, I ran a search tailored to Cassava Inc. I searched the agriculture sector and was able to enter keywords such as “cassava” to further refine my search. The Capital Finder’s bespoke matching algorithm searches through over 7,000 capital providers to surface the platforms most relevant to the cassava industry in Nigeria.

From the list provided, I was able to extract five capital providers relevant not only to my sector and country, but also to my funding type and funding size! The Capital Finder also pulled out the context in which “cassava” was mentioned. For instance, my top search result, Sahel Capital Agribusiness Managers Limited, “is a private equity firm focused exclusively on the Nigerian agribusiness sector” whose website features an article on cassava as a staple Nigerian crop. Other results included:

  • Grow Africa – a joint initiative between the African Union, the New Partnership for Africa’s Development, and the World Economic Forum with a sector focus that includes the rice and cassava business in Nigeria
  • African Finance Corporation – an infrastructure investment firm with a sector focus on natural resources, including “arable land”
  • Verod Capital Management – a private equity firm active across various sectors, including agriculture
  • Synergy Capital Managers – a private equity firm focused on making investments in select high-growth sectors in Nigeria and Ghana, including agri-processing.

Other results included CardinalStone, an investment bank investing in key growth sectors of Nigeria’s economy, and Kaizen Venture Partners, a venture capital firm whose sector focus includes agribusiness. These providers make minimum investments of $2 million and $3 million, respectively, and therefore would be good options when Cassava Inc. grows.

A similar search on a search engine would be far more exhausting, providing generic results on anything and everything related to “cassava.” For entrepreneurs, time is scarce, and I was able to find the investors above in a matter of minutes, not hours. This allows entrepreneurs to focus on what they do best: running their company, not looking through search results in the hope of finding a suitable investor.

So, now that I know who the investors are in the cassava space, my next step is actually securing funding for the farm. This can be a challenge for Nigerian entrepreneurs (and others) who haven’t raised money in the past. Fortunately, AlliedCrowds’ Entrepreneur Hub is a customisable platform that provides users with the informational tools to unlock private capital.

Have more questions about the Capital Finder or Entrepreneur Hub? Get in touch. You can also learn more about SME access to finance in Nigeria in our report on the topic here.

Written by David Alfred-Olufeyimi.

Image: “Fabricated cassava processing machines” by IITA is licensed under CC BY-NC

Generating PDF Reports Programmatically in Python Using API Data

The following was written by CTO Malcolm Kapuza, who dives into some example code showing how to use the Capital Finder API to create the alternative finance directories we put out on a regular basis.

In this blog we’re going to walk through how we generate our alternative finance directories programmatically. If you are not familiar with these, you can check out our reports page or view our latest directory on cryptocurrencies in emerging markets here. These directories use our Capital Finder API as a data source, but present the data in a visually appealing, digestible, and shareable way. This has been very useful in allowing our clients to understand the kind of data that we have without having to query our API directly.

These directories have been very popular and have proven to be a great way to share our data. If you are reading this, you may be sitting on a lot of valuable data but struggling for a way to engage your audience. Reports are a concise and aesthetically pleasing way to do this.

Your data will undoubtedly be different from ours. However, we hope that this tutorial demystifies the process and gives you ideas for how you can use this process within your own organization to create visually appealing reports and engage your audience.

I have embedded a Jupyter notebook below to better narrate the explanation. Please have a look at the example directory generated from the code here before reading the notebook, as it will make following along much easier. Also, you can find the HTML used to template the report at the bottom of the page.

If you are interested in trying this notebook out using our data, please email us at capital.finder (at) and we will supply you a temporary key that will allow you to generate your own unique directories for use within your organization.
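To give a flavour of the general approach before you open the notebook, here is a minimal sketch of rendering API data into a PDF. It assumes the jinja2 and weasyprint libraries and a hypothetical template file name; the notebook below is more complete, but the shape is the same: fetch data, render it into an HTML template, convert the HTML to a PDF.

import jinja2
import weasyprint

# Example data of the kind the Capital Finder API returns (illustrative only)
data = {
    "title": "Example Alternative Finance Directory",
    "countries": [("Kenya", 12), ("Nigeria", 9)],
}

# Load an HTML template (like the gist at the bottom of this page);
# the file name here is a placeholder
with open("directory_template.html") as f:
    template = jinja2.Template(f.read())

# Render the data into HTML, then convert the HTML into a PDF
html = template.render(**data)
weasyprint.HTML(string=html).write_pdf("directory.pdf")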

Report Table Screenshot

The following table is referenced in the Jupyter Notebook below.


Figure 1: Example output for semi-active and active country table

Jupyter Notebook

This Jupyter Notebook details how to generate PDF reports programmatically in Python using our Capital Finder API.


This gist contains the HTML used for templating the report generated in this blog.

Alternative Finance Data for Emerging Markets: Natural Language Processing (Part III)

The following was written by CTO Malcolm Kapuza, who dives into some example code showing how alternative finance data in the Capital Finder is categorized, using machine learning and natural language processing.

As I mentioned in the first part of this three-part series, the Capital Finder has many useful applications. In the second part, I discussed in detail where we source our different data points and how we turn this data into one coherent and intuitive view of developing world alternative finance. In this final part of the series, I walk readers through the steps — and even some of the actual code — that make this all possible.

By the end of this blog, you will have learned all of the high-level steps for creating your own data pipeline, including some of the more detailed steps of using machine learning to build a Natural Language Classifier.

In order to try out the code examples yourself, you will need your own dataset to process. You can find datasets online to practice with; for example, the UCI Machine Learning Repository, or the NLTK Python package, which also provides easily accessible data for you to work with.

Gathering alternative finance data

The first step in any data pipeline is gathering data. At AlliedCrowds, we have created a proprietary web scraper that makes use of the Python asyncio library to make concurrent HTTP requests. This allows us to scale to hundreds of thousands of web pages per day on a standard machine.

Asynchronous functionality is important if you need to scale a process that is input/output bound. With web crawling, the process is bound by the network speed and your target website’s response time. The average time that it takes to load an uncached website is approximately 5 seconds. Needless to say, 5 seconds per request is unacceptable given that we will be making millions of such requests (1 million requests at 5 seconds each is 5 million seconds, or roughly 58 days).

Rather than making one request to one server and waiting for a response, we make as many requests as our CPU can process and our RAM can hold in memory. This means that we are no longer bound by slow websites. In other words, we have solved our input/output bind.

One caveat: with such a powerful web scraper, it is easy to overwhelm the server of the website that you are crawling. In order to be respectful, it is best to optimize your request queue so that you target multiple websites at once; for example, if you are crawling 60 pages per second, these should be spread across 60 different sites. This also helps speed things up, because websites tend to slow down as they become overwhelmed by traffic. A minimal sketch of this pattern follows.
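Here is a minimal sketch of concurrent fetching, assuming the aiohttp library (our production scraper is considerably more involved, with queueing, retries, and politeness controls):

import asyncio
import aiohttp

async def fetch(session, url):
    # One request; failures return None so a slow or dead site
    # never blocks the rest of the batch
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            return url, await resp.text()
    except Exception:
        return url, None

async def crawl(urls):
    # Fire all requests concurrently and gather the responses
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

# Spreading each batch across different hosts keeps the load
# on any single server low
pages = asyncio.run(crawl([
    "https://example.com",
    "https://example.org",
    "https://example.net",
]))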


Storing data

Once we scrape a webpage, we store that data (the HTML) in a cloud-based file store for safekeeping and faster access in the future. This saves us from having to recrawl every site when we would like to perform a new analysis.

For the rest of this tutorial, a local file system will suffice.
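As an illustration, a small helper along these lines will do (the directory layout is an assumption, chosen to match the /Directory/ structure used later in this post):

import os

def store_page(provider_name, html):
    # One subdirectory per provider, mirroring the layout
    # used in the cleaning step below
    directory = f"/Directory/{provider_name}"
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "page.html"), "w") as f:
        f.write(html)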

Cleaning data

Cleaning data is the least glamorous aspect of an effective data pipeline, and is often the most overlooked. However, it is a crucial step in the process.

The HTML tells us a surprising amount about the website in question. For example, HTML holds information about:

  • What language the site is written in
  • Where the site is based
  • What images appear on the page
  • What the website is about
  • What the title of the particular page is
  • What the title of the particular article is
  • What the subtitle of that article is
  • What words/phrases are bold
  • What other topics this topic links to

You may be surprised by how much depth HTML adds to your understanding of a webpage. HTML forms the basis for how websites are search engine optimized, and therefore categorized, throughout the web. We rank these different HTML aspects according to our view of their importance. For the sake of this tutorial, however, we’ll simply pool all of the visible text and treat it as equal.

This brings us to our first bit of code. In our case we use the BeautifulSoup library to pull all of the visible text out of our raw HTML:

from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    # Ignore text inside tags that a visitor never sees
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Ignore HTML comments
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(soup):
    # Find every text node, then keep only the visible ones
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

The code is pretty self-explanatory. The text_from_html method makes use of Beautiful Soup’s built-in findAll method to find all text, and then uses the tag_visible method to filter out any content that is not supposed to be read by the website visitor.

In this case, the reason that we care only about the visible text is that we want to ensure that our Natural Language Classifier is reading the page in the same way that a normal website user would be.

As mentioned above, in this tutorial we do not consider the difference in importance between HTML tags. As an exercise, however, think about how you might modify the code to account for the difference between a title tag and a paragraph tag, for instance. A usage sketch follows.
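To tie this back to the storage layout above, a short usage sketch (the file path is hypothetical):

# Load a stored page and extract its visible text
with open("/Directory/example_provider/page.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

print(text_from_html(soup))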

Creating Training, Test, and Validation Datasets

Before we can create a Natural Language Classifier we need a training dataset. A training dataset is used for learning. It is already classified, vetted, and validated data. Think of this data as a group of examples that our classifier can draw inferences from. Training sets are difficult to obtain and there is no “shortcut” or easy way out. The bottom line is that you will need to manually (and accurately) classify a subset of your data that can be used to inform your classifier. This process is easier than manually classifying the entire dataset, but it is no trivial task.

One of our most straightforward use cases for NLP is categorizing our capital providers by funding type, and by comparing the rates of false positives and false negatives against known data, we have found it to be dramatically more effective than our paid analysts:

                   Analysts    Machine Learning Algorithm
False Positives      50%                 10%
False Negatives      25%                  5%

Table 1: The above compares false positives and false negatives in AlliedCrowds country categorization data. We have hundreds of thousands of country and capital provider pairings in our database, and this task proved too onerous for our analysts. Our results improved dramatically when we switched to a Natural Language Classifier.

One common question is how large your training set should be. The answer really depends on the task at hand. A simple enough task could get by with a training set of one data point per group; more complex problems could require thousands of data points to differentiate between groups. As a rule, the more complete the training data, the better the classifier performs. You may therefore trial increasingly large training datasets until you are happy with the results.

We gathered our training set via a painstaking process of viewing and reviewing a subset of 6000 of our capital providers until we were certain that it was 100% accurate.

Once you have your training set, however, the tools to process the texts are very accessible, straightforward, and actually fun to use! The training set is really the last obstacle.

One of the issues that can arise from fitting a classifier to a training set is overfitting. This means that the classifier performs well at picking up the nuances of our training set, but is not good at generalizing them to the larger population. Thankfully, there is a very simple solution: split your data in half, creating a test dataset.

A test dataset is independent of the training dataset, and so needs to be randomly selected. Ideally, you will randomly select it each time you retrain and test your classifier. If your model fits the training dataset and also fits the test dataset, then you can be confident that there is minimal overfitting and that it is properly generalized to the population. If not, it means that your model is overfitting the training data and you may need to either tweak some parameters or increase the size and quality of your training data.

As a final note, I will briefly touch on validation data. Validation data is used to tweak hyperparameters (which are beyond the scope of this blog). Hyperparameters, as the name suggests, are a sort of meta parameter that can help fine-tune your model. One of the issues that arises through the training/testing cycle is overfitting hyperparameters. The validation dataset ensures that you are not overfitting these hyperparameters, in the same way the testing data ensures that you are not overfitting your model parameters. A sketch of a three-way split follows.
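As a rough sketch of such a three-way split (the 60/20/20 proportions are an assumption on my part; later in this post we use a simple 50/50 train/test split):

import random

# featuresets is the list of (features, label) pairs built later in this post
random.shuffle(featuresets)

n = len(featuresets)
train_set = featuresets[:int(0.6 * n)]                    # fit model parameters
validation_set = featuresets[int(0.6 * n):int(0.8 * n)]   # tune hyperparameters
test_set = featuresets[int(0.8 * n):]                     # final accuracy check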

Natural Language Processing (NLP)

Once we have our training and test datasets, we are ready to train and test our Natural Language Classifier. To guide you through this tutorial, we will use the example mentioned above of classifying capital providers by provider type. We will simplify things by comparing microfinance institutions (MFIs) with venture capital (VC) firms.

In human terms, we are going to show our classifier a bunch of text from MFIs’ websites and a bunch of text from VCs’ websites and allow it to “figure out” the difference. A full explanation of what is going on under the hood is beyond the scope of this tutorial, but the classifier essentially uses a series of guesses and checks to determine the main differences.

Once our classifier has been trained, we can test it against our test data to see how well it learned. If we are satisfied with the classifier’s test performance, we can begin using it in production. We will also show you some simple ways to debug the classifier if things do not go as expected.

We will be using tools provided by the gensim library as well as the Natural Language Toolkit (NLTK). You do not need to know these libraries to follow along; I will explain each of the methods used.

Before we start, we will create a generator to build the texts. The generator saves us from having to store our large datasets in memory, allowing for retrieval of our texts on demand without clogging up resources.

import os
from itertools import islice

import gensim

num_rows = 50

def build_texts():
    # One subdirectory per provider, each containing a text.txt
    for subdirectory in os.listdir("/Directory/"):
        with open(f"/Directory/{subdirectory}/text.txt") as f:
            # Only the first num_rows lines, to speed up processing
            doc = " ".join([x.strip() for x in islice(f, num_rows)])
            # Lowercase, strip accents, and drop very short tokens
            yield gensim.utils.simple_preprocess(doc, deacc=True, min_len=3)

print("Text generator initialized...")

In this example, /Directory/ is the directory in which your text is stored. We iterate through each subdirectory, which should correspond to an object that you are interested in classifying. One thing to note here is that we are taking only the first 50 rows of text from each file to make processing a bit faster. In production you will want to use the entire text data.


Although we have our raw text, there is still a bit more preprocessing that we would like to do before training our classifier.

The first step will be to create a set of the most common bigrams. Bigrams are simply a list of two adjacent words. We are interested in identifying common bigrams because these are very powerful features within our text data that help us determine overall meaning. For example, “Microfinance Institution” and “Venture Capital” are both common bigrams that tell us a lot about the meaning of the text. We can draw more conclusions about a text if we see “Venture Capital” as a single feature, than if we can only see “Venture” and “Capital” as separate words.

To do this, we first create a list from our generator and then let the gensim package take care of the rest. You will find with a lot of NLP work that the bulk of the heavy lifting is gathering, cleaning, and processing the data, and that common packages handle the nuances of the Natural Language Processing itself:

train_texts = list(build_texts())

# Note: stops (the stopword set) is defined in the next code block;
# it must be defined before this line runs
bigram_phrases = gensim.models.Phrases(train_texts, common_terms=stops)

bigram = gensim.models.phrases.Phraser(bigram_phrases)

We can then test some common bigrams that should appear:

assert (bigram['microfinance', 'institution'][0] == "microfinance_institution")

assert (bigram['venture', 'capital'][0] == "venture_capital")

assert (bigram['solar', 'panel'][0] == "solar_panel")

assert (bigram['big', 'data'][0] == "big_data")

assert (bigram['united', 'states'][0] == "united_states")

Now that we have our bigrams trained, we are going to create a dictionary to store each provider’s name, provider type, and text. In this example, to keep things simple, we store a text file called provider_type.txt, containing the name of the provider type, in the same directory as text.txt.

While creating this dictionary, we are going to do some preprocessing, as well, in order to make the text more actionable for our classifier. I will walk through and explain the significance of each of the preprocessing steps after the jump.

import os
from itertools import islice

import nltk
from nltk.corpus import stopwords
from gensim.parsing.preprocessing import STOPWORDS

# Combine NLTK's and gensim's stopword lists
stops = set(stopwords.words('english')).union(set(STOPWORDS))

documents = []

for subdirectory in os.listdir("/Directory/"):
    with open(f"/Directory/{subdirectory}/text.txt") as f:
        with open(f"/Directory/{subdirectory}/provider_type.txt") as pt:
            provider_type = pt.readline().strip()

        provider = dict(name=subdirectory, provider_type=provider_type)

        text = " ".join([x.strip() for x in islice(f, num_rows)])

        # Tokenize, lowercase, and drop punctuation, numbers, and stopwords
        tokens = [word.lower() for word in nltk.word_tokenize(text)
                  if word.isalpha() and word.lower() not in stops]

        # Merge common bigrams, e.g. "venture capital" -> "venture_capital"
        tokens = bigram[tokens]

        provider['tokens'] = tokens
        documents.append(provider)


First, we use the word_tokenize method from nltk, which breaks up the string of text into words and punctuation.

We also use the isalpha method, which checks that the word consists of alphabetic characters only, because these are all that we are interested in.

The next bit of preprocessing is to eliminate stopwords. Stopwords are the most common words within a lexicon, like “the”, “he”, “she”, and “and”. We often don’t care about these words, and they can have a negative impact on our analysis.

Next, we gather the bigrams from within the list of tokens; finally, we add the tokens to our provider dictionary and add that dictionary to our list of documents, which will later be split into our test and training datasets.

Feature Sets

Next, we want to create what are called feature sets. Feature sets are lists of features (in this case word segments) that allow our model to quantify the text.

For our features, we have chosen to use the 100 most common words, excluding any words of three characters or fewer, across all documents. We make use of NLTK’s FreqDist class to identify these, and have manually added two additional features, “venture” and “micro”. These will be particularly useful for differentiating between MFIs and VC firms.

custom_word_features = ['venture', 'micro']

num_features = 100

# Frequency distribution over every token in every document
all_words = nltk.FreqDist(word.lower() for provider in documents for word in provider['tokens'])

# Keep the most common words longer than three characters
word_features = [word for (word, freq) in all_words.most_common(num_features) if len(word) > 3]

word_features = word_features + custom_word_features

We create a method called document_features that returns the features for a given document.

def document_features(document):
    document_words = set(document['tokens'])
    features = {}
    for word_feature in word_features:
        contains_feature = False
        for document_word in document_words:
            # Substring match, so "micro" also matches "microfinance_bank"
            if word_feature in document_word:
                features[f"contains({document_word})"] = True
                contains_feature = True
        features[f"contains({word_feature})"] = contains_feature
    return features

The above method may seem confusing due to the nested for loops, but it’s really quite straightforward. In it, we determine whether the features (word segments) that we have identified are contained within the tokens of the analyzed document. If a feature is present, we record it as such within the document’s features; if not, we mark it as absent. Once we have a list of features for each document, we can compare these features in order to determine which are most applicable to each provider type (MFI or VC).

Training our classifier

Finally, we get to train our classifier!


import math
import random

# The two provider types we are comparing in this example
provider_types = {'Microfinance Institution', 'Venture Capital'}

# Shuffle so the train/test split below is random
random.shuffle(documents)

featuresets = [(document_features(d), d['provider_type'])
               for d in documents if d['provider_type'] in provider_types]

num_featuresets = len(featuresets)

train_set = featuresets[:math.floor(num_featuresets / 2)]
test_set = featuresets[math.floor(num_featuresets / 2):]

classifier = nltk.NaiveBayesClassifier.train(train_set)

First, we shuffle our documents, so that we can ensure we divide them randomly between training and testing datasets.

Next, we use our document_features method above to build our featuresets; each featureset is a tuple of document features and provider type.

Then we divide our featuresets evenly into training and test sets and run classifier = nltk.NaiveBayesClassifier.train(train_set). There are many different implementations of Natural Language Classifiers, but for our purposes the NaiveBayesClassifier provided by nltk works just fine and could not be easier to implement.

It is somewhat difficult to grasp that 62 lines of code in this example are dedicated to cleaning, processing, and preprocessing data, and that only one line is used for actually training the classifier, but this is the nature of data science.

Now, time to test our new classifier on our test data:

print("Classifier accuracy percent:", (nltk.classify.accuracy(classifier, test_set))*100)

This tells us the accuracy of our classifier when applied to our test data. An example of output (this is what we get when we run ours):

Classifier accuracy percent: 91.28367670364501

This lets us know that our classifier is about 91% accurate. This is good, but we believe we can do better. Let me show you a couple of common ways to debug this process.

One example is to show the most informative features:


This will show us the 150 most informative features that our model used in determining its classifications. These should pass the “eyeball test”. For example, we would expect anything with “micro” or “venture” to be an extremely informative feature. An example of output:

Most Informative Features
    contains(savings_account) = True           Microf : Ventur =     67.5 : 1.0
    contains(account_opening) = True           Microf : Ventur =     50.1 : 1.0
      contains(savings_loans) = True           Microf : Ventur =     42.5 : 1.0
       contains(micro_credit) = True           Microf : Ventur =     38.8 : 1.0
  contains(microfinance_bank) = True           Microf : Ventur =     38.0 : 1.0
     contains(personal_loans) = True           Microf : Ventur =     35.0 : 1.0
           contains(startups) = True           Ventur : Microf =     31.4 : 1.0
        contains(marketplace) = True           Ventur : Microf =     31.1 : 1.0
   contains(savings_products) = True           Microf : Ventur =     30.1 : 1.0
    contains(loan_repayments) = True           Microf : Ventur =     27.4 : 1.0
contains(portfolio_companies) = True           Ventur : Microf =     26.8 : 1.0
   contains(loan_application) = True           Microf : Ventur =     24.4 : 1.0
      contains(asset_finance) = True           Microf : Ventur =     23.6 : 1.0
   contains(micro_businesses) = True           Microf : Ventur =     23.6 : 1.0
   contains(savings_accounts) = True           Microf : Ventur =     23.3 : 1.0

This example indeed passes the eyeball test with each feature corresponding to the correct category.

The line

contains(savings_account) = True           Microf : Ventur =     67.5 : 1.0

tells us that if the text contains “savings account”, the provider is 67.5 times more likely to be a Microfinance Institution than a Venture Capital firm. Pretty cool, but we haven’t figured out why our classifier is only 91% accurate.

An example of a slightly deeper analysis you can perform is the following, which prints the provider name, provider type, and first 50 tokens of each document the classifier misclassifies. Through this analysis, we have actually found cases where the classifier is correct and our testing data is wrong!

# Print the name, labeled provider type, and first 50 tokens of
# every document where the classifier disagrees with our label
for d in documents:
    featureset = (document_features(d), d['provider_type'])
    if classifier.classify(featureset[0]) != featureset[1]:
        print(d['name'])
        print([featureset[1]])
        print(d['tokens'][:50])

We can clearly see in the following output a couple of reasons why some of the capital providers are misclassified. For example, the first two are written in foreign languages. This teaches us that we should only try to classify texts after grouping them by language (a sketch of language filtering follows the output). The final example shows that the website was not properly scraped, because it requires JavaScript to load.

iSGS Investment Works
['Venture Capital']
['value', 'team', 'company', 'access', '五嶋一人', 'kazuhito', 'goshima', '代表取締役', '代表パートナー', '代表取締役', '佐藤真希子', 'makiko', 'sato', '取締役', '代表パートナー', '新卒一期生', 'mvp', '取締役', '代表パートナーに就任', '菅原敬', 'kei', 'sugawara', '取締役', '代表パートナー', '現アクセンチュア', 'ジャパン', 'investor', 'が発表した', 'executive', 'team', 'のインターネットセクター', 'best', 'cfo部門', '取締役', 'company', '社名', '株式会社isgs', 'インベストメントワークス', 'isgs', 'investment', 'works', '所在地', '資本金', '取締役_代表取締役', '代表パートナー', '五嶋一人', '取締役', '代表パートナー', '佐藤真希子', '取締役']

['Venture Capital']
['من', 'نحن', 'الوظائف', 'اتصل_بنا_english_العودة', 'إطلاق', 'مبادرة', 'تك', 'سباركس', 'نحن_فخورون', 'بأن', 'نعلن', 'لكم', 'عن', 'إطلاق', 'آخر', 'مبادراتنا', 'لدعم_الشركات', 'الناشئة', 'والرياديين', 'في', 'المنطقة', 'وهي', 'مبادرة', 'تك', 'سباركس', 'تهدف', 'المبادرة', 'لمساعدة', 'الرياديين', 'لبناء', 'مشاريعهم', 'والإرتقاء', 'بها', 'من', 'خلال', 'عرض', 'مقابلات', 'ملهمة', 'مع', 'شخصيات', 'بارزة', 'في', 'عالم', 'الريادة', 'من', 'موجهين', 'وأكثر', 'الهدف', 'الرئيسي', 'لمبادرة']

Al Tamimi Investments
['Venture Capital']
['javascript_required', 'enable_javascript', 'allowed', 'page']
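As an illustration of the language-grouping fix, here is a minimal sketch using the langdetect package (an assumption on our part; any language-identification library would do):

from langdetect import detect

def is_english(provider):
    # Detect the language from a sample of the provider's tokens
    sample = " ".join(provider['tokens'][:50])
    try:
        return detect(sample) == "en"
    except Exception:
        # Empty or unintelligible text; skip it
        return False

english_documents = [d for d in documents if is_english(d)]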

If your classifier is not performing as you expect, have a look at which examples it gets wrong and see if there are any common trends!


I hope you enjoyed this NLP walkthrough and that it helps you or your organization make a bit more sense of Machine Learning, NLP, and Artificial Intelligence concepts.

If you’re interested in working on these kinds of projects please check our careers site, or email me at malcolm [at]!

Alternative Finance Data for Emerging Markets: Building the Capital Finder (Part II)

The following was written by CTO Malcolm Kapuza, who explains how alternative finance data in the Capital Finder is collected, how funders are categorized, and how we use natural language processing and machine learning to enhance our database.

As I mentioned in the first part of this three-part series, the Capital Finder has many useful applications. In this second part, I go into detail about where we source our different alternative finance data points and how we turn this data into one coherent and intuitive view of developing world alternative finance. In the final part of the series, I walk readers through the steps — and even some of the actual code — that make this all possible.

This is a semi-technical explanation to show what’s “under the hood,” and to demonstrate what we do at AlliedCrowds that is so different from other organizations in this space.

Sourcing and maintaining the data

The key to a product like the Capital Finder is having the most current and accurate data in the industry. There are a couple of ways that we stay on top of this to ensure that our clients are able to make strategic decisions and inform research based on the best available data out there.

Firstly, we have a team of multilingual analysts spread throughout the developing world. Local knowledge is crucial in the sourcing process, as it minimizes language barriers and unforeseen geographic constraints. Our analysts have a deep understanding of the alternative finance space, meaning finding new capital providers is relatively easy.

Secondly, we have developed deep industry connections through our time analyzing and researching this space, which means that we are one of the first to know when a new funder opens up, or an existing one expands into one of our target geographies.

Finally, we are constantly pushing the envelope when it comes to innovating with technology solutions at AlliedCrowds. We have developed programmatic processes that flag new capital providers as they emerge and alert us when certain information in our database has gone stale.

Finding suitability

The quality and accuracy of our alternative finance data is only one aspect of what we do. The other is our ability to pair relevant projects with capital providers, as well as categorize capital providers based on continually changing criteria.

Given the sheer size of our data and the relatively small size of our team, we depend on cutting-edge technologies to make this possible. I will outline three use cases to give some idea of why this works so well and how we’re able to do it with minimal resources.

Alternative finance data collection

AlliedCrowds has streamlined and continues to improve our data collection process. Our main focus is to increase automation, while also improving accuracy and data integrity.

We gather text from thousands of websites and millions of web pages, creating a data warehouse of text. Recent developments in Natural Language Processing (NLP) have allowed us to rapidly and continuously improve our view of the space, because it takes only moments to process all of this text rather than months to revisit every website individually. This means that our insights are faster, more accurate and more scalable than if this analysis were completed manually.


An example of the sort of data we can showcase using the Capital Finder.

Additionally, we source data from third-party APIs in order to get a more complete and accurate view of each capital provider. These sources are invaluable, each giving us unique and actionable data, and each working as a trusted check and balance that allows us to spot anomalies. They include social media platforms, news agencies, development institutions, public records, etc.

We crowdsource information that we cannot collect programmatically and which would be prohibitively expensive to collect through analysts. We have used technology to create straightforward ways for providers in our database to deliver us valuable information. This streamlined process is simple and is incentivized with increased visibility on the Capital Finder and inbound traffic, which has led to high engagement.

Intelligent Categorization

As our technology advances, we have begun to reduce the workload of our analyst team. For instance, analysts no longer fill in country- and sector-level information, because we have found our algorithms to be more accurate, faster, and more cost-efficient than their input. This reduces human error in our data and allows us to scale massively.

Through this process, we are close to eliminating much of the decision-making from our analyst roles. The goal is for every entry to be fact-based and subsequently fact-checked, which eliminates the need for time-consuming and costly training and removes ambiguity from the data collection process. This also ensures scalability and consistency throughout the system, as well as much lower maintenance costs, as we eliminate dependence on any given analyst’s specific skill set or expertise. The system channels most of the decision-making and reasoning to the highest level, and therefore allows the Capital Finder to be managed centrally. Anything that an analyst does to improve the system gets spread throughout all 138 countries, as well as all sectors and provider types.

Bespoke matching

These advances have allowed us to create a distinctly effective matching system. With our data pool, we are able to comb through millions of web pages to target specific keywords and phrases on these sites; and since we track social followings, recent trends, public filing information, and unique provider statistics, we are able to gauge the suitability of a certain project to each provider.

The goal of our matching algorithm is much like that of Google’s PageRank algorithm. When you use Google to query a phrase, Google returns not only every website that is relevant to your search, but orders them based on which it has determined will be most relevant. Since this algorithm is so effective, you rarely look beyond the first or second search result. We are making it so that finding a capital provider is just as easy. A toy illustration of keyword-based ranking follows.
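Purely as a toy illustration of the ranking idea (this is not our actual algorithm, which weighs many more signals, and the provider names are made up):

# Toy scoring: count how often a query's keywords appear in the
# tokens scraped from each provider's site
def score(tokens, keywords):
    return sum(tokens.count(k) for k in keywords)

providers = {
    "Provider A": ["agribusiness", "cassava", "nigeria", "equity", "cassava"],
    "Provider B": ["fintech", "saas", "seed"],
}

query = ["cassava", "nigeria"]
ranked = sorted(providers, key=lambda name: score(providers[name], query), reverse=True)
print(ranked)  # ['Provider A', 'Provider B']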

Stay tuned for the next post where I discuss in more technical detail the nuances of what makes the Capital Finder so effective.