Generating PDF Reports Programmatically in Python Using API Data

The following was written by CTO Malcolm Kapuza, who dives into some example code showing how to use the Capital Finder API to create the alternative finance directories we put out on a regular basis.

In this blog we’re going to walk through how we generate our alternative finance directories programmatically. If you are not familiar with these, you can check out our reports page or view our latest directory on cryptocurrencies in emerging markets here. These directories use our Capital Finder API as a data source, but present it in a visually appealing, digestible, and shareable way. This has been very useful in allowing our clients to understand the kind of data that we have, without having to query our API directly.

These directories have been very popular and proven to be a great way to share our data. If you are reading this, you may be sitting on a lot of valuable data, but struggling for a way to engage your audience. Reports are a concise and aesthetic way to do this.

Your data will undoubtedly be different from ours. However, we hope that this tutorial demystifies the process and gives you ideas for how you can use this process within your own organization to create visually appealing reports and engage your audience.

I have embedded a Jupyter notebook below to better narrate the explanation. Please have a look at the following example directory, generated from the code, here before reading the notebook, as it will make following along much easier. Also, you can find the HTML used to template the report at the bottom of the page.

If you are interested in trying this notebook out using our data, please email us at capital.finder (at) alliedcrowds.com and we will supply you a temporary key that will allow you to generate your own unique directories for use within your organization.

Report Table Screenshot

The following table is referenced in the Jupyter Notebook below.

Crowdfunding Country Table

Figure 1: Example output for semi-active and active country table

Jupyter Notebook

This Jupyter Notebook details how to generate PDF reports programmatically in Python using our Capital Finder API.

report_template.html

This gist contains the HTML used for templating the report generated in this blog.

Alternative Finance Data for Emerging Markets: Natural Language Processing (Part III)

The following was written by CTO Malcolm Kapuza, who dives into some example code showing how alternative finance data in the Capital Finder is categorized, using machine learning and natural language processing.

As I mentioned in the first part of this three-part series, the Capital Finder has many useful applications. In the second part, I discussed in detail where we source our different data points and how we turn this data into one coherent and intuitive view of developing world alternative finance. In this final part of the series, I walk readers through the steps, and even some of the actual code, that make this all possible.

By the end of this blog, you will learn all of the high level steps for creating your own data pipeline, including some of the more detailed steps of using Machine Learning to build a Natural Language Classifier.

In order to try out the code examples yourself, you will need your own dataset to process. You can find datasets online to practice with, for example in the UCI Machine Learning Repository or the NLTK Python package, which also provides easily accessible data for you to work with.

Gathering alternative finance data

The first step in any data pipeline is gathering data. At AlliedCrowds, we have created a proprietary web scraper that makes use of the Python asyncio library to make concurrent HTTP requests. This allows us to scale to hundreds of thousands of web pages per day on a standard machine.

Asynchronous functionality is important if you need to scale a process that is input/output bound. With web crawling, the process is bound by the network speed and your target website’s response time. The average time that it takes to load an uncached website is approximately 5 seconds. Needless to say, waiting 5 seconds per request is unacceptable, given that we will be making millions of such requests (e.g. one million sequential requests would take 5 million seconds, or ~58 days).

Rather than making one request to one server and waiting for a response, we make as many requests as our CPU can process and our RAM can hold in memory. This means that we are no longer bound by slow websites. In other words, we have solved our input/output bind.
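The idea can be sketched in a few lines. This is a minimal illustration, not our proprietary scraper: the network call is simulated with asyncio.sleep so that the example runs offline, and the names fake_fetch, crawl, run_demo and max_concurrent are assumptions made up for this sketch.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Stand-in for an HTTP request: waits `delay` seconds, then returns fake HTML."""
    await asyncio.sleep(delay)  # the event loop serves other requests while we wait
    return f"<html>{url}</html>"


async def crawl(urls, max_concurrent=100):
    """Fetch all urls concurrently, with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_fetch(url):
        async with semaphore:
            return await fake_fetch(url)

    # gather returns the results in the same order as the input urls
    return await asyncio.gather(*(bounded_fetch(url) for url in urls))


def run_demo(num_pages=50):
    """Crawl `num_pages` fake pages and report how long the batch took."""
    urls = [f"https://example.com/page/{i}" for i in range(num_pages)]
    start = time.perf_counter()
    pages = asyncio.run(crawl(urls))
    return pages, time.perf_counter() - start
```

Fetched one at a time, 50 pages at 0.1 seconds each would take about 5 seconds. Run concurrently, the whole batch should finish in roughly the time of a single request.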

One caveat: with such a powerful web scraper, it is easy to overwhelm the server of the website that you are crawling. In order to be respectful, it is best to optimize your request queue so that you target multiple websites at once. For example, if you are crawling 60 pages per second, these will be on 60 different sites. This also helps speed things up, because websites tend to slow down as they become overwhelmed by traffic.
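One simple way to spread requests across sites is to reorder the queue round-robin by host. The sketch below is an assumption about how this could be done, not a description of our production queue; interleave_by_host is a hypothetical helper name.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


def interleave_by_host(urls):
    """Reorder urls so that consecutive requests hit different hosts where possible."""
    by_host = defaultdict(deque)
    for url in urls:
        by_host[urlparse(url).netloc].append(url)

    ordered = []
    queues = list(by_host.values())
    while queues:
        # Take one URL from each host in turn, then drop the exhausted hosts
        for queue in queues:
            ordered.append(queue.popleft())
        queues = [queue for queue in queues if queue]
    return ordered
```

A real crawler would probably also add a per-host delay, but even this reordering avoids sending a burst of requests to a single server.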

Asynchronous Crawling

Storing data

Once we scrape a webpage, we store that data (the HTML) in a cloud-based file store for safekeeping and faster accessing in the future. This saves us from having to recrawl every site when we would like to perform a new analysis.

For the rest of this tutorial, a local file system will suffice.
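As a rough sketch of what a local store could look like, the helpers below save each page under a folder named after a hash of its URL. The layout and the names page_path, save_page and load_page are assumptions for illustration only.

```python
import hashlib
from pathlib import Path


def page_path(root, url):
    """Map a URL to a stable file path under `root` using a hash of the URL."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(root) / digest / "page.html"


def save_page(root, url, html):
    """Write the raw HTML to disk so the page never has to be recrawled."""
    path = page_path(root, url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path


def load_page(root, url):
    """Return the stored HTML for `url`, or None if it has not been saved yet."""
    path = page_path(root, url)
    return path.read_text(encoding="utf-8") if path.exists() else None
```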

Cleaning data

Cleaning data is the least glamorous aspect of an effective data pipeline, and is often the most overlooked. However, it is a crucial step in the process.

The HTML tells us a surprising amount about the website in question. For example, HTML holds information about:

  • What language the site is written in
  • Where the site is based
  • What images appear on the page
  • What the website is about
  • What the title of the particular page is
  • What the title of the particular article is
  • What the subtitle of that article is
  • What words/phrases are bold
  • What other topics this topic links to

You may be surprised by how much depth HTML adds to your understanding of a webpage. HTML forms the basis for how websites are Search Engine Optimized, and therefore categorized, throughout the web. We rank these different HTML aspects according to our view on their importance. For the sake of this tutorial, however, we’ll simply pool all of the visible text and treat it as equal.
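To make a few of the signals listed above concrete, here is a small sketch that pulls the language, title, headings, bold text and links out of a page. It uses Python's standard html.parser module rather than BeautifulSoup so that it runs without extra dependencies; the SignalParser class and extract_signals function are hypothetical names, and real pages will need more careful handling.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not tracked as "open"
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class SignalParser(HTMLParser):
    """Collects a handful of structural signals from raw HTML."""

    def __init__(self):
        super().__init__()
        self.signals = {"lang": None, "title": "", "headings": [], "bold": [], "links": []}
        self._open_tags = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            self.signals["lang"] = attrs.get("lang")
        elif tag == "a" and attrs.get("href"):
            self.signals["links"].append(attrs["href"])
        if tag not in VOID_TAGS:
            self._open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open_tags:
            # Pop back to the most recent matching open tag
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._open_tags:
            return
        current = self._open_tags[-1]
        if current == "title":
            self.signals["title"] += text
        elif current in ("h1", "h2"):
            self.signals["headings"].append(text)
        elif current in ("b", "strong"):
            self.signals["bold"].append(text)


def extract_signals(html):
    parser = SignalParser()
    parser.feed(html)
    return parser.signals
```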

This brings us to our first bit of code. In our case we use the BeautifulSoup library to pull all of the visible text out of our raw HTML:

from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    # Skip text inside tags that are not rendered as page content
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Skip HTML comments
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(soup):
    # Collect every text node, then keep only the visible ones
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

The code is fairly self-explanatory. The text_from_html function uses Beautiful Soup’s built-in findAll method to find all text nodes, then uses the tag_visible function to filter out any content that is not meant to be read by the website visitor. It expects an already parsed document, e.g. BeautifulSoup(html, 'html.parser').

In this case, the reason that we care only about the visible text is that we want to ensure that our Natural Language Classifier is reading the page in the same way that a normal website user would be.

As mentioned above, in this tutorial we do not consider the difference in importance between HTML tags. As an exercise, however, think about how you might modify the code to account for the difference between a title tag and a paragraph tag, for instance.
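One possible approach to this exercise, sketched with Python's standard html.parser module rather than BeautifulSoup so that it runs on its own, is to count the words inside important tags several times. The weights in TAG_WEIGHTS are invented for illustration and are not the rankings we use. Note that, unlike tag_visible above, this sketch deliberately keeps the title text.

```python
from html.parser import HTMLParser

# Hypothetical weights: how many times text inside each tag is counted
TAG_WEIGHTS = {"title": 5, "h1": 3, "h2": 2, "b": 2, "strong": 2}
HIDDEN_TAGS = {"script", "style"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class WeightedTextParser(HTMLParser):
    """Collects visible words, repeating those found inside heavily weighted tags."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open_tags:
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(tag in HIDDEN_TAGS for tag in self._open_tags):
            return
        # Use the largest weight among the enclosing tags, defaulting to 1
        weight = max((TAG_WEIGHTS.get(tag, 1) for tag in self._open_tags), default=1)
        self.words.extend(data.split() * weight)


def weighted_words(html):
    parser = WeightedTextParser()
    parser.feed(html)
    return parser.words
```

Because a word in the title now appears five times in the output, any frequency-based feature built on top of it will treat title words as more important than body words.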

Creating Training, Test, and Validation Datasets

Before we can create a Natural Language Classifier we need a training dataset. A training dataset is used for learning. It is already classified, vetted and validated data. Think of this data as a group of examples that our classifier can draw inferences from. Training sets are difficult to obtain and there is no “shortcut” or easy way out. The bottom line is that you will need to manually (and accurately) classify a subset of your data that can be used to inform your classifier. This process is easier than manually classifying the entire dataset, but it is no trivial task.

One of our most straightforward use cases for NLP is categorizing our capital providers by funding type. Comparing the rates of false positives and false negatives against known data, we have found it to be dramatically more accurate than our paid analysts.

                   Analysts    Machine Learning Algorithm
False Positives    50%         10%
False Negatives    25%         5%

Table 1: Comparison of false positive and false negative rates in AlliedCrowds country categorization data. We have hundreds of thousands of country and capital provider pairings in our database and this task proved to be too onerous for our analysts. Our results improved dramatically when we switched to a Natural Language Classifier.

One common question is how large your training set should be. The answer really depends on the task at hand. A simple enough task could have a training set with one data point per group. More complex problems could require thousands of data points to differentiate between groups. As a rule, the more complete the training data, the better the classifier performs. Therefore, you may trial increasing training dataset sizes until you are happy with the results.

We gathered our training set via a painstaking process of viewing and reviewing a subset of 6000 of our capital providers until we were certain that it was 100% accurate.

Once you have your training set, however, the tools to process the texts are very accessible, straightforward, and actually fun to use! The training set is really the last obstacle.

One of the issues that can arise from fitting a classifier to a training set is overfitting. This means that the classifier performs well at picking up the nuances of our training set, but is not good at generalizing those to the larger population. Thankfully there is a very simple solution to this and that is to split your data in half, creating a test dataset.

A test dataset is independent of the training dataset, and so needs to be randomly selected. Ideally, you will randomly select it each time you retrain and test your classifier. If your model fits the training dataset and also fits the test dataset, then you can be confident that there is minimal overfitting and that it is properly generalized to the population. If not, it means that your model is overfitting the training data and you may need to either tweak some parameters or increase the size and quality of your training data.

As a final note, I will briefly touch on validation data. Validation data is used in order to tweak hyperparameters (which are out of the scope of this blog). Hyperparameters, as the name suggests, are a sort of meta parameter that can help fine-tune your model. One of the issues that arises through the training/testing cycle is overfitting hyperparameters. The validation dataset ensures that you are not overfitting these hyperparameters, in the same way that the testing data ensures that you are not overfitting your model parameters.
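A random three-way split can be sketched as follows. The 60/20/20 proportions and the split_dataset name are assumptions chosen for illustration; pick whatever proportions suit the size of your data.

```python
import random


def split_dataset(items, train_frac=0.6, validation_frac=0.2, seed=None):
    """Shuffle `items` and split them into train, validation and test lists."""
    shuffled = list(items)
    # A fixed seed makes the split reproducible; leave it as None to reshuffle each run
    random.Random(seed).shuffle(shuffled)
    train_end = int(len(shuffled) * train_frac)
    validation_end = train_end + int(len(shuffled) * validation_frac)
    return shuffled[:train_end], shuffled[train_end:validation_end], shuffled[validation_end:]
```

Every item lands in exactly one of the three lists, so no example used for training can leak into the validation or test data.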

Natural Language Processing (NLP)

Once we have our training and test datasets, we are ready to train and test our Natural Language Classifier. To guide you through this tutorial, we will use the example mentioned above of classifying capital providers by provider type. We will simplify things by comparing microfinance institutions (MFIs) with venture capital (VC) firms.

In human terms, we are going to show our classifier a bunch of website text from MFIs’ websites and a bunch of text from VCs’ websites and allow it to “figure out” the difference. A full explanation of what is going on under the hood is out of the scope of this tutorial, but the classifier essentially uses a series of guesses and checks to determine the main differences.

Once our classifier has been trained, we can test it against our test data to see how well it learned. If we are satisfied with the classifier’s test performance, we can begin using it in production. We will also show you some simple ways to debug the classifier if things do not go as expected.

We will be using tools provided by the gensim library as well as the Natural Language Toolkit (NLTK). You do not need to know these libraries to follow along; I will explain each of the methods used.

Before we start, we will create a generator to build the texts. The generator saves us from having to store our large datasets in memory, allowing for retrieval of our texts on demand without clogging up resources.

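import os
from itertools import islice

import gensim

num_rows = 50  # number of lines to read from each file

def build_texts():
    # Each subdirectory holds the scraped text for one provider
    for subdirectory in os.listdir("/Directory/"):
        with open(f"/Directory/{subdirectory}/text.txt") as f:
            doc = " ".join([x.strip() for x in islice(f, num_rows)])
            # Lowercase, strip accents, and drop tokens shorter than 3 characters
            yield gensim.utils.simple_preprocess(doc, deacc=True, min_len=3)

print("Text generator initialized...")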

In this example, Directory is the directory in which your text is stored. We are iterating through each subdirectory in this directory, which should correspond to an object that you are interested in classifying. One thing to note here is that we are taking the first 50 rows of text from each file to make processing a bit faster. In production you will want to use the entire text data.

Preprocessing

Although we have our raw text, there is still a bit more preprocessing that we would like to do before training our classifier.

The first step will be to create a set of the most common bigrams. A bigram is simply a pair of adjacent words. We are interested in identifying common bigrams because these are very powerful features within our text data that help us determine overall meaning. For example, “Microfinance Institution” and “Venture Capital” are both common bigrams that tell us a lot about the meaning of the text. We can draw more conclusions about a text if we see “Venture Capital” as a single feature than if we can only see “Venture” and “Capital” as separate words.
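To show the idea without any dependencies, here is a simplified sketch that finds bigrams by raw frequency and then merges them into single tokens. It is a stand-in for illustration only: gensim's Phrases model uses a statistical score rather than the plain count threshold assumed here, and find_common_bigrams and merge_bigrams are hypothetical names.

```python
from collections import Counter


def find_common_bigrams(texts, min_count=2):
    """Return the set of adjacent word pairs seen at least `min_count` times."""
    counts = Counter()
    for tokens in texts:
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, count in counts.items() if count >= min_count}


def merge_bigrams(tokens, bigrams):
    """Join each known bigram into a single token such as 'venture_capital'."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(f"{tokens[i]}_{tokens[i + 1]}")
            i += 2  # skip the second word of the pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```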

To do this, we first create a list from our generator and then we let the gensim package take care of the rest. You will find with a lot of NLP work that the bulk of the heavy lifting is gathering, cleaning and processing the data, and that common packages handle the nuances of the Natural Language Processing itself:

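# `stops` is the stopword set defined in the preprocessing snippet further down; build it before running this
train_texts = list(build_texts())
# Note: gensim 4.0 renamed the common_terms argument to connector_words
bigram_phrases = gensim.models.Phrases(train_texts, common_terms=stops)
bigram = gensim.models.phrases.Phraser(bigram_phrases)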

We can then test some common bigrams that should appear:

assert (bigram['microfinance', 'institution'][0] == "microfinance_institution")

assert (bigram['venture', 'capital'][0] == "venture_capital")

assert (bigram['solar', 'panel'][0] == "solar_panel")

assert (bigram['big', 'data'][0] == "big_data")

assert (bigram['united', 'states'][0] == "united_states")

Now that we have our bigrams trained, we are going to create a dictionary to store the provider name, provider type, and text. In this example, to keep things simple, we store a text file called provider_type.txt with the name of the provider type in the same directory as text.txt.

While creating this dictionary, we are going to do some preprocessing, as well, in order to make the text more actionable for our classifier. I will walk through and explain the significance of each of the preprocessing steps after the jump.

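import nltk
from nltk.corpus import stopwords
from gensim.parsing.preprocessing import STOPWORDS

# The NLTK stopword corpus and tokenizer data must be downloaded first (see nltk.download)
stops = set(stopwords.words('english')).union(set(STOPWORDS))

documents = []
for subdirectory in os.listdir("/Directory/"):
    with open(f"/Directory/{subdirectory}/text.txt") as f:
        with open(f"/Directory/{subdirectory}/provider_type.txt") as pt:
            # Strip the trailing newline so the label compares cleanly later
            provider_type = pt.readline().strip()
        provider = dict(name=subdirectory, provider_type=provider_type)
        text = " ".join([x.strip() for x in islice(f, num_rows)])
        # Keep lowercase alphabetic tokens that are not stopwords
        tokens = [word.lower() for word in nltk.word_tokenize(text) if word.isalpha() and word.lower() not in stops]
        # Merge known bigrams, e.g. 'venture', 'capital' -> 'venture_capital'
        tokens = bigram[tokens]
        provider['tokens'] = tokens
        documents.append(provider)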

First, we use the word_tokenize method from nltk, which breaks up the string of text into words and punctuation.

We also use the isalpha method, which checks that the word consists of alphabetic characters only, because this is all that we are interested in.

The next bit of preprocessing is to eliminate stopwords. Stopwords are the list of most common words within a lexicon, like “the”, “he”, “she”, “and”. We often don’t care about these words and they can have a negative impact on our analysis.

Next, we merge the bigrams within the list of tokens. Finally, we add the tokens to our provider dictionary and append that dictionary to our list of documents, which will later be split into our training and test datasets.
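The filtering steps described above can be tried out on a single sentence with the dependency-free sketch below. It makes two simplifying assumptions: str.split stands in for nltk's word_tokenize (so a word with punctuation attached is dropped by the isalpha check), and TOY_STOPWORDS is a tiny made-up stopword list rather than the combined nltk and gensim lists.

```python
# A tiny, made-up stopword list for illustration only
TOY_STOPWORDS = {"the", "he", "she", "and", "a", "to"}


def clean_tokens(text, stopwords=TOY_STOPWORDS):
    """Lowercase the words in `text`, keeping alphabetic non-stopwords only."""
    tokens = text.split()  # stand-in for nltk.word_tokenize
    return [t.lower() for t in tokens if t.isalpha() and t.lower() not in stopwords]
```

For example, the year in "She opened a savings account in 2019" is discarded because it is not alphabetic, while "She" and "a" are discarded as stopwords.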

Feature Sets

Next, we want to create what are called feature sets. Feature sets are lists of features (in this case word segments) that allow our model to quantify the text.

For our features, we have chosen to use the 100 most common words across all documents, excluding any words of three characters or fewer. We make use of nltk’s FreqDist class to identify these and have manually added two further features, “venture” and “micro”. These will be particularly useful for differentiating between MFIs and VC firms.

custom_word_features = ['venture', 'micro']

num_features = 100

all_words = nltk.FreqDist(word.lower() for provider in documents for word in provider['tokens'])

word_features = [word for (word, freq) in all_words.most_common(num_features) if len(word) > 3]

word_features = word_features + custom_word_features

We create a method called document_features that returns the features for a given document.

def document_features(document):
    document_words = set(document['tokens'])
    features = {}
    for word_feature in word_features:
        contains_feature = False
        for document_word in document_words:
            # Substring match: 'micro' matches 'micro_credit', 'microfinance_bank', ...
            if word_feature in document_word:
                features[f"contains({document_word})"] = True
                contains_feature = True
        features[f"contains({word_feature})"] = contains_feature
    return features

The above function may seem confusing, due to the nested for loops, but it’s really quite straightforward. In it, we determine whether each of the features (word segments) that we have identified is contained within any token of the analyzed document. If it is, we mark both the feature and the matching token as present; if not, we mark the feature as absent. Once we have the features for each document, we can compare them in order to determine which features are most indicative of each provider type (MFI or VC).
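To see what the function returns, here it is applied to a toy document. The three-word word_features list and the toy_document dictionary are invented for this example; in the real pipeline both come from the steps above.

```python
# Hypothetical feature list standing in for the word_features built earlier
word_features = ["loan", "venture", "micro"]


def document_features(document):
    document_words = set(document["tokens"])
    features = {}
    for word_feature in word_features:
        contains_feature = False
        for document_word in document_words:
            if word_feature in document_word:
                features[f"contains({document_word})"] = True
                contains_feature = True
        features[f"contains({word_feature})"] = contains_feature
    return features


toy_document = {"name": "Example MFI", "tokens": ["micro_credit", "loan_application", "savings"]}
toy_features = document_features(toy_document)
# toy_features now maps:
#   contains(loan_application) -> True,  contains(loan)  -> True
#   contains(micro_credit)     -> True,  contains(micro) -> True
#   contains(venture)          -> False
```

Note that “savings” produces no entry at all, because it does not contain any of the chosen features.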

Training our classifier

Finally, we get to train our classifier!

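import random

# Assumed label values; use the strings stored in your provider_type.txt files
provider_types = ['Microfinance Institution', 'Venture Capital']

random.shuffle(documents)
featuresets = [(document_features(d), d['provider_type']) for d in documents if d['provider_type'] in provider_types]
num_featuresets = len(featuresets)
# First half for training, second half for testing
train_set, test_set = featuresets[:num_featuresets // 2], featuresets[num_featuresets // 2:]
classifier = nltk.NaiveBayesClassifier.train(train_set)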

First, we shuffle our documents, so that we can ensure we divide them randomly between training and testing datasets.

Next, we use our document_features function above to build our featuresets, a list of tuples that each pair a document’s features with its provider type.

Then we divide our feature sets evenly into training and test sets and run classifier = nltk.NaiveBayesClassifier.train(train_set). There are many different implementations of Natural Language Classifiers, but for our purposes the NaiveBayesClassifier provided by nltk works just fine and could not be easier to implement.

It is somewhat difficult to grasp that 62 lines of code in this example are dedicated to cleaning, processing and preprocessing data and that only 1 line is used for actually training the classifier, but this is the nature of data science.

Now, time to test our new classifier on our test data:

print("Classifier accuracy percent:", (nltk.classify.accuracy(classifier, test_set))*100)

This tells us the accuracy of our classifier when applied to our test data. Here is the output we get when we run ours:

Classifier accuracy percent: 91.28367670364501

This lets us know that our classifier is about 91% accurate. This is good, but we believe we can do better. Let me show you a couple of common ways to debug this process.

One example is to show the most informative features:

classifier.show_most_informative_features(150)

This will show us the 150 most important aspects that our model used for determining its classifications. These should pass the “eyeball test”. For example, we would expect anything with “micro” or “venture” to be extremely informative features. An example of output:

Most Informative Features
contains(savings_account) = True           Microf : Ventur =     67.5 : 1.0
contains(account_opening) = True           Microf : Ventur =     50.1 : 1.0
 contains(savings_loans) = True           Microf : Ventur =     42.5 : 1.0
  contains(micro_credit) = True           Microf : Ventur =     38.8 : 1.0
contains(microfinance_bank) = True           Microf : Ventur =     38.0 : 1.0
contains(personal_loans) = True           Microf : Ventur =     35.0 : 1.0
      contains(startups) = True           Ventur : Microf =     31.4 : 1.0
   contains(marketplace) = True           Ventur : Microf =     31.1 : 1.0
contains(savings_products) = True           Microf : Ventur =     30.1 : 1.0
contains(loan_repayments) = True           Microf : Ventur =     27.4 : 1.0
contains(portfolio_companies) = True           Ventur : Microf =     26.8 : 1.0
contains(loan_application) = True           Microf : Ventur =     24.4 : 1.0
 contains(asset_finance) = True           Microf : Ventur =     23.6 : 1.0
contains(micro_businesses) = True           Microf : Ventur =     23.6 : 1.0
contains(savings_accounts) = True           Microf : Ventur =     23.3 : 1.0
...

This example indeed passes the eyeball test with each feature corresponding to the correct category.

contains(savings_account) = True Microf : Ventur = 67.5 : 1.0 

This line tells us that the “savings account” feature shows up 67.5 times more often among the Microfinance Institution training examples than among the Venture Capital ones, so a text containing it is far more likely to be an MFI than a VC firm. Pretty cool, but we haven’t figured out why our classifier is only 91% accurate.
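A ratio of this kind can be reproduced by hand from document counts. The sketch below is an illustration under assumptions: the counts are made up, and the add-half smoothing is modelled on the expected likelihood estimate, which may not match every detail of what nltk computes internally.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, smoothing=0.5):
    """How much more often a boolean feature appears in class A than in class B.

    count_x is the number of class-X documents containing the feature and
    total_x is the number of class-X documents overall. The smoothing term
    keeps the ratio finite when a feature never appears in one class.
    """
    p_a = (count_a + smoothing) / (total_a + 2 * smoothing)
    p_b = (count_b + smoothing) / (total_b + 2 * smoothing)
    return p_a / p_b


# Made-up counts: the feature appears in 135 of 1000 MFI documents and 2 of 1000 VC documents
ratio = likelihood_ratio(135, 1000, 2, 1000)
```

With these toy counts the feature is roughly 54 times more common among MFI documents. Without the smoothing term, a feature seen in zero VC documents would cause a division by zero.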

An example of a slightly deeper analysis you can perform is the following, which prints the provider name, provider type and first 50 tokens. Through this analysis, we have actually found that the classifier is correct and our testing data is wrong!

for d in documents:
    featureset = (document_features(d), d['provider_type'])
    # Skip the documents that the classifier gets right
    if classifier.classify(featureset[0]) == featureset[1]:
        continue
    else:
        print("***INCORRECT!***")
        print(f"{d['name']}")
        print(f"{d['provider_type']}")
        print(f"{d['tokens'][:50]}")

We can clearly see in the following output a couple of reasons why some of the capital providers are misclassified. For example, the first two are written in languages other than English. This teaches us that we should only try to classify texts after grouping them by language. The final example shows that the website was not properly scraped, because it requires JavaScript to load.

***INCORRECT!***
iSGS Investment Works
['Venture Capital']
['value', 'team', 'company', 'access', '五嶋一人', 'kazuhito', 'goshima', '代表取締役', '代表パートナー', '代表取締役', '佐藤真希子', 'makiko', 'sato', '取締役', '代表パートナー', '新卒一期生', 'mvp', '取締役', '代表パートナーに就任', '菅原敬', 'kei', 'sugawara', '取締役', '代表パートナー', '現アクセンチュア', 'ジャパン', 'investor', 'が発表した', 'executive', 'team', 'のインターネットセクター', 'best', 'cfo部門', '取締役', 'company', '社名', '株式会社isgs', 'インベストメントワークス', 'isgs', 'investment', 'works', '所在地', '資本金', '取締役_代表取締役', '代表パートナー', '五嶋一人', '取締役', '代表パートナー', '佐藤真希子', '取締役']

***INCORRECT!***
N2V
['Venture Capital']
['من', 'نحن', 'الوظائف', 'اتصل_بنا_english_العودة', 'إطلاق', 'مبادرة', 'تك', 'سباركس', 'نحن_فخورون', 'بأن', 'نعلن', 'لكم', 'عن', 'إطلاق', 'آخر', 'مبادراتنا', 'لدعم_الشركات', 'الناشئة', 'والرياديين', 'في', 'المنطقة', 'وهي', 'مبادرة', 'تك', 'سباركس', 'تهدف', 'المبادرة', 'لمساعدة', 'الرياديين', 'لبناء', 'مشاريعهم', 'والإرتقاء', 'بها', 'من', 'خلال', 'عرض', 'مقابلات', 'ملهمة', 'مع', 'شخصيات', 'بارزة', 'في', 'عالم', 'الريادة', 'من', 'موجهين', 'وأكثر', 'الهدف', 'الرئيسي', 'لمبادرة']

***INCORRECT!***
Al Tamimi Investments
['Venture Capital']
['javascript_required', 'enable_javascript', 'allowed', 'page']

If your classifier is not performing as you expect, have a look at which examples it gets wrong and see if there are any common trends!
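Both of the problems found above can be flagged automatically before classification. The two checks below are rough heuristics offered as a sketch; the function names and the 20-token threshold are assumptions, and a proper language detector would be more reliable than counting non-ASCII tokens.

```python
def non_ascii_share(tokens):
    """Fraction of tokens containing at least one non-ASCII character."""
    if not tokens:
        return 0.0
    flagged = sum(1 for token in tokens if not token.isascii())
    return flagged / len(tokens)


def looks_unscraped(tokens, min_tokens=20):
    """Very short token lists often mean the page failed to render."""
    return len(tokens) < min_tokens
```

Documents with a high non-ASCII share can be set aside for a language-specific classifier, and documents that look unscraped can be queued for another crawl.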

Conclusion

I hope you enjoyed this NLP walkthrough and it helps you or your organization make a bit more sense of Machine Learning, NLP, and Artificial Intelligence concepts.

If you’re interested in working on these kinds of projects please check our careers site, or email me at malcolm [at] alliedcrowds.com!

Alternative Finance Data for Emerging Markets: Building the Capital Finder (Part II)

The following was written by CTO Malcolm Kapuza, who explains how alternative finance data in the Capital Finder is collected, how funders are categorized, and how we use natural language processing and machine learning to enhance our database.

As I mentioned in the first part of this three-part series, the Capital Finder has many useful applications. In this second part, I go into detail about where we source our different alternative finance data points and how we turn this data into one coherent and intuitive view of developing world alternative finance. In the final part of the series, I walk readers through the steps, and even some of the actual code, that make this all possible.

This is a semi-technical explanation to show what’s “under the hood,” and to demonstrate what we do at AlliedCrowds that is so different from other organizations in this space.

Sourcing and maintaining the data

The key to a product like the Capital Finder is having the most current and accurate data in the industry. There are a couple of ways that we stay on top of this to ensure that our clients are able to make strategic decisions and inform research based on the best available data out there.

Firstly, we have a team of multilingual analysts, spread throughout the developing world. Local knowledge is crucial in the sourcing process, as it minimizes language barriers and unforeseen geographic constraints. Our analysts have a deep understanding of the alternative finance space, meaning that finding new capital providers is relatively easy.

Secondly, we have developed deep industry connections through our time analyzing and researching this space, which means that we are one of the first to know when a new funder opens up, or an existing one expands into one of our target geographies.

Finally, we are constantly pushing the envelope when it comes to innovating with technology solutions at AlliedCrowds. Therefore, we have developed programmatic processes that flag new capital providers that emerge and alert us when certain information in our database has gone stale.

Finding suitability

The quality and accuracy of our alternative finance data is only one aspect of what we do. The other is our ability to pair relevant projects with capital providers, as well as categorize capital providers based on continually changing criteria.

Given the sheer size of our data and the relatively small size of our team, we depend on cutting-edge technologies to make this possible. I will outline three use cases to give some idea of why this works so well and how we’re able to do it with minimal resources.

Alternative finance data collection

AlliedCrowds has streamlined and continues to improve our data collection process. Our main focus is to increase automation, while also improving accuracy and data integrity.

We gather text from thousands of websites and millions of web pages, creating a data warehouse of text. Recent developments in Natural Language Processing (NLP) have allowed us to rapidly and continuously improve our view of the space, because it takes only moments to process all of this text rather than months to revisit every website individually. This means that our insights are faster, more accurate and more scalable than if this analysis were completed manually.

africa impact investing

An example of the sort of data we can showcase using the Capital Finder.

Additionally, we source data from 3rd party APIs in order to get a more complete and accurate view of each capital provider. These sources are invaluable, each giving us unique and actionable data, and each working as a trusted check and balance to allow us to spot anomalies. These include social media platforms, news agencies, development institutions, public records, etc.

We crowdsource information that we cannot collect programmatically and which would be prohibitively expensive to collect through analysts. We have used technology to create straightforward ways for providers from our database to deliver us valuable information. This streamlined process is simple and incentivized with increased visibility on the Capital Finder and inbound traffic, which has led to high engagement.

Intelligent Categorization

As our technology advances, we have begun to reduce the workload of our analyst team. For instance, analysts do not fill in country- and sector-level information, because we have found our algorithms to be more accurate, faster, and more cost efficient than manual input. This reduces human error in our data and allows us to scale massively.

Through this process, we are close to eliminating much of the decision-making from our analyst roles. The goal is for every entry to be fact based and subsequently fact-checked, which eliminates the need for time-consuming and costly training and removes ambiguity from the data collection processes. This also ensures scalability and consistency throughout the system, as well as much lower maintenance costs, as we eliminate dependence on any given analyst’s specific skill set or expertise. The system channels most of the decision-making and reasoning to the highest level and therefore allows the Capital Finder to be managed centrally. Anything that an analyst does to improve the system gets spread throughout all 138 countries, as well as all sectors and provider types.

Bespoke matching

These advances have allowed us to create a distinctly effective matching system. With our data pool, we are able to comb through millions of web pages to target specific keywords and phrases on these sites. Since we also track social followings, recent trends, public filing information, and unique provider statistics, we are able to gauge the suitability of a certain project to each provider.

The goal of our matching algorithm is much like that of Google’s PageRank algorithm. When you use Google to query a phrase, Google returns not only every website that is relevant to your search, but orders them based on which it has determined will be most relevant. Since this algorithm is so effective, you rarely look beyond the first or second search result. We are making it so that finding a capital provider is just as easy.
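To give a feel for what ranked matching means, here is a deliberately simple toy that orders providers by keyword overlap with a project. It is not the Capital Finder’s matching algorithm, which draws on many more signals than this; the rank_providers function and the sample data are invented for illustration.

```python
def rank_providers(project_keywords, providers):
    """Order providers by how many project keywords appear among their tokens."""
    wanted = set(project_keywords)
    scored = [(len(wanted & set(p["tokens"])), p["name"]) for p in providers]
    # Highest score first; ties are broken alphabetically by name
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Providers with no overlap at all are left out of the shortlist
    return [name for score, name in scored if score > 0]


sample_providers = [
    {"name": "A Fund", "tokens": ["solar", "energy", "kenya"]},
    {"name": "B Bank", "tokens": ["savings", "kenya"]},
    {"name": "C Capital", "tokens": ["fintech"]},
]
shortlist = rank_providers(["solar", "kenya"], sample_providers)
```

Here “A Fund” matches both keywords and is listed first, “B Bank” matches one and comes second, and “C Capital” matches none and is dropped.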

Stay tuned for the next post where I discuss in more technical detail the nuances of what makes the Capital Finder so effective.

Alternative Finance Data for Emerging Markets: Explaining the Capital Finder

The following was written by CTO Malcolm Kapuza, who explains how the Capital Finder can be used to find suitable funders for your business, and how it provides rich alternative finance data.

One of our flagship products at AlliedCrowds is the Capital Finder, which has a lot going on under the hood. I decided to write a three-part blog series to help our readers better understand what we are up to and why it is important. This is Part I.

In this first part, I give readers a very high-level view of what we are using this data to accomplish. In the second part, I explain in more detail how the data is gathered and processed. Finally, in the third part, I go into some of the machine learning and natural language processing code used to make it all happen.

What is the Capital Finder?

The Capital Finder is a proprietary database that contains over 7,000 alternative capital providers across the developing world (ex-China).

The Capital Finder contains all of the sources of capital available to projects in a given country. This includes:

  • venture capital and private equity firms,
  • angel investor networks,
  • impact investors,
  • crowdfunding platforms,
  • accelerators,
  • foundations,
  • development banks,
  • international organizations,
  • retail banks and more.

Essentially, it is an exhaustive list of where you can raise money in each country.

Furthermore, this data is highly searchable and customizable, and contains a lot of detailed information about each capital provider. We will get into this below.

What are some common use cases?

We group the uses for the Capital Finder into two broad functionalities: the Searchable Directory and the Data Platform.

Searchable Directory

The Searchable Directory is the entrepreneur-facing side of the Capital Finder. It is meant for projects, firms, organizations, and entrepreneurs who are looking to identify the firm that is most likely to provide them with funding. We manage to do this on a global scale, finding the handful of suitable funders out of thousands that are potentially relevant.


The Capital Finder is the ideal capital sourcing tool and it will produce an effective shortlist for you to get started. But what if you don’t have your pitch down? This is why we created the Entrepreneur Hub. The Capital Finder tells you who to contact, while the Entrepreneur Hub explains how to approach them.


The Capital Finder on AlliedCrowds.com is a simple free version with the basic functionality to allow project leaders and developing world entrepreneurs to get started. We offer additional services for paid clients; these include:

  • keyword/phrase searching: we comb thousands of websites and millions of web pages in order to discover which capital providers have mentioned which key phrases; this enables us to narrow down your search
  • business plan matching: we use natural language processing to analyze text provided in a business plan or project write up, and use machine learning to match projects to the most suitable funders around the world
  • additional data points: lots of value lies in the financial and deal data that we have acquired for a growing number of capital providers

Regardless, we think the free version is plenty for an individual. Often, we find that users do not need to search past the first or second page to find compatible funders (much like you don’t search past the first or second page of Google to answer your question). However, we recommend that organizations looking to provide solutions at scale get in touch.

You can find public examples of the Searchable Directory in use in the AlliedCrowds Capital Finder, the SDG Capital Finder, or the WGEO Capital Finder.
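
To give a feel for what matching a project write-up to funders can look like, here is a minimal sketch. It is not the Capital Finder's actual algorithm: it swaps in a simple bag-of-words cosine similarity as a stand-in for the natural language processing and machine learning mentioned above. The function names, provider names, and descriptions are all hypothetical and invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_providers(project_text, provider_texts):
    """Return (provider, score) pairs sorted from most to least similar."""
    project_vec = Counter(tokenize(project_text))
    scores = {
        name: cosine_similarity(project_vec, Counter(tokenize(text)))
        for name, text in provider_texts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical provider descriptions, invented for this example.
providers = {
    "Provider A": "seed funding for solar energy startups in east africa",
    "Provider B": "microfinance loans for smallholder agriculture",
    "Provider C": "grants for education technology",
}
project = "We are a solar energy startup seeking seed funding"

ranking = rank_providers(project, providers)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Sorting by score is what turns a long list of potentially relevant funders into a shortlist: the best matches come first, so a user rarely needs to read past the top few. A production system would likely weight rare terms more heavily and handle word variants such as "startup" and "startups", which this sketch treats as different words.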

Data Platform

The Data Platform is linked to the entrepreneur-facing product in that they share the same database. They are, however, different in how each product uses the data.

The most straightforward use case for the Data Platform is on display in the alternative finance data we feature in our reports. These reports are generated almost entirely automatically via the Data Platform. Generally speaking, we use the Data Platform to facilitate research, consulting and advisory, and, of course, alternative finance data analysis.

As an example, consider an organization that would like to investigate how many Kenyan alternative finance providers fund agriculture products, and how this compares to its neighbors. The Data Platform can show how Kenya compares to other East African countries relative to the share of land used for agriculture, informing decisions on which country is performing better on a relative, rather than simply an absolute, basis.

Alternative capital providers funding agriculture vs. agricultural land across Africa, focusing on East Africa
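
To make the relative comparison concrete, here is a minimal sketch of the arithmetic. It assumes that the comparison divides the number of providers funding agriculture by the percentage of a country's land that is agricultural; the exact normalization used in our reports may differ. The function name, country labels, and figures are placeholders invented purely for illustration and are not data from the Capital Finder.

```python
def providers_per_land_share(provider_count, agri_land_pct):
    """Providers funding agriculture per percentage point of agricultural land."""
    if agri_land_pct <= 0:
        raise ValueError("agri_land_pct must be positive")
    return provider_count / agri_land_pct


# Placeholder figures, invented only to show the arithmetic; not real data.
countries = {
    "Country A": {"providers": 120, "agri_land_pct": 48.0},
    "Country B": {"providers": 90, "agri_land_pct": 30.0},
}

relative = {
    name: providers_per_land_share(data["providers"], data["agri_land_pct"])
    for name, data in countries.items()
}

for name, value in relative.items():
    print(f"{name}: {value:.2f} providers per % of agricultural land")
```

In this made-up example, Country A has more providers in absolute terms, but Country B has more providers relative to its agricultural land, which is the kind of distinction the relative view is meant to surface.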

The scalability of the database allows AlliedCrowds not only to make country-level comparisons (e.g., Tanzania, Uganda, Rwanda, and Kenya), but also to create entire indices (e.g., country rankings across the developing world). This gives us the ability to map the entire developing world based on a given sector (e.g., agriculture) and/or funding type (e.g., venture capital). Based on this analysis, we can see which countries are overachieving relative to their peers, and begin to explore why. Ultimately, this analysis can help us to come up with policy recommendations that result in more money flowing to projects and MSMEs in developing countries, leading to job growth and economic expansion on a macro level.
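
As a rough illustration of how such an index could be assembled, the sketch below ranks countries by a single score and flags those above the peer-group median as overachievers. This is an assumption made for the example: a real index would likely combine several indicators and a more careful definition of peers. The function names, country labels, and scores are hypothetical.

```python
from statistics import median


def build_index(scores):
    """Rank countries from highest to lowest score; rank 1 is the top."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, name, score)
        for rank, (name, score) in enumerate(ordered, start=1)
    ]


def overachievers(scores):
    """Return the countries whose score is above the peer-group median."""
    peer_median = median(scores.values())
    return sorted(name for name, score in scores.items() if score > peer_median)


# Placeholder scores, invented only for illustration; not real data.
scores = {
    "Country A": 2.5,
    "Country B": 3.0,
    "Country C": 1.2,
    "Country D": 2.0,
}

index = build_index(scores)
above_median = overachievers(scores)

for rank, name, score in index:
    print(f"{rank}. {name} ({score})")
print("Above peer median:", ", ".join(above_median))
```

Comparing each country to the median of its peer group, rather than to a fixed threshold, keeps the focus on relative performance, which is the starting point for asking why some countries do better than others.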

See the following reports for some examples of how our data has been used for clients in the past: Financial Sector Deepening Africa, UNDP Indonesia, and World Bank. Feel free to contact us to find out more about how AlliedCrowds can provide custom reports and indices for your organization!

Kapuza’s second post in this series discusses how the alternative finance data in the Capital Finder is sourced and categorized. You can read that post here.

Find Capital, Get Fundraising Advice, and More: Relaunch of Our Website

Technology has the potential to revolutionize the development sector — and we’re relaunching our website to show how this can be done. Our new site features an improved Capital Finder, Entrepreneur Hub, and Reports on alternative finance.

It’s been a long time since we’ve updated our site. AlliedCrowds.com has evolved from a crowdfunding project aggregator, to a live dashboard providing unique insight on crowdfunding in emerging markets, to an early version of the Capital Finder alternative finance database, and finally to a Capital Finder-centric platform. Today, we’re relaunching our site to combine the best aspects of the previous versions, and to highlight how technology can innovate the development sector.

The new site, which launched earlier today, is the best iteration yet. Here’s why:

Updated Capital Finder

The Capital Finder is a unique database of 7,000 alternative capital providers across emerging markets. We’ve worked hard to make it the go-to place for entrepreneurs looking for investors, development organizations looking for partners, and analysts looking for original data. The database has also been instrumental in compiling unique datasets and analysis for our reports and consulting engagements.

Today, we’re announcing the addition of new funder types to the Capital Finder, as well as a complete re-categorization of each funder on the platform. In addition to angel investor networks, crowdfunding platforms, impact investors, public / semi-public funders, and venture capital firms, we’ve added the following five types of funders:

  • Banks
  • Foundations
  • Microfinance institutions
  • Nonprofits
  • Private equity firms

More new categories will be added soon.

The new, refined categories mean more accurate results. It’s a significant upgrade with a lot going on behind the scenes that will make the Capital Finder even more useful to entrepreneurs, development organizations, analysts, alternative capital providers, and many more.

We’ve also added profiles for each of the funders in our database.

Interactive Entrepreneur Hub

Knowing who to reach out to for funding is critically important. But where do you go from there?

We have given entrepreneurs information that helps them better understand the fundraising process. The Capital Finder has always been an effective tool that helps entrepreneurs answer the question of who to fundraise from. Our Entrepreneur Hub is a natural extension of this, as it helps answer the question of how to fundraise.

This means providing entrepreneurs with templates of documents they need to present to investors, tips from entrepreneurs who have been in their shoes in the past, explainer videos for key concepts, descriptions of capital types (grant, debt, equity) and funder types (VC, impact investor, etc.), and much more.

The new Hub makes this information more readily available, and gives entrepreneurs the ability to focus on the aspects they need help with most.

Blockchain and Cryptocurrency

Over the last year, we have paid close attention to blockchain and cryptocurrency developments.

We’ve always recognized the disruptive potential, and we recently made the decision to play an active role in this space. We’ve already built tools that allow us to track cryptocurrency prices in a new way, and we’re gathering insights on how that data can be used to further our mission of innovating development.

Based on our research and data, we have engaged with blockchain in the following areas:

  • Governance
  • Finance
  • Contracts
  • Corruption
  • Entertainment
  • Social media

Find out more about our blockchain services here.

More to Come!

It’s an exciting time at AlliedCrowds as we continue to roll out new products and services, improve our current offerings, and engage new clients.

We have some exciting announcements coming, including new clients, new products, and new reports. Stay tuned! If you haven’t already, keep up with the latest by subscribing to our newsletter here.