ai on Amit Bahree's (useless?) insight!

AI generated Podcast for my book: Generative AI in Action 🎧

Sun, 13 Oct 2024 00:00:00 +0000

The one thing I wanted to do after my book Generative AI in Action was complete was to create a summary in natural speech and possibly use TTS (Text-to-speech) to create an audio summary—think of it as a podcast that is easier for people to consume and get a quick sense of what the book is about.

TTS (Text to Speech) or not to TTS?

Initially, I was inclined towards using TTS for the audio summary, since it seemed a convenient way to create a podcast-like overview of the book. My journey began with TTS: I used GPT-4o to create a summary after ingesting the book and then fed that summary into the Azure AI Speech stack. However, I stumbled across something intriguing. Instead of TTS, I opted for NotebookLM from Google Labs to generate the audio overview - the podcast. This decision marked a significant shift in my approach, and I created two podcasts using the content from the book - one from multiple sources and another from a single source (the book).

NotebookLM is an experimental AI-first notebook from Google Labs designed to help users gain insights faster by grounding the language model in their documents. It aims to assist with synthesizing facts and ideas from multiple sources, making connections quicker and easier. It can help users understand, summarize, and generate new ideas based on their content. What is fascinating is that it can generate audio, which is a natural dialog between two people - with wit, humor, and a natural flow. It is like conversing with someone who has read the book and is summarizing it for you. And if I hadn’t told you that this was AI-generated, it would be hard to tell that it was not a real conversation.

The “Podcasts”

For the first audio generation, the multiple sources I used were:

The book, and my blog post announcing the book
A real podcast I did with Miko on his podcast Hockeystick
And another real podcast with Jamie on his podcast The modern .NET show.

For the second audio generation I only used a single source - the book.

The results of the AI-generated audio are truly impressive. In each instance, the audio was produced in a natural voice, simulating a genuine conversation between two people. The quality of the audio, the conversation, and the flow are nothing short of mind-blowing. Even in the second audio, where some acronyms were not pronounced correctly, it’s a minor issue, considering the audio was generated in just a few minutes with the press of a button. I was genuinely surprised at the level of realism and natural flow in the conversation. 🤯

Have a listen and let me know what you think.

Podcast Summary 1 - using multiple sources

The sources provide a comprehensive overview of generative AI and its application within enterprises. The first source, a YouTube video transcript, features an interview with Amit Bahree, a technical program manager at Microsoft, who discusses the rise of large language models (LLMs) and their potential impact on society. The second source, a book excerpt, delves into the technical aspects of generative AI, covering foundational models, large language models, retrieval-augmented generation, and the architectural principles for building generative AI applications. The book also explores various use cases, including image generation, code generation, and the ethical considerations surrounding the use of generative AI.

Book Podcast - using multiple sources

Podcast Summary 2 - using single source (the book)

This book is a comprehensive guide to Generative AI, focusing on how this transformative technology can be leveraged within an enterprise. It explains core concepts like foundational models and large language models (LLMs) as well as practical applications for generating various content, including text, images, code, audio, and video. The book also explores responsible AI practices, highlighting the importance of prompt engineering, ethical considerations, and security measures for implementing these technologies. The author emphasizes the need for careful evaluation, monitoring, and scalability when deploying Generative AI models in a production environment.

Book Podcast - using single source

Conclusion

AI-generated podcasts showcase the potential of AI to revolutionize content creation. Generating natural-sounding audio summaries from text is a game-changer for authors, creators, and educators. As AI continues to advance, we anticipate more opportunities to improve how we access information. I am excited to explore the new possibilities in AI-generated content.

🎉Announcing My New Book: Generative AI in Action📚

Mon, 16 Sep 2024 00:00:00 +0000

In today’s rapidly evolving tech world, mastering Generative AI isn’t just an advantage—it’s a necessity. Are you ready to harness its power to transform your business and solve real-world challenges? I’m excited to announce that my new book, Generative AI in Action, is now available in print and ebook formats from Manning Publications . 📖

Special Launch Offer 🌟

As a thank-you to my early supporters, I’m offering an exclusive discount. Use the code pbbahree at checkout to receive 45% off your purchase of Generative AI in Action in all formats (valid through Sept. 30, 2024)!

Get your discounted copy 👉 » here « .

Cover of Generative AI in Action

Why You Should Read Generative AI in Action ✨

Comprehensive Coverage

This book offers an in-depth introduction to Generative AI, covering foundation models, large language models (#LLMs), small language models (#SLMs), and practical applications. From the basics to advanced topics like prompt engineering, Generative AI in Action provides everything you need to start building and scaling AI solutions. Whether you’re a beginner or a seasoned professional, you’ll find valuable insights to accelerate your AI journey.

Real-World Examples

Discover how enterprises across industries are leveraging Generative AI to innovate and solve complex problems. Whether it’s improving customer engagement or optimizing operations, the practical examples provided can be directly applied to your projects for immediate impact.

Hands-On Techniques

Dive into step-by-step guides and hands-on examples for integrating AI models into your workflows. Learn techniques such as:

Prompt Engineering: Craft effective prompts to unlock the full potential of AI models like GPT-4.
Retrieval-Augmented Generation (RAG): Enhance your AI models with real-time data for improved accuracy.
Model Adaptation: Fine-tune AI models to meet your organization’s specific needs.

Ethical AI and Best Practices

As AI becomes more critical in decision-making, understanding its ethical implications is crucial. Generative AI in Action covers topics like privacy, security, and bias mitigation—ensuring your AI deployments are fair, transparent, and aligned with your organizational values.

Expert Insights

Drawing from my experience helping build the Azure AI platform, I share insider knowledge on leveraging the latest advancements in AI for your projects. This book provides you with the tools to make the most of cutting-edge technologies like large language models (LLMs) and small language models (SLMs).

Advanced Techniques Covered in the Book 🔥

Beyond the basics, Generative AI in Action delves into advanced techniques essential for mastering Generative AI in modern enterprise environments:

Prompt Engineering: Strategies like zero-shot, few-shot, and many-shot learning, along with chain-of-thought reasoning, to optimize AI outputs.
Retrieval-Augmented Generation (RAG): Combine retrieval-based methods with generative models for real-time, relevant data integration.
Model Adaptation and Fine-Tuning: Customize generative models to specific tasks using techniques such as low-rank adaptation and reinforcement learning from human feedback (RLHF).
Chatting with Your Data: Build AI-powered chat systems that interact with enterprise data using vector databases and retrieval techniques.
Scaling and Production Deployment: Strategies for scaling AI solutions while ensuring performance, reliability, and compliance with enterprise standards.
Evaluations and Benchmarks: Learn to evaluate and benchmark AI models using traditional metrics and cutting-edge frameworks.

Explore the GitHub Repository 💾

For those eager to dive into the code, the book has a companion GitHub repository filled with examples and projects to get you started. Check it out at bit.ly/GenAIBook . Explore the code, experiment, and start building your AI-powered solutions today.

Get Your Copy Today! 🛒

Don’t miss this opportunity to lead the AI revolution within your organization. Order your copy of Generative AI in Action and use the code pbbahree to receive 45% off (valid through Sept. 30, 2024). Transform your organization’s AI capabilities today!

With gratitude 💚

Amit Bahree.

In the world of AI, there’s a thrill,
With “Generative AI in Action,” you’ll skill.
From prompts to fine-tuning,
Your projects are blooming,
Grab your copy, and master the drill!

SLMs - Running Phi-3 on an iPhone and locally

Wed, 08 May 2024 00:00:00 +0000

We released Phi-3 recently, which builds on Phi-2 (read more on that here), and it is a great model to use for various tasks. In this post, we will show how to run Phi-3 locally, including a demo of it running on a phone.

1. What are Small Language Models (SLMs)?

Before diving into running Phi-2 locally, let’s take a moment to understand the concept of small language models (SLMs) and their significance in natural language processing (NLP). An SLM is a type of AI model that is trained on a large text dataset but is smaller and more limited in capability than a Large Language Model (LLM). SLMs are designed to be more lightweight and efficient, making them suitable for various applications, including chatbots, language translation, and content generation. SLMs have far fewer parameters than LLMs and are typically trained on less data, so they have a lower computational cost, making them better suited to edge or resource-constrained devices.

2. What is Phi-2?

Phi-2 is the latest model in the Phi series of small language models (SLMs) that aim to break the conventional scaling laws of language models. Unlike large language models (LLMs) that require massive amounts of data and compute resources, Phi models are trained on a mixture of web-crawled and synthetic “textbook-quality” data, following the idea of Textbooks Are All You Need . Phi models also leverage innovations in model architecture, optimization, and data augmentation to achieve remarkable performance on various benchmarks. 😀

Phi-2 is twice as large as its predecessor Phi-1.5, and was trained for two weeks on a cluster of 96 A100 GPUs. It demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance (for LLMs <13 billion parameters).

As shown in the figure below on complex benchmarks, Phi-2 matches or outperforms models up to 25x larger, such as Llama-2 and Gemini Nano. Phi-2 is also an ideal playground for researchers, as it allows for exploration around mechanistic interpretability, safety improvements, or fine-tuning experimentation on various tasks. Phi-2 is available in the Azure AI Studio (via the model catalog ) and on Hugging Face.

Phi-2 Evaluation

2.1 Phi-2’s Capabilities

What better way to show this than with some prompts that Phi-2 can handle:

Creative Writing:
- 🗒️Prompt: “Write a short story about a time-traveling computer geek who discovers a secret algorithm book with the power to change computing history.”
- 🤖 Phi2-Generated output: “In the dimly lit archives of the ancient library, Amelia stumbled upon a dusty tome unlike any other. Its pages whispered secrets of forgotten civilizations and whispered promises of altered destinies. As she traced her fingers over the delicate script, the room seemed to blur, and suddenly, she was standing in the heart of Renaissance Florence. The air smelled of ink and ambition. The book had worked its magic, and Amelia knew she held the fate of empires in her hands.”
Philosophical Reflection:
- 🗒️Prompt: “Discuss the concept of free will versus determinism.”
- 🤖 Phi2-Generated output: “Free will and determinism have long been at odds in philosophical discourse. Some argue that our choices are predetermined by external factors—genetics, upbringing, societal norms—while others champion the idea that we possess true agency. Perhaps the truth lies in the delicate dance between the two. We navigate life’s labyrinth, sometimes guided by fate, other times forging our own path.”
Educational Explanation:
- 🗒️Prompt: “Explain the process of photosynthesis in simple terms.”
- 🤖 Phi2-Generated output: “Photosynthesis, the magical dance of plants and sunlight, is how they create their own food. Imagine tiny chlorophyll factories in leaves, capturing sunlight like solar panels. They convert that energy into chemical fuel—glucose—while releasing oxygen as a byproduct. So next time you breathe in that fresh air, thank a leaf for its photosynthetic prowess!”
Humorous Scenario:
- 🗒️Prompt: “Describe a day in the life of a mischievous squirrel.”
- 🤖 Phi2-Generated output: “Meet Nutty, the squirrel extraordinaire. His mornings begin with acorn heists, followed by elaborate treehouse construction. He’s the neighborhood’s unofficial alarm clock, chattering loudly at dawn. But his pièce de résistance? Prank-calling the crows, convincing them they’ve won a lifetime supply of shiny objects. Nutty’s motto: ‘Life’s too short not to be a little nuts.’”

3. Small Language Models (SLMs) vs Large Language Models (LLMs)

Large Language Models (LLMs) are a type of AI model that is much larger and more powerful than SLMs. LLMs have hundreds of billions of parameters and are trained on massive text datasets. This gives LLMs the ability to handle complex tasks, such as language generation, translation, and question answering, with high accuracy and fluency. However, LLMs also have some disadvantages. They are larger, making them more expensive and slower to train. They also have a higher computational cost, meaning they may require access to specialized hardware.

On the other hand, SLMs, as we called out, are smaller and more lightweight than LLMs, making them more efficient and cost-effective in the compute they need for both training and inference. While it might seem that SLMs are therefore a natural fit for edge or resource-constrained devices, such as mobile phones or IoT devices, they are small only relative to LLMs and still require significant computational resources to run. Phi-2, for example, still has 2.7B parameters, and while it can run inference on a CPU, doing so is very slow and impractical for real-time applications. One would need a GPU or a cloud-based service for any realistic use case.
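
To put the 2.7B figure in perspective, here is a back-of-the-envelope sketch of how much memory the weights alone need at different precisions. The function name and the bytes-per-parameter table are illustrative assumptions; activations, the KV cache, and framework overhead are not counted, so real usage will be higher.

```python
# Rough estimate of the memory needed just to hold a model's weights.
# Illustrative only: ignores activations, KV cache, and runtime overhead.

BYTES_PER_PARAM = {
    "float32": 4,
    "float16": 2,
    "int8": 1,
    "int4": 0.5,
}


def weight_memory_gb(num_params: float, dtype: str = "float16") -> float:
    """Return the approximate weight size in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    phi2_params = 2.7e9  # parameter count quoted above for Phi-2
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gb(phi2_params, dtype):.2f} GB")
```

At 16-bit precision this works out to roughly 5.4 GB for the weights alone, which helps explain why a GPU is strongly suggested for anything interactive.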

3.1 When to use SLM vs LLM?

Firstly, neither model is inherently better - the choice between an SLM and an LLM depends on the specific application and requirements. SLMs are a good choice when size, cost, and speed are important considerations. LLMs are a better choice when high performance and complex capabilities are required. If the task at hand is narrow and in one of the supported languages, an SLM may be sufficient; more complex tasks, or tasks requiring high accuracy and fluency, may need an LLM.

Furthermore, it is key to understand that it is not necessarily about the number of languages understood but rather the depth and nuance with which each model can understand and generate language. SLMs are designed to be efficient and effective within their scope, which may include a wide range of languages. LLMs like GPT-4, due to their size and complexity, often can understand and generate text in a larger number of languages and with greater nuance.

The choice between an SLM and an LLM would again depend on the specific requirements of the task, including the languages involved and the level of language understanding and generation needed. Using a combination of SLMs and LLMs is common to achieve the best results for a given application.
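
One way to picture combining the two is a small router that sends narrow, low-stakes requests to an SLM and everything else to an LLM. The sketch below is purely hypothetical: the Task fields, the language set, and the routing rule are assumptions made for illustration, not something taken from Phi-2 or any particular product.

```python
from dataclasses import dataclass

# Hypothetical set of languages the SLM handles well (an assumption).
SLM_LANGUAGES = {"en"}


@dataclass
class Task:
    language: str              # language code of the request, e.g. "en"
    is_narrow: bool            # True for a well-scoped, single-purpose task
    needs_high_accuracy: bool  # True when mistakes are costly


def choose_model(task: Task) -> str:
    """Return "slm" or "llm" using a deliberately simple rule of thumb."""
    if (task.is_narrow
            and task.language in SLM_LANGUAGES
            and not task.needs_high_accuracy):
        return "slm"
    return "llm"


if __name__ == "__main__":
    print(choose_model(Task("en", is_narrow=True, needs_high_accuracy=False)))
    print(choose_model(Task("en", is_narrow=False, needs_high_accuracy=True)))
```

A real system would likely base the decision on measured quality and latency rather than hand-written flags, but the shape of the decision is the same.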

4. Running Phi-2 locally

On one hand, running this is simple if you don’t want to program anything and only want to use the model. The easiest option in this case is to use LM Studio, a desktop application for running language models locally. You can use the Hugging Face API to download and run the model.

We use a simple console chat example that runs locally. We use the Hugging Face Transformers library to generate text based on user input. The user can generate a story, a haiku, or a joke on a topic of their choice. Here is how to run it locally on a Windows machine - the same should apply to a Mac or Linux machine.

The full code is below, but here are the key aspects to grok when running Phi-2 locally.

The key is to use the AutoModelForCausalLM and AutoTokenizer classes from the transformers library to load the Phi-2 model and tokenizer.
We then use the generate method to generate text based on a user prompt. The generate method takes the tokenized user prompt as input and returns the generated tokens, which we decode back into text.
We use the from_pretrained method to load the model and tokenizer from the Hugging Face model hub.
We also use the save_pretrained method to save the model and tokenizer to a local directory. This allows us to load the model and tokenizer from the local directory if they are already saved, which can help save time and resources.

The following code snippet is what loads the model and the tokenizer:

# Download the model and tokenizer from Hugging Face
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2",
                                            torch_dtype="auto",
                                            trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2",
                                         trust_remote_code=True)

And the following is where we encode the user input and call the generation. First, we tokenize the user prompt; the resulting tokens are returned as PyTorch tensors. Then, the model generates text based on the tokenized input. The max_length setting caps the total sequence, prompt plus generated tokens, at 500 tokens, and the end-of-sequence token is used for padding if necessary. Finally, the generated tokens are decoded back into human-readable text.

inputs = tokenizer(prompt,
                   return_tensors="pt",
                   return_attention_mask=False,
                   add_special_tokens=False)
outputs = model.generate(**inputs,
                         max_length=500,
                         pad_token_id=tokenizer.eos_token_id)
text = tokenizer.batch_decode(outputs)[0]
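
Because max_length counts the prompt tokens as well as the generated ones, the room left for new text shrinks as the prompt grows. The sketch below uses plain Python lists as stand-ins for token IDs (no model is loaded) to show that budget and one way to drop the prompt tokens from the output before decoding. Both helper names are hypothetical.

```python
from typing import List


def new_token_budget(prompt_len: int, max_length: int = 500) -> int:
    """Tokens left for generation when max_length caps prompt + output."""
    return max(0, max_length - prompt_len)


def strip_prompt_tokens(output_ids: List[int], prompt_ids: List[int]) -> List[int]:
    """A causal LM returns the prompt followed by the continuation,
    so slicing off len(prompt_ids) leaves only the new tokens."""
    return output_ids[len(prompt_ids):]


if __name__ == "__main__":
    prompt_ids = [101, 102, 103]             # stand-in token IDs
    output_ids = prompt_ids + [7, 8, 9, 10]  # prompt + generated tokens
    print(new_token_budget(len(prompt_ids)))
    print(strip_prompt_tokens(output_ids, prompt_ids))
```

Slicing by token count is an alternative to removing the prompt with a string replace after decoding, and it does not depend on the decoded text matching the prompt character for character.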

The full code is below. The code is a simple console chat example that runs locally. The user can generate a story, a haiku, or a joke on a topic of their choice.

Some examples of what Phi-2 can generate using the above code are shown below. The first is a story about pandas and dogs.

Story about Pandas 🐼 and Dogs 🐶

Here is another example of a Haiku and a Joke generated by Phi-2 on Pandas.

Haiku and Joke about Pandas 🐼

Switching gears, let’s look at how we can implement the RAG using Phi-2.

4.1 Running Phi-2 Locally - Full Code

The following code is the complete code that executes the examples we showed before for running Phi-2 locally. This can work on a CPU, but it is very slow, and a good GPU is strongly suggested.

import os
import warnings
import logging
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DEBUG = False

# Suppress warnings and set the logging level to ERROR
warnings.filterwarnings("ignore")
logging.getLogger("transformers").setLevel(logging.ERROR)

# Define the directory where you want to save the model and tokenizer
MODEL_PATH = "./local_model"

# Check if the model and tokenizer are already saved locally
if os.path.exists(MODEL_PATH):
    if DEBUG:
        print(f"Loading model and tokenizer from local directory: {MODEL_PATH}")

    # Load the model and tokenizer from the local directory
    model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
else:
    if DEBUG:
        print(f"Downloading model and tokenizer from Hugging Face")

    # Download the model and tokenizer from Hugging Face
    model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

    if DEBUG:
        print(f"Saving model and tokenizer to local directory: {MODEL_PATH}")

    # Save the model and tokenizer locally
    model.save_pretrained(MODEL_PATH)
    tokenizer.save_pretrained(MODEL_PATH)

if DEBUG:
    print(f"Model device: {model.device}")
    print(f"CUDA available: {torch.cuda.is_available()}")

# Set the default device to CUDA if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def handle_prompt(user_input, type_of_text)->str:
    prompt = f"Instruct: Write a {type_of_text} about {user_input}\nOutput:"

    with torch.no_grad():
        inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False, add_special_tokens=False)
        inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}
        outputs = model.generate(**inputs, max_length=500, pad_token_id=tokenizer.eos_token_id)
        
    text = tokenizer.batch_decode(outputs)[0]

    # Remove the prompt from the output text
    text = text.replace(prompt, '').strip()
    text = text.replace("<|endoftext|>", '').strip()
    
    if DEBUG:
        print(f"Answer:{text}")

    return text

if __name__=="__main__": 
    while True:
        print("What would you like to write today?")
        print("1. Story 📝")
        print("2. Haiku ✍️")
        print("3. Joke 😆")
        print("4. Quit 👋")
        user_choice = input("Choose an option:")

        if user_choice == '4':
            break

        user_prompt = input("And on which topic:")

        if not user_prompt.strip():
            print("Input cannot be empty or consist only of spaces.")
            continue

        if user_choice == '1':
            print(handle_prompt(user_prompt, 'story'))
        elif user_choice == '2':
            print(handle_prompt(user_prompt, 'haiku'))
        elif user_choice == '3':
            print(handle_prompt(user_prompt, 'joke'))
        else:
            print("Invalid choice. Please choose a valid option.")

        print("_" * 100)

5. Implementing Retrieval-Augmented Generation (RAG) with Phi-2

RAG is a powerful technique that combines the strengths of retrieval-based and generation-based approaches to natural language processing. RAG is one of the ways one can get proprietary information and knowledge to the model and use it as part of the prompt. It leverages a retriever to find relevant context passages and a generator to produce fluent and coherent responses. The retriever identifies relevant context passages, and the generator uses these passages to generate a response.

This approach allows RAG to produce high-quality, informative, and contextually relevant responses. In-context learning is a key feature of RAG, as it allows the model to learn from the context of the conversation and generate more accurate and relevant responses. This is particularly useful in scenarios where the model needs to understand and respond to complex queries or provide detailed information on a specific topic.

At a high level, the process of implementing RAG involves the following steps:

Generate Embeddings with Phi-2:
- Use Phi-2 to encode your context passages (documents) and extract their embeddings.
- These embeddings will represent the semantic content of each passage.
Create a Vector Index:
- Choose a vector index library or framework (such as Faiss, Annoy, or HNSW).
- Initialize an index structure to store the embeddings efficiently.
- Add the generated embeddings to the index.
Save Embeddings to a Local Vector Database:
- Create a local database to store the embeddings.
- For each context passage, save its corresponding embedding in the database.
- You can use the passage ID or a unique identifier as the key for retrieval.
Perform Similarity Search:
- When you receive a new context (query), encode it using Phi-2 to obtain its embedding.
- Use the vector index to perform a similarity search against the saved embeddings.
- Retrieve the most similar context passages based on cosine similarity or another distance metric.
- Return the relevant passages as results.
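
The steps above can be sketched end to end in a few lines. To keep the example dependency-free, a toy bag-of-words vector stands in for the Phi-2 embeddings and a brute-force scan stands in for the vector index; both substitutions are assumptions made for illustration and are not how the rest of this post does it.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a model-derived vector)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_store(passages: List[str]) -> Dict[int, Counter]:
    """Embed each passage and keep the vector under its passage ID."""
    return {pid: embed(p) for pid, p in enumerate(passages)}


def search(query: str, store: Dict[int, Counter], k: int = 2) -> List[Tuple[int, float]]:
    """Embed the query and return the k most similar (passage ID, score) pairs."""
    q = embed(query)
    scored = [(pid, cosine(q, vec)) for pid, vec in store.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    passages = [
        "pandas eat bamboo in the forest",
        "dogs like to fetch sticks in the park",
        "haiku is a short form of poetry",
    ]
    store = build_store(passages)
    for pid, score in search("what do pandas eat", store):
        print(pid, round(score, 3), passages[pid])
```

Swapping embed for a model-based encoder and the dictionary scan for a FAISS index changes the quality and speed of retrieval, but not the overall flow.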

In our example, we will use the FAISS library to create a vector index and perform a similarity search. We will also save the embeddings to a local database for efficient retrieval. FAISS (Facebook AI Similarity Search) is a library developed by Facebook for efficient similarity search and clustering of high-dimensional vectors. It allows for a quick nearest-neighbor search over large datasets and supports CPU and GPU-based computations. FAISS is widely used in information retrieval, recommendation systems, and other applications that require similarity search.
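
The post does not name the local database, so the following is only one possible shape for it: a sketch that stores each passage and its embedding in SQLite (from the Python standard library), with the vector serialized as JSON. The table and column names are made up for the example.

```python
import json
import sqlite3
from typing import List, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create (or open) a tiny embedding store keyed by passage ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS passages ("
        "id INTEGER PRIMARY KEY, text TEXT NOT NULL, embedding TEXT NOT NULL)"
    )
    return conn


def save_embedding(conn: sqlite3.Connection, pid: int, text: str,
                   embedding: List[float]) -> None:
    # INSERT OR REPLACE lets the ingestion be re-run without duplicate rows
    conn.execute(
        "INSERT OR REPLACE INTO passages (id, text, embedding) VALUES (?, ?, ?)",
        (pid, text, json.dumps(embedding)),
    )
    conn.commit()


def load_embedding(conn: sqlite3.Connection, pid: int) -> Tuple[str, List[float]]:
    row = conn.execute(
        "SELECT text, embedding FROM passages WHERE id = ?", (pid,)
    ).fetchone()
    if row is None:
        raise KeyError(pid)
    return row[0], json.loads(row[1])
```

A NumPy embedding would need to be converted with .tolist() before saving. For large collections a binary format would be more compact than JSON, so treat this as a starting point only.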

5.1 Loading data for RAG and Phi-2

To implement RAG, we use the script from the Oppenheimer movie, which is available as a PDF and is recent enough that it is not in the Phi-2 training set. We will extract the text of the script from this PDF, create embeddings for it, save those embeddings to a local database, and perform a similarity search to retrieve relevant context passages based on a user query. As noted above, we use the FAISS library to create the vector index and run the search.

We use the PyPDF2 library, a pure Python library for reading and writing PDF files, to parse the PDF. It can extract text, merge and split documents, and more. We will use it to extract the text from the Oppenheimer movie script. The following function shows how to read the PDF and extract the text page by page. This is sufficient for our use case, but it is not the most efficient way to extract text from PDFs at production scale, especially if the PDF has a lot of images and tables.

import PyPDF2

def read_pdf(file_path):
    # Yield the extracted text of each page, one page at a time.
    # Uses the PyPDF2 3.x API (PdfReader, .pages, extract_text); the older
    # PdfFileReader / getPage / extractText names were removed in that release.
    with open(file_path, 'rb') as pdf_file_obj:
        pdf_reader = PyPDF2.PdfReader(pdf_file_obj)
        for page_obj in pdf_reader.pages:
            yield page_obj.extract_text()

5.2 Generate embeddings using Phi-2

Now that we have the text, the following functions show how to create the embeddings using Phi-2. We read the text as a list of context passages and then use Phi-2 to encode each passage and extract its embedding using the encode method.

import torch
from tqdm import tqdm

def create_embeddings(file_path, tokenizer, model, device):
    if file_path.endswith('.pdf'):
        context_passages = list(read_pdf(file_path))
    else:
        with open(file_path, 'r', encoding='utf-8') as file:
            context_passages = file.readlines()

    # Drop empty passages up front so that embeddings[i] always
    # corresponds to context_passages[i]
    context_passages = [p for p in context_passages if p.strip()]

    embeddings = []
    for passage in tqdm(context_passages):
        input_ids = tokenizer.encode(passage, 
                                    return_tensors="pt",
                                    add_special_tokens=False,
                                    return_attention_mask=False).to(device)
        with torch.no_grad():
            output = model(input_ids)
            logits = output.logits
            # Mean over the sequence dimension gives one vector per passage
            embedding = logits.mean(dim=1).detach().cpu().numpy()
            embeddings.append(embedding)

    return embeddings, context_passages

Here are a few things that are going on:

Given that we are using this for inference and not training, we use torch.no_grad(), which tells PyTorch not to track gradients while executing code within this block. This reduces the amount of memory needed.
Inside this block, the input_ids are fed into the model, and the output is stored in the output variable. The logits, which are the raw, unnormalized scores outputted by the last layer of the model, are then extracted from the model’s output.
The logits are then processed to generate the embedding for the passage. The .mean(dim=1) method calculates the mean of the logits along dimension 1, which typically represents the sequence length in a language model.
The .detach() method detaches the result from the computation graph so that no gradients will be backpropagated along this variable.
The .cpu() method moves the tensor to the CPU if it’s not already there. Finally, the tensor is converted to a numpy array using the .numpy() method.
The resulting embedding is then appended to the embeddings list, which contains the embeddings for all the passages.
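
To make the pooling step concrete without loading a model, here is a minimal sketch that averages per-token vectors into a single passage vector using plain Python lists. It mirrors what .mean(dim=1) does for a batch of one; the function name and the tiny numbers are invented for illustration.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]]) -> List[float]:
    """Average a (seq_len x dim) list of per-token vectors over seq_len."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    seq_len = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / seq_len for d in range(dim)]


if __name__ == "__main__":
    # Three tokens, each with a 2-dimensional score vector
    scores = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
    print(mean_pool(scores))
```

The result has one value per dimension regardless of how many tokens the passage had, which is what lets passages of different lengths be compared with each other.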

5.3 Creating Vector Index

The following function shows how to create a vector index using the FAISS library and perform a similarity search to retrieve relevant context passages based on a user query. The create_index function initializes a flat index structure to store the embeddings and adds the embeddings to the index. The search_query function encodes the user query using Phi-2 to obtain its embedding and performs a similarity search against the saved embeddings to retrieve the most similar context passages.

import faiss
import numpy as np

def create_index(embeddings):
    # Stack the per-passage embeddings into one contiguous float32 matrix
    # of shape (num_passages, dim), which is the layout FAISS expects
    vectors = np.ascontiguousarray(np.vstack(embeddings), dtype=np.float32)
    faiss.normalize_L2(vectors)  # normalizes each row in place

    index = faiss.IndexFlatL2(vectors.shape[1])  # Euclidean distance
    index.add(vectors)
    return index

The normalize_L2() function scales each vector to unit length in place. With unit-length vectors, ranking by Euclidean distance gives the same order as ranking by cosine similarity, so the flat L2 index effectively behaves as a cosine-similarity search, provided the query vector is normalized the same way. FAISS expects a 2D float32 array of shape (number of vectors, dimension), so we stack the per-passage embeddings into a single matrix before adding them to the index.

The function finally returns the created index. This index can then be used to perform efficient similarity searches.
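
A small numeric check, using only the standard library, of why normalizing matters: for unit-length vectors the squared Euclidean distance equals 2 - 2·cos, so ranking by L2 distance and ranking by cosine similarity agree. The helper names are hypothetical and the vectors are arbitrary.

```python
import math
from typing import List


def normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    a, b = normalize([3.0, 4.0]), normalize([1.0, 0.0])
    # The two printed values should agree up to floating-point rounding
    print(squared_l2(a, b), 2 - 2 * cosine(a, b))
```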

5.4 Perform Similarity Search

The following function shows how to perform a similarity search using the vector index to retrieve relevant context passages based on a user query. As noted earlier, the most similar context passages are then retrieved based on cosine similarity or another distance metric.

The function starts by encoding the input query using a tokenizer and performs a similarity search on the FAISS index using the query embedding. It retrieves the indices of the top 3 most similar passages to the input query and then retrieves the corresponding context passages from the context_passages list. The similar context passages are then concatenated into a single string and passed to the Phi-2 model to generate a response.

def search_query(input_query, inputTokenizer, model, device, index, context_passages):
    # Given a new query context, encode it and perform similarity search
    input_ids = inputTokenizer.encode(input_query,
                                    return_tensors="pt",
                                    return_attention_mask=False,
                                    add_special_tokens=False).to(device)
    with torch.no_grad():
        input_ids = input_ids.long()
        output = model(input_ids)
        logits = output.logits
        query_embedding = logits.mean(dim=1).detach().cpu().numpy()

    # Perform similarity search - top 3 similar passages
    _, similar_indices = index.search(query_embedding, k=3)
    
    # Retrieve context passages based on similar_indices
    similar_contexts = [context_passages[i] for i in similar_indices[0]]

    # Concatenate the similar contexts into a single string
    context = ' '.join(similar_contexts)
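
For intuition about what index.search(query_embedding, k=3) hands back, the sketch below does the same job by brute force in plain Python: score every stored vector by squared L2 distance, keep the k closest, and join the matching passages. It is only an illustration under assumptions: brute_force_search is a hypothetical helper, the 2D vectors and passages are invented, and a flat FAISS index does this far more efficiently.

```python
def brute_force_search(query, vectors, k=3):
    """Return (distances, indices) of the k vectors closest to query by squared L2."""
    scored = sorted(
        (sum((q - v) ** 2 for q, v in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    )
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Hypothetical 2D "embeddings" and passages, for illustration only
context_passages = ["passage A", "passage B", "passage C", "passage D"]
vectors = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.1], [5.0, 5.0]]

distances, similar_indices = brute_force_search([1.0, 1.0], vectors, k=3)

# Same post-processing as in search_query: look up and concatenate the passages
similar_contexts = [context_passages[i] for i in similar_indices]
context = " ".join(similar_contexts)
print(similar_indices, context)
```

The indices come back ordered from closest to farthest, which is why the first retrieved passage is usually the most relevant one.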

Let us run this and see how it works. We use the Oppenheimer movie script as the source of context passages and run a similarity search to retrieve the passages relevant to each user query. The next few figures show the output when we ask questions about the movie; the answers are not in the model itself but are supplied through the retrieved context.

Example 1 - Phi-2 and RAG implementation

Example 2 - Phi-2 and RAG implementation

Example 3 - Phi-2 and RAG implementation

Now that we have seen the different elements, the code below brings everything together as a console app that one can run. The Oppenheimer script (PDF file) you need can be downloaded from here.

import os
import warnings
import logging
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import numpy as np
import faiss 
from tqdm import tqdm
import pickle
import re
import PyPDF2

DEBUG = False

warnings.filterwarnings("ignore")
logging.getLogger("transformers").setLevel(logging.ERROR)

# Define the directory where you want to save the model and tokenizer
MODEL_PATH = "./local_model"
MODEL_NAME = "microsoft/phi-2"
BATCH_SIZE = 1000

# Oppenheimer movie
#DATA_FILE = "./oppenheimer-2023.txt"
DATA_FILE = "./oppenheimer-2023.pdf"
EMBEDDINGS_FILE = './embeddings_movie.pkl'

def load_model(model_path, model_name, debug=False):
    # Check if the model and tokenizer are already saved locally
    if os.path.exists(model_path):
        if debug:
            print(f"Loading model and tokenizer from local directory: {model_path}")

        # Load the model and tokenizer from the local directory
        model = AutoModelForCausalLM.from_pretrained(model_path)
        tokenizer = AutoTokenizer.from_pretrained(model_path)
    else:
        if debug:
            print("Downloading model and tokenizer from Hugging Face")

        # Download the model and tokenizer from Hugging Face
        model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", trust_remote_code=True)
        tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

        if debug:
            print(f"Saving model and tokenizer to local directory: {model_path}")

        # Save the model and tokenizer locally
        model.save_pretrained(model_path)
        tokenizer.save_pretrained(model_path)

    if debug:
        print(f"Model device: {model.device}")
        print(f"CUDA available: {torch.cuda.is_available()}")

    # Set the default device to CUDA if available, otherwise use CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    return model, tokenizer, device

def save_embeddings(embeddings, passages, file_name):
    try:
        with open(file_name, 'wb') as f:
            pickle.dump((list(embeddings), list(passages)), f)
    except IOError:
        print(f"Error writing to file {file_name}.")
    except pickle.PicklingError:
        print("Error pickling embeddings and passages.")
    
def load_embeddings(file):
    try:
        with open(file, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        print(f"File {file} not found.")
        return None
    except pickle.UnpicklingError:
        print(f"Error unpickling file {file}.")
        return None

def read_pdf(file_path):
    # Uses the current PyPDF2 API (PdfReader, reader.pages, extract_text);
    # the older PdfFileReader/getPage/extractText names were removed in PyPDF2 3.0
    with open(file_path, 'rb') as pdf_file_obj:
        pdf_reader = PyPDF2.PdfReader(pdf_file_obj)

        num_pages = len(pdf_reader.pages)
        for page_obj in pdf_reader.pages:
            # Yield one passage per page; fall back to "" if no text is found
            yield page_obj.extract_text() or ""

    if DEBUG:
        print(f"Finished reading file. Number pages: {num_pages}")

def create_embeddings(file_path, tokenizer, model, device):
    if os.path.exists(EMBEDDINGS_FILE):
        # Load the embeddings and passages from disk
        embeddings, context_passages = load_embeddings(EMBEDDINGS_FILE)
    else:
        if file_path.endswith('.pdf'):
            context_passages = list(read_pdf(file_path))
        else:
            with open(file_path, 'r', encoding='utf-8') as file:
                context_passages = file.readlines()

        # Drop blank passages up front so embeddings and passages stay index-aligned
        context_passages = [p for p in context_passages if p.strip()]

        embeddings = []
        for passage in tqdm(context_passages):
            input_ids = tokenizer.encode(passage, return_tensors="pt", add_special_tokens=False, return_attention_mask=False).to(device)
            with torch.no_grad():
                output = model(input_ids)
                logits = output.logits
                embedding = logits.mean(dim=1).detach().cpu().numpy()
                embeddings.append(embedding)

        # Save the embeddings and passages to disk
        save_embeddings(embeddings, context_passages, EMBEDDINGS_FILE)

    return embeddings, context_passages
  
def handle_prompt(user_input, context)->str:
    prompt = f"Instruct: You are a helpful bot who only answers using the given context ONLY. If you cannot find the answer in the context reply 'Sorry don't have that detail'. Given the context '{context}', answer this:{user_input}\nOutput:"

    with torch.no_grad():
        inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False, add_special_tokens=False)
        inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}
        outputs = model.generate(**inputs, max_length=2000, pad_token_id=tokenizer.eos_token_id)
        
    text = tokenizer.batch_decode(outputs)[0]

    # Remove the prompt from the output text
    text = text.replace(prompt, '').strip()
    text = text.replace("<|endoftext|>", '').strip()
    return text

def create_index(query_embedding):
    # FAISS expects a contiguous 2D float32 array of shape (n, d),
    # so flatten the per-passage (1, d) embeddings into rows
    vectors = np.asarray(query_embedding, dtype=np.float32)
    vectors = np.ascontiguousarray(vectors.reshape(len(vectors), -1))

    # Scale every row to unit length, in place
    faiss.normalize_L2(vectors)

    index = faiss.IndexFlatL2(vectors.shape[1])  # Exact Euclidean (L2) search
    index.add(vectors)  # Add all embeddings in one call
    return index

def search_query(input_query, inputTokenizer, model, device, index, context_passages):
    # Given a new query context, encode it and perform similarity search
    input_ids = inputTokenizer.encode(input_query, return_tensors="pt", return_attention_mask=False, add_special_tokens=False).to(device)
    with torch.no_grad():
        input_ids = input_ids.long()
        output = model(input_ids)
        logits = output.logits
        query_embedding = logits.mean(dim=1).detach().cpu().numpy()

    # Perform similarity search - top 3 similar passages
    _, similar_indices = index.search(query_embedding, k=3)
    if DEBUG:
        print(f"DEBUG - Number of similar indices: {similar_indices.size}")    

    # Retrieve context passages based on similar_indices
    similar_contexts = [context_passages[i] for i in similar_indices[0]]

    # Concatenate the similar contexts into a single string
    context = ' '.join(similar_contexts)
    
    # Pass the concatenated context and query to the Phi-2 model
    answer = handle_prompt(input_query, context)
    print("Answer:", answer)

if __name__=="__main__":
    model, tokenizer, device = load_model(MODEL_PATH, MODEL_NAME, DEBUG)

     # Create embeddings and add to index before entering the loop
    query_embedding, context_passages = create_embeddings(DATA_FILE, tokenizer, model, device)
    query_embedding = np.array(query_embedding)
    index = create_index(query_embedding)
    
    exit_commands = ['exit', 'quit', 'q', 'e']
    while True:
        query = input("Enter your query or 'exit' to quit: ")
        if query in exit_commands:
            break
        search_query(query, tokenizer, model, device, index, context_passages)
        print("-"*50)

Now let us switch gears and try something that pushes the ability of Phi-2.

6. Code generation example using Phi-2

If we want to push the boundaries of what Phi-2 can do, we can use it to generate code. Below is an example of using Phi-2 to generate code for a simple C function. 🤓

🗒️Prompt: “Write a program in C that implements a BPE-based tokenizer; it should implement both encoding and decoding functions. Think through this step by step.”

The code we see below is what was generated. At face value, it looks like a good start but is incomplete. It is a good starting point for a developer to continue from and shows the power of SLMs like Phi-2.

#include 
#include
#define MAX_LEN 100

// Encodes the input text into an array of integers
int* encode(char *text, int *tokenizedStringsSize){

  // Allocate memory for the tokens list
  char *str = (char*)malloc(sizeof(char)*MAX_LEN+1);
  int i, j=0, k=0;

  for (i=0; i<strlen(text) && text[i] != '\0'; ++i){
    // Check if the current character is whitespace or not alphanumeric
    if(isspace(text[i]) || !isalpha(text[i])){
      // Add whitespace to the tokens list
      strcat(str,&text[i]); 
    } else {
      while (k<j && strcmp(text+i, str + k)==0){
        ++k; 
      }
      // Check if the token is of length atleast 1
      if(k-j >= 1)
        // If yes, concatenate it to tokens list
        strcat(str,&text[i]);
    }
  }

  // Output the tokenized string
  printf("Tokenized String: %s\n",str);

  // Allocate memory for tokenized string size array
  *tokenizedStringsSize = (int)malloc(sizeof(int)*(j-1));

  for (i=0; i<j; ++i){
    // Convert tokens to integer values
    (*tokenizedStringsSize)[i] = strtol(&str[i], NULL, 16);
  }

  return *tokenizedStringsSize;
}

// Decodes the input array of integers into a string
char* decode(int *tokensize, int *tokenizedStringsSize){

  // Declare memory for decoded string
  char str[MAX_LEN] = "";

  for (i=0; i<*tokensize; ++i) {
    strcat(str, "%x", tokensize[i]);
  }

  // Output the tokenized string
  printf("Tokenized String: %s\n",str);

  // Declare memory for decoded string
  char *decodedStr = (char*)malloc(sizeof(char)*(*tokenizedStringsSize)+1);
  for (i=0; i<*tokensize; ++i) {
    // Convert integer values to hexadecimal strings
    sprintf(&decodedStr[i], "%x", tokensize[i]);
  }

  return decodedStr;
}

⚠️ Note: It has been a while since I wrote C, but at a high level, these are some of the issues I can see with this; these issues are off the top of my head and are not exhaustive. Finally, it is not meant to test my coding capabilities. 😬

Memory Allocation: The malloc function is used without checking for successful allocation; if it returns NULL, which is not checked, we will get hurt.
Tokenization Logic: The logic in the encode function does not reflect the BPE algorithm, which involves merging the most frequent pairs of characters or bytes.
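String Concatenation: The strcat function is used incorrectly; &text[i] points into the input string, so each call appends the entire remainder of the text rather than a single character, and the destination buffer is never initialized to an empty string before the first call.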
Decoding Logic: The decode function attempts to use strcat with a format string ("%x"), which is invalid. The sprintf function should be used for formatted strings.
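
For contrast with the generated C, here is a minimal sketch of what the BPE idea looks like, written in Python to match the rest of this post. It is a character-level toy under stated assumptions: tokens stay as strings rather than integer IDs, the function names are my own, and there is none of the pre-tokenization or byte handling that a production tokenizer has. The point is only to show the "merge the most frequent adjacent pair" loop that the generated code is missing.

```python
from collections import Counter

def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of pair with the fused token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(text, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        if pairs[best] < 2:
            break  # nothing repeats any more, so further merges add no value
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges

def encode(text, merges):
    """Split into characters, then replay the learned merges in order."""
    tokens = list(text)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens

def decode(tokens):
    """Tokens are plain substrings here, so decoding is concatenation."""
    return "".join(tokens)

text = "low lower lowest"
merges = train_bpe(text, num_merges=3)
encoded = encode(text, merges)
print(merges)
print(encoded)
print(decode(encoded) == text)
```

Decoding is trivial in this toy because each token carries its own text; with integer IDs you would keep a vocabulary table and look each ID up before concatenating.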

Hopefully, this gives you a good understanding of SLMs, specifically Phi-2, and how to use them locally. 😍

SLMs - How to run Phi-2 Locally, and implement RAG

Wed, 13 Mar 2024 00:00:00 +0000

1. What are Small Language Models (SLMs)?

2. What is Phi-2?

Phi-2 Evaluation

2.1 Phi-2’s Capabilities

What better way to show this than with some prompts that Phi-2 can handle:

Creative Writing:
- 🗒️Prompt: “Write a short story about a time-traveling computer geek who discovers a secret algorithm book with the power to change computing history.”
- 🤖 Phi2-Generated output: “In the dimly lit archives of the ancient library, Amelia stumbled upon a dusty tome unlike any other. Its pages whispered secrets of forgotten civilizations and whispered promises of altered destinies. As she traced her fingers over the delicate script, the room seemed to blur, and suddenly, she was standing in the heart of Renaissance Florence. The air smelled of ink and ambition. The book had worked its magic, and Amelia knew she held the fate of empires in her hands.”
Philosophical Reflection:
- 🗒️Prompt: “Discuss the concept of free will versus determinism.”
- 🤖 Phi2-Generated output: “Free will and determinism have long been at odds in philosophical discourse. Some argue that our choices are predetermined by external factors—genetics, upbringing, societal norms—while others champion the idea that we possess true agency. Perhaps the truth lies in the delicate dance between the two. We navigate life’s labyrinth, sometimes guided by fate, other times forging our own path.”
Educational Explanation:
- 🗒️Prompt: “Explain the process of photosynthesis in simple terms.”
- 🤖 Phi2-Generated output: “Photosynthesis, the magical dance of plants and sunlight, is how they create their own food. Imagine tiny chlorophyll factories in leaves, capturing sunlight like solar panels. They convert that energy into chemical fuel—glucose—while releasing oxygen as a byproduct. So next time you breathe in that fresh air, thank a leaf for its photosynthetic prowess!”
Humorous Scenario:
- 🗒️Prompt: “Describe a day in the life of a mischievous squirrel.”
- 🤖 Phi2-Generated output: “Meet Nutty, the squirrel extraordinaire. His mornings begin with acorn heists, followed by elaborate treehouse construction. He’s the neighborhood’s unofficial alarm clock, chattering loudly at dawn. But his pièce de résistance? Prank-calling the crows, convincing them they’ve won a lifetime supply of shiny objects. Nutty’s motto: ‘Life’s too short not to be a little nuts.’”

3. Small Language Models (SLMs) vs Large Language Models (LLMs)

3.1 When to use SLM vs LLM?

4. Running Phi-2 locally

The full code is below, but here are the key aspects to grok when running Phi-2 locally.

The key is to use the AutoModelForCausalLM and AutoTokenizer classes from the transformers library to load the Phi-2 model and tokenizer.
We then use the generate method to generate text based on a user prompt. The generate method takes the tokenized user prompt as input and returns the generated token IDs, which the tokenizer decodes back into text.
We use the from_pretrained method to load the model and tokenizer from the Hugging Face model hub.
We also use the save_pretrained method to save the model and tokenizer to a local directory. This allows us to load the model and tokenizer from the local directory if they are already saved, which can help save time and resources.

The following code snippet is what loads the model and the tokenizer:

# Download the model and tokenizer from Hugging Face
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2",
                                            torch_dtype="auto",
                                            trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2",
                                         trust_remote_code=True)

inputs = tokenizer(prompt,
                   return_tensors="pt",
                   return_attention_mask=False,
                   add_special_tokens=False)
outputs = model.generate(**inputs,
                         max_length=500,
                         pad_token_id=tokenizer.eos_token_id)
text = tokenizer.batch_decode(outputs)[0]
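
The snippet above assumes a prompt variable already exists, and the decoded output of a causal model includes the prompt followed by the completion. As a rough sketch of the string handling around the model call, the helpers below build a prompt in the Instruct/Output format used by the full listing and strip the prompt and end-of-text marker from a decoded string. The helper names and the sample decoded text are hypothetical; no model is loaded here.

```python
END_OF_TEXT = "<|endoftext|>"

def build_prompt(type_of_text, topic):
    """Format a request in the Instruct/Output style used in the full listing."""
    return f"Instruct: Write a {type_of_text} about {topic}\nOutput:"

def clean_output(decoded, prompt):
    """Remove the echoed prompt and the end-of-text marker from decoded text."""
    text = decoded.replace(prompt, "").strip()
    return text.replace(END_OF_TEXT, "").strip()

prompt = build_prompt("haiku", "pandas")

# Hypothetical decoded string, standing in for tokenizer.batch_decode(outputs)[0]
decoded = prompt + " Bamboo in the mist" + END_OF_TEXT
answer = clean_output(decoded, prompt)
print(answer)
```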

The full code is below. The code is a simple console chat example that runs locally. The user can generate a story, a haiku, or a joke on a topic of their choice.

Some examples of what Phi-2 can generate using the above code are shown below. The first is a story about pandas and dogs.

Story about Pandas 🐼 and Dogs 🐶

Here is another example of a Haiku and a Joke generated by Phi-2 on Pandas.

Haiku and Joke about Pandas 🐼

Switching gears, let’s look at how we can implement RAG using Phi-2.

4.1 Running Phi-2 Locally - Full Code

The following code is the complete code that executes the examples we showed before for running Phi-2 locally. This can work on a CPU, but it is very slow, and a good GPU is strongly suggested.

import os
import warnings
import logging
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DEBUG = False

# Suppress warnings and set the logging level to ERROR
warnings.filterwarnings("ignore")
logging.getLogger("transformers").setLevel(logging.ERROR)

# Define the directory where you want to save the model and tokenizer
MODEL_PATH = "./local_model"

# Check if the model and tokenizer are already saved locally
if os.path.exists(MODEL_PATH):
    if DEBUG:
        print(f"Loading model and tokenizer from local directory: {MODEL_PATH}")

    # Load the model and tokenizer from the local directory
    model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
else:
    if DEBUG:
        print(f"Downloading model and tokenizer from Hugging Face")

    # Download the model and tokenizer from Hugging Face
    model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

    if DEBUG:
        print(f"Saving model and tokenizer to local directory: {MODEL_PATH}")

    # Save the model and tokenizer locally
    model.save_pretrained(MODEL_PATH)
    tokenizer.save_pretrained(MODEL_PATH)

if DEBUG:
    print(f"Model device: {model.device}")
    print(f"CUDA available: {torch.cuda.is_available()}")

# Set the default device to CUDA if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def handle_prompt(user_input, type_of_text)->str:
    prompt = f"Instruct: Write a {type_of_text} about {user_input}\nOutput:"

    with torch.no_grad():
        inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False, add_special_tokens=False)
        inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}
        outputs = model.generate(**inputs, max_length=500, pad_token_id=tokenizer.eos_token_id)
        
    text = tokenizer.batch_decode(outputs)[0]

    # Remove the prompt from the output text
    text = text.replace(prompt, '').strip()
    text = text.replace("<|endoftext|>", '').strip()
    
    if DEBUG:
        print(f"Answer:{text}")

    return text

if __name__=="__main__": 
    while True:
        print("First What would you like to write today?")
        print("1. Story 📝")
        print("2. Haiku ✍️")
        print("3. Joke 😆")
        print("4. Quit 👋")
        user_choice = input("Choose an option:")

        if user_choice == '4':
            break

        user_prompt = input("And on which topic:")

        if not user_prompt.strip():
            print("Input cannot be empty or consist only of spaces.")
            continue

        if user_choice == '1':
            print(handle_prompt(user_prompt, 'story'))
        elif user_choice == '2':
            print(handle_prompt(user_prompt, 'haiku'))
        elif user_choice == '3':
            print(handle_prompt(user_prompt, 'joke'))
        else:
            print("Invalid choice. Please choose a valid option.")

        print("_" * 100)

5. Implementing Retrieval-Augmented Generation (RAG) with Phi-2

At a high level, the process of implementing RAG involves the following steps:

Generate Embeddings with Phi-2:
- Use Phi-2 to encode your context passages (documents) and extract their embeddings.
- These embeddings will represent the semantic content of each passage.
Create a Vector Index:
- Choose a vector index library or framework (such as Faiss, Annoy, or HNSW).
- Initialize an index structure to store the embeddings efficiently.
- Add the generated embeddings to the index.
Save Embeddings to a Local Vector Database:
- Create a local database to store the embeddings.
- For each context passage, save its corresponding embedding in the database.
- You can use the passage ID or a unique identifier as the key for retrieval.
Perform Similarity Search:
- When you receive a new context (query), encode it using Phi-2 to obtain its embedding.
- Use the vector index to perform a similarity search against the saved embeddings.
- Retrieve the most similar context passages based on cosine similarity or another distance metric.
- Return the relevant passages as results.
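
The steps above can be sketched end to end without a model or a vector library. In this toy version, a bag-of-words vector stands in for the Phi-2 embedding, a dictionary keyed by passage ID stands in for the index, and a pickle file stands in for the local database. All names and passages are made up for illustration; the real implementation later in this post uses Phi-2 and FAISS instead.

```python
import math
import os
import pickle
import tempfile
from collections import Counter

def toy_embed(text):
    """Stand-in for the Phi-2 embedding step: a unit-length bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def similarity(a, b):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())

def build_store(passages):
    """Embed each passage and keep (embedding, text) under its passage ID."""
    return {pid: (toy_embed(text), text) for pid, text in passages.items()}

def search(store, query, k=2):
    """Embed the query and return the k most similar passages."""
    q = toy_embed(query)
    ranked = sorted(store, key=lambda pid: similarity(q, store[pid][0]), reverse=True)
    return [store[pid][1] for pid in ranked[:k]]

# Hypothetical passages keyed by a passage ID
passages = {
    "p0": "the trinity test took place in new mexico",
    "p1": "pandas eat bamboo all day",
    "p2": "squirrels hide acorns in trees",
}

# Save the embeddings to a local file, then load them back before searching
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.pkl")
    with open(path, "wb") as f:
        pickle.dump(build_store(passages), f)
    with open(path, "rb") as f:
        store = pickle.load(f)

results = search(store, "where was the trinity test", k=1)
print(results[0])
```

Saving the store means the expensive embedding step runs once; later runs only need to embed the query.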

5.1 Loading data for RAG and Phi-2

def read_pdf(file_path):
    # Uses the current PyPDF2 API (PdfReader, reader.pages, extract_text);
    # the older PdfFileReader/getPage/extractText names were removed in PyPDF2 3.0
    with open(file_path, 'rb') as pdf_file_obj:
        pdf_reader = PyPDF2.PdfReader(pdf_file_obj)

        for page_obj in pdf_reader.pages:
            # Yield one passage per page; fall back to "" if no text is found
            yield page_obj.extract_text() or ""

5.2 Generate embeddings using Phi-2

def create_embeddings(file_path, tokenizer, model, device):
    if file_path.endswith('.pdf'):
        context_passages = list(read_pdf(file_path))
    else:
        with open(file_path, 'r', encoding='utf-8') as file:
            context_passages = file.readlines()

    # Drop blank passages up front so embeddings and passages stay index-aligned
    context_passages = [p for p in context_passages if p.strip()]

    embeddings = []
    for passage in tqdm(context_passages):
        input_ids = tokenizer.encode(passage, 
                                    return_tensors="pt",
                                    add_special_tokens=False,
                                    return_attention_mask=False).to(device)
        with torch.no_grad():
            output = model(input_ids)
            logits = output.logits
            embedding = logits.mean(dim=1).detach().cpu().numpy()
            embeddings.append(embedding)

    return embeddings, context_passages

Here are a few things that are going on:

Given that we are using this for inference and not training, we wrap the forward pass in torch.no_grad(), which tells PyTorch not to track or compute gradients for the code inside the block. This reduces the amount of memory needed.
Inside this block, the input_ids are fed into the model, and the output is stored in the output variable. The logits, which are the raw, unnormalized scores outputted by the last layer of the model, are then extracted from the model’s output.
The logits are then processed to generate the embedding for the passage. The .mean(dim=1) method calculates the mean of the logits along dimension 1, which typically represents the sequence length in a language model.
The .detach() method detaches the result from the computation graph so that no gradients will be backpropagated along this variable.
The .cpu() method moves the tensor to the CPU if it’s not already there. Finally, the tensor is converted to a numpy array using the .numpy() method.
The resulting embedding is then appended to the embeddings list, which contains the embeddings for all the passages.
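
To picture what .mean(dim=1) does, here is a small plain-Python sketch of mean pooling: the model returns one vector of scores per token, and averaging across the token axis collapses them into a single fixed-size vector for the whole passage. The numbers are invented, and mean_pool is a hypothetical helper; in the real code the tensor has shape (batch, sequence length, vocabulary size) and PyTorch does the averaging. Note that pooling the logits is the choice made in this post; pooling hidden states is another common option.

```python
def mean_pool(token_vectors):
    """Average per-token vectors (seq_len x dim) into one vector of length dim."""
    seq_len = len(token_vectors)
    return [sum(column) / seq_len for column in zip(*token_vectors)]

# Hypothetical scores for a 3-token passage with 4 values per token
token_vectors = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]

embedding = mean_pool(token_vectors)
print(embedding)  # [2.0, 1.0, 2.0, 2.0]
```

The output length depends only on the per-token dimension, so passages of different lengths still produce vectors that can live in the same index.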

5.3 Creating Vector Index

def create_index(query_embedding):
    # FAISS expects a contiguous 2D float32 array of shape (n, d),
    # so flatten the per-passage (1, d) embeddings into rows
    vectors = np.asarray(query_embedding, dtype=np.float32)
    vectors = np.ascontiguousarray(vectors.reshape(len(vectors), -1))

    # Scale every row to unit length, in place
    faiss.normalize_L2(vectors)

    index = faiss.IndexFlatL2(vectors.shape[1])  # Exact Euclidean (L2) search
    index.add(vectors)  # Add all embeddings in one call
    return index

The function finally returns the created index. This index can then be used to perform efficient similarity searches.

5.4 Perform Similarity Search

def search_query(input_query, inputTokenizer, model, device, index, context_passages):
    # Given a new query context, encode it and perform similarity search
    input_ids = inputTokenizer.encode(input_query,
                                    return_tensors="pt",
                                    return_attention_mask=False,
                                    add_special_tokens=False).to(device)
    with torch.no_grad():
        input_ids = input_ids.long()
        output = model(input_ids)
        logits = output.logits
        query_embedding = logits.mean(dim=1).detach().cpu().numpy()

    # Perform similarity search - top 3 similar passages
    _, similar_indices = index.search(query_embedding, k=3)
    
    # Retrieve context passages based on similar_indices
    similar_contexts = [context_passages[i] for i in similar_indices[0]]

    # Concatenate the similar contexts into a single string
    context = ' '.join(similar_contexts)

Example 1 - Phi-2 and RAG implementation

Example 2 - Phi-2 and RAG implementation

Example 3 - Phi-2 and RAG implementation

Now that we have seen the different elements, the code below brings everything together as a console app that one can run. The Oppenheimer script (PDF file) you need can be downloaded from here.

import os
import warnings
import logging
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import numpy as np
import faiss 
from tqdm import tqdm
import pickle
import re
import PyPDF2

DEBUG = False

warnings.filterwarnings("ignore")
logging.getLogger("transformers").setLevel(logging.ERROR)

# Define the directory where you want to save the model and tokenizer
MODEL_PATH = "./local_model"
MODEL_NAME = "microsoft/phi-2"
BATCH_SIZE = 1000

# Oppenheimer movie
#DATA_FILE = "./oppenheimer-2023.txt"
DATA_FILE = "./oppenheimer-2023.pdf"
EMBEDDINGS_FILE = './embeddings_movie.pkl'

def load_model(model_path, model_name, debug=False):
    # Check if the model and tokenizer are already saved locally
    if os.path.exists(model_path):
        if debug:
            print(f"Loading model and tokenizer from local directory: {model_path}")

        # Load the model and tokenizer from the local directory
        model = AutoModelForCausalLM.from_pretrained(model_path)
        tokenizer = AutoTokenizer.from_pretrained(model_path)
    else:
        if debug:
            print("Downloading model and tokenizer from Hugging Face")

        # Download the model and tokenizer from Hugging Face
        model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", trust_remote_code=True)
        tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

        if debug:
            print(f"Saving model and tokenizer to local directory: {model_path}")

        # Save the model and tokenizer locally
        model.save_pretrained(model_path)
        tokenizer.save_pretrained(model_path)

    if debug:
        print(f"Model device: {model.device}")
        print(f"CUDA available: {torch.cuda.is_available()}")

    # Set the default device to CUDA if available, otherwise use CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    return model, tokenizer, device

def save_embeddings(embeddings, passages, file_name):
    try:
        with open(file_name, 'wb') as f:
            pickle.dump((list(embeddings), list(passages)), f)
    except IOError:
        print(f"Error writing to file {file_name}.")
    except pickle.PicklingError:
        print("Error pickling embeddings and passages.")
    
def load_embeddings(file):
    try:
        with open(file, 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        print(f"File {file} not found.")
        return None
    except pickle.UnpicklingError:
        print(f"Error unpickling file {file}.")
        return None

def read_pdf(file_path):
    # Uses the current PyPDF2 API (PdfReader, reader.pages, extract_text);
    # the older PdfFileReader/getPage/extractText names were removed in PyPDF2 3.0
    with open(file_path, 'rb') as pdf_file_obj:
        pdf_reader = PyPDF2.PdfReader(pdf_file_obj)

        num_pages = len(pdf_reader.pages)
        for page_obj in pdf_reader.pages:
            # Yield one passage per page; fall back to "" if no text is found
            yield page_obj.extract_text() or ""

    if DEBUG:
        print(f"Finished reading file. Number pages: {num_pages}")

def create_embeddings(file_path, tokenizer, model, device):
    if os.path.exists(EMBEDDINGS_FILE):
        # Load the embeddings and passages from disk
        embeddings, context_passages = load_embeddings(EMBEDDINGS_FILE)
    else:
        if file_path.endswith('.pdf'):
            context_passages = list(read_pdf(file_path))
        else:
            with open(file_path, 'r', encoding='utf-8') as file:
                context_passages = file.readlines()

        # Drop blank passages up front so embeddings and passages stay index-aligned
        context_passages = [p for p in context_passages if p.strip()]

        embeddings = []
        for passage in tqdm(context_passages):
            input_ids = tokenizer.encode(passage, return_tensors="pt", add_special_tokens=False, return_attention_mask=False).to(device)
            with torch.no_grad():
                output = model(input_ids)
                logits = output.logits
                embedding = logits.mean(dim=1).detach().cpu().numpy()
                embeddings.append(embedding)

        # Save the embeddings and passages to disk
        save_embeddings(embeddings, context_passages, EMBEDDINGS_FILE)

    return embeddings, context_passages
  
def handle_prompt(user_input, context)->str:
    prompt = f"Instruct: You are a helpful bot who only answers using the given context ONLY. If you cannot find the answer in the context reply 'Sorry don't have that detail'. Given the context '{context}', answer this:{user_input}\nOutput:"

    with torch.no_grad():
        inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False, add_special_tokens=False)
        inputs = {name: tensor.to(model.device) for name, tensor in inputs.items()}
        outputs = model.generate(**inputs, max_length=2000, pad_token_id=tokenizer.eos_token_id)
        
    text = tokenizer.batch_decode(outputs)[0]

    # Remove the prompt from the output text
    text = text.replace(prompt, '').strip()
    text = text.replace("<|endoftext|>", '').strip()
    return text

def create_index(query_embedding):
    index = faiss.IndexFlatL2(query_embedding[0].shape[1])  # Euclidean distance
    faiss.normalize_L2(query_embedding)
    
    # Add embeddings to the index
    for i, item in tqdm(enumerate(query_embedding), total=len(query_embedding)):
        if item.ndim == 1:
            item = item.reshape(1, -1)  # Reshape 1D array to 2D
        index.add(item)
    return index

def search_query(input_query, inputTokenizer, model, device, index, context_passages):
    # Given a new query context, encode it and perform similarity search
    input_ids = inputTokenizer.encode(input_query, return_tensors="pt", return_attention_mask=False, add_special_tokens=False).to(device)
    with torch.no_grad():
        input_ids = input_ids.long()
        output = model(input_ids)
        logits = output.logits
        query_embedding = logits.mean(dim=1).detach().cpu().numpy()

    # Perform similarity search - top 3 similar passages
    _, similar_indices = index.search(query_embedding, k=3)
    if DEBUG:
        print(f"DEBUG - Number of similar indices: {similar_indices.size}")    

    # Retrieve context passages based on similar_indices
    similar_contexts = [context_passages[i] for i in similar_indices[0]]

    # Concatenate the similar contexts into a single string
    context = ' '.join(similar_contexts)
    
    # Pass the concatenated context and query to the Phi-2 model
    answer = handle_prompt(input_query, context)
    print("Answer:", answer)

if __name__=="__main__":
    model, tokenizer, device = load_model(MODEL_PATH, MODEL_NAME, DEBUG)

     # Create embeddings and add to index before entering the loop
    query_embedding, context_passages = create_embeddings(DATA_FILE, tokenizer, model, device)
    query_embedding = np.array(query_embedding)
    index = create_index(query_embedding)
    
    exit_commands = ['exit', 'quit', 'q', 'e']
    while True:
        query = input("Enter your query or 'exit' to quit: ")
        if query in exit_commands:
            break
        search_query(query, tokenizer, model, device, index, context_passages)
        print("-"*50)
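To make the retrieval step above more concrete, here is a minimal pure-Python sketch of what a flat L2 index does: normalize the vectors, measure the squared distance from the query to every stored vector, and return the k closest. The names `normalize` and `top_k` and the toy 2-D vectors are illustrative assumptions, not part of the script above; FAISS performs the same brute-force search far more efficiently.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def top_k(query, stored_vectors, k=3):
    """Return the indices of the k stored vectors closest to the query (squared L2)."""
    q = normalize(query)
    distances = []
    for i, vec in enumerate(stored_vectors):
        v = normalize(vec)
        distances.append((sum((a - b) ** 2 for a, b in zip(q, v)), i))
    return [i for _, i in sorted(distances)[:k]]

# Toy 2-D "embeddings" standing in for the real model outputs
stored = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
print(top_k([1.0, 0.05], stored, k=2))
```

The sketch normalizes the query as well as the stored vectors; with unit-length vectors on both sides, ranking by L2 distance gives the same order as ranking by cosine similarity.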

Now let us switch gears and try something that pushes the ability of Phi-2.

6. Code generation example using Phi-2

If we want to push the boundaries of what Phi-2 can do, we can use it to generate code. Below is an example of using Phi-2 to generate code for a simple C function. 🤓

🗒️Prompt: “Write a program in C that implements a BPE-based tokenizer; it should implement both encoding and decoding functions. Think through this step by step.”

#include 
#include
#define MAX_LEN 100

// Encodes the input text into an array of integers
int* encode(char *text, int *tokenizedStringsSize){

  // Allocate memory for the tokens list
  char *str = (char*)malloc(sizeof(char)*MAX_LEN+1);
  int i, j=0, k=0;

  for (i=0; i<strlen(text) && text[i] != '\0'; ++i){
    // Check if the current character is whitespace or not alphanumeric
    if(isspace(text[i]) || !isalpha(text[i])){
      // Add whitespace to the tokens list
      strcat(str,&text[i]); 
    } else {
      while (k<j && strcmp(text+i, str + k)==0){
        ++k; 
      }
      // Check if the token is of length atleast 1
      if(k-j >= 1)
        // If yes, concatenate it to tokens list
        strcat(str,&text[i]);
    }
  }

  // Output the tokenized string
  printf("Tokenized String: %s\n",str);

  // Allocate memory for tokenized string size array
  *tokenizedStringsSize = (int)malloc(sizeof(int)*(j-1));

  for (i=0; i<j; ++i){
    // Convert tokens to integer values
    (*tokenizedStringsSize)[i] = strtol(&str[i], NULL, 16);
  }

  return *tokenizedStringsSize;
}

// Decodes the input array of integers into a string
char* decode(int *tokensize, int *tokenizedStringsSize){

  // Declare memory for decoded string
  char str[MAX_LEN] = "";

  for (i=0; i<*tokensize; ++i) {
    strcat(str, "%x", tokensize[i]);
  }

  // Output the tokenized string
  printf("Tokenized String: %s\n",str);

  // Declare memory for decoded string
  char *decodedStr = (char*)malloc(sizeof(char)*(*tokenizedStringsSize)+1);
  for (i=0; i<*tokensize; ++i) {
    // Convert integer values to hexadecimal strings
    sprintf(&decodedStr[i], "%x", tokensize[i]);
  }

  return decodedStr;
}

The generated code looks plausible at first glance, but it does not hold up:

Memory Allocation: The result of malloc is never checked; if an allocation fails and returns NULL, the code goes on to use a null pointer.
Tokenization Logic: The logic in the encode function does not reflect the BPE algorithm, which repeatedly merges the most frequent pair of adjacent characters or bytes.
String Concatenation: The strcat function is used incorrectly; passing &text[i] appends the entire remainder of the input rather than a single character, and the destination buffer is never initialized.
Decoding Logic: The decode function calls strcat with a format string ("%x") and an extra argument, which is invalid. The sprintf function should be used for formatted strings.
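For contrast, here is a minimal sketch of the pair-merging idea behind BPE, written in Python rather than C to keep it short. It is not Phi-2's output and not a production tokenizer: the function names are my own, tokens are kept as strings instead of integer ids, and ties between equally frequent pairs are broken by first occurrence.

```python
from collections import Counter

def apply_merge(tokens, pair):
    """Replace each non-overlapping occurrence of pair with the fused token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(text, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # the first pair seen wins ties
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges

def encode(text, merges):
    """Split into characters, then replay the learned merges in order."""
    tokens = list(text)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens

def decode(tokens):
    """Decoding is just concatenation because tokens are plain strings here."""
    return "".join(tokens)

merges = train_bpe("low lower lowest", num_merges=3)
print(merges)
print(encode("lowest low", merges))
```

A real tokenizer would also map each token to an integer id and usually works on bytes, but the encode and decode round trip follows the same shape.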

Hopefully, this gives you a good understanding of SLMs, specifically Phi-2, and how to use them locally. 😍

Shedding Light on the Art of Prompt Engineering

Wed, 28 Feb 2024 00:00:00 +0000

How many prompt engineers does it take to change a light bulb? Just one, but first, they need to fine-tune the model to make sure the AI doesn’t end up writing a poem about darkness instead.

DALLE generated image of How many engineers it take to change a light bulb

📚 My new book "Generative AI in Action"

Tue, 14 Nov 2023 00:00:00 +0000

🌐 As software continues to revolutionize the world, the advent of Generative AI is transforming the very fabric of software itself. My latest book, Generative AI in Action delves into this transformative journey.

I am thrilled to announce the early release of my latest book, Generative AI in Action, now available through the Manning Early Access Program (MEAP). This publication is a deep dive into the cutting-edge world of #GenerativeAI, #LLMs, #OpenAI, and #Azure #OpenAI, tailored specifically for enterprises. 🤘

This practical, in-action, hands-on book allows one to explore the cutting-edge world of Generative AI, including LLMs, and covers both OpenAI and Azure OpenAI, allowing companies to understand basic concepts and scale to production.📘

I have the privilege of a front-row seat as we build this tech out and work with key Fortune 500 customers who are incorporating it. Much of the book captures what I have learned along the way. 😇

📖 Inside the Book

A comprehensive introduction to Generative AI, including foundational models like GPT, Codex, DALLE, and ChatGPT.
Insightful discussions on Large Language Models (LLMs) and their applications in various sectors.
Practical guides on generating text through APIs, focusing on OpenAI and Azure OpenAI.
Exploration of image generation techniques, including Stable Diffusion and DALLE.
Deep dive into Prompt Engineering, RAG, Bring your own Data, and model adaptation techniques.
And finally, best practices to allow enterprises to scale to production.

🎯 Why This Book is a Must-Read? It is specially crafted for businesses leveraging AI for innovation and competitive advantage. Combining technical depth with practical applications makes it a valuable asset for decision-makers, architects, data scientists, developers, and AI enthusiasts.

🌟 Highlights:

Real-world case studies and applications.
Expert guidance on model adaptation and the art of prompt engineering.
Comprehensive insights into the challenges and best practices for deploying AI in enterprise settings.
An essential discussion on AI’s ethical dimensions, safety, and security concerns.

"Generative AI in Action" is more than just a book; it’s a roadmap for harnessing the potential of AI to redefine the business landscape.

Ready to get hands-on and deeper into Generative AI? You can use the code “mlbahree” to get 45% off – valid until Nov 28th! 😍

🔗 You can get the book and the discount from here –> http://mng.bz/D9By 🐼

💽 The GitHub repo with the code from the book can be found at –> https://bit.ly/GenAIBook . 💾

Generative AI in Action - MEAP

Thank you in advance, and I would love to hear your feedback.

With gratitude, 💚

Amit.

#GenerativeAI #LLMs #Azure #OpenAI #Microsoft #LearnwithManning #ManningBooks

OpenAI's Whisper speech model - an overview

Tue, 28 Feb 2023 00:00:00 +0000

What is Whisper from OpenAI?

Whisper is an automatic speech recognition (ASR) model from OpenAI. It is a multi-task model, so in addition to speech recognition it can also do language identification and speech translation across a number of languages. The model is open source and comes in five sizes. Four of these have an English-only variant, which seems to perform better if one only needs English. The model is also robust to accents, background noise, and technical language. Whisper achieves near-SOTA performance with zero-shot translation from multiple languages to English.

Model Characteristics

The model was trained with weak supervision on a large, noisy corpus: ~680K hours of audio and corresponding transcripts. Of this, ~438K hours (65%) is English-only (both audio and transcripts); ~126K hours (18%) is non-English audio with English transcripts; and ~117K hours (17%) is non-English audio with non-English transcripts, covering 98 languages.

The model is available in multiple sizes as called out and the table below outlines these model characteristics.

Size	Parameters	English-only model	Multilingual model	VRAM needed	Relative speed
tiny	39 M	`tiny.en`	`tiny`	~1 GB	32x
base	74 M	`base.en`	`base`	~1 GB	16x
small	244 M	`small.en`	`small`	~2 GB	6x
medium	769 M	`medium.en`	`medium`	~5 GB	2x
large	1.55 B	n/a	`large`	~10 GB	1x

Whisper supports transcription and translation across 98 languages, but it performs best when sticking with English. One needs to be careful with the multilingual models: when the transcript is not in the same language as the audio, the output can contain hallucinations. The large model has a word error rate (WER) of 0.12 for English, 0.18 for Spanish, 0.23 for French, 0.25 for German, and 0.28 for Mandarin. However, some languages with less coverage have a much higher WER, e.g., Arabic (0.79), Hindi (0.86), and Swahili (1.00).
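For readers new to the metric, WER is the word-level edit distance between the reference transcript and the model's output, divided by the number of words in the reference. The sketch below is a minimal illustration with a made-up sentence pair; it is not the evaluation code behind the numbers quoted above, which typically also involves normalizing text before scoring.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One substitution (sat -> sit) and one deletion (the) over six reference words
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))
```

Because insertions count as errors too, WER is not capped at 1.00; a score at or near 1.00 means roughly as many errors as there are words in the reference.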

Whisper ASR Architecture

As is typical for language-based models, Whisper uses a seq-to-seq (transformer encoder-decoder) architecture, where the input is a sequence of audio frames (in 30-second segments) and the output is a sequence of text. Whisper is best suited to “audio to text” use cases. It is not suited to “text to audio” (i.e., TTS, text to speech) or speech synthesis, as it is not trained for those tasks. And finally, Whisper cannot be used for real-time speech applications and is best used for batch processing.
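To give a feel for the fixed 30-second windows mentioned above, here is a small sketch that splits a list of samples into 30-second segments and zero-pads the last one. The 16 kHz sample rate and the helper name `split_into_segments` are assumptions for illustration; this is not Whisper's own preprocessing code, which goes on to compute spectrogram features for each window.

```python
SAMPLE_RATE = 16_000   # samples per second (assumed for this sketch)
SEGMENT_SECONDS = 30   # window length described above

def split_into_segments(samples):
    """Split audio samples into fixed 30-second windows, zero-padding the last one."""
    window = SAMPLE_RATE * SEGMENT_SECONDS
    segments = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        chunk = chunk + [0.0] * (window - len(chunk))  # pad the final, shorter chunk
        segments.append(chunk)
    return segments

audio = [0.1] * (SAMPLE_RATE * 70)  # 70 seconds of dummy audio
segments = split_into_segments(audio)
print(len(segments), len(segments[-1]))
```

Fixed-size windows are part of why the model fits batch processing better than live streaming: there is a whole segment to fill before anything is decoded.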

The figure below shows the Whisper ASR architecture (image credit: OpenAI). The transformer model is trained on many different speech-related tasks, including speech recognition, language identification, and voice activity detection. These tasks are all represented as a sequence of tokens for the decoder to predict, which greatly simplifies things by allowing one model to replace many stages of a traditional speech processing pipeline.

Whisper ASR Architecture

What I think is really interesting about the Whisper model is that it is trained using weak supervision. OpenAI took a different approach to speech recognition and did not use the typical self-supervision or self-training techniques that have been a mainstay of recent large-scale speech recognition work. I believe this is what makes the model so robust and able to handle accents, background noise, and technical language. OpenAI trained Whisper to predict the raw text of transcripts, using the expressiveness provided by the seq-to-seq implementation to learn the mapping between utterances and their transcripts. All of this allows a simpler pipeline.

More details on the speech recognition model can be found in the OpenAI Whisper paper, Robust Speech Recognition via Large-Scale Weak Supervision.

What is Weak Supervision?

As a side note, I am quite excited to see how OpenAI is using weak supervision to scale and get better results. The following quote from their paper speaks for itself.

Our work suggests that simple scaling of weakly supervised pre-training has been underappreciated so far for speech recognition. We achieve these results without the need for the self-supervision or self-training techniques that have been a mainstay of recent large-scale speech recognition work.

All is well and good, but what is Weak Supervision?

As I called out in my book Practical Weak Supervision: Doing More with Less Data 📖 : Weak supervision is a broad collection of techniques in machine learning where models are trained using sources of information that are easier to provide than hand-labeled data, where this information is incomplete, inexact, or otherwise less accurate. Instead of hand-labeling high-quality data, all of which is very cost-prohibitive, we can use other techniques that combine diverse sources of data, creating an approximation of labels. Using weak supervision, we can reconcile these labels to a single label.

Weak supervision enables these noisy, weakly sourced labels to be combined programmatically to form the training data that can be used to train a model. Labels are considered “weak” because they are noisy—i.e., the data measurements that the labels represent are inaccurate and have a margin of error.
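As a toy illustration of reconciling noisy sources into a single label, the sketch below combines three hypothetical labeling heuristics with a simple majority vote. The heuristics, the labels, and the function names are made up for this example; weighting each source by its estimated accuracy instead of counting every vote equally is a common refinement.

```python
from collections import Counter

ABSTAIN = None  # a source returns this when it has no opinion

def majority_vote(votes):
    """Reconcile noisy votes into one label; ties and all-abstain rows stay unlabeled."""
    counts = Counter(v for v in votes if v is not ABSTAIN)
    ranked = counts.most_common(2)
    if not ranked:
        return ABSTAIN
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

# Hypothetical labeling heuristics: cheap to write, individually unreliable
def lf_keyword(text):
    return "spam" if "free money" in text.lower() else ABSTAIN

def lf_length(text):
    return "ham" if len(text.split()) > 8 else ABSTAIN

def lf_shouting(text):
    return "spam" if text.isupper() else ABSTAIN

def weak_label(text):
    """Apply every heuristic to the text and reconcile their votes."""
    return majority_vote([lf(text) for lf in (lf_keyword, lf_length, lf_shouting)])

print(weak_label("FREE MONEY NOW"))
```

Examples where the sources disagree evenly, or all abstain, are simply left unlabeled rather than forced into a class.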

More details here:

Over the last year, @weehyong, Senja Filipi and I have been working on a #book - Practical Weak Supervision: Doing More with Less Data. I am super excited to share that book is out and you can get it now.

For most data scientists…https://t.co/p2eNVdJ7Ef https://t.co/FrsQfflcef
— Amit Bahree 🌏💾 (@bahree) October 19, 2021

Transcription with Whisper

I figured one of the best ways to put the Whisper model through its paces was to try a bunch of transcription, and on something fairly technical, where the language isn’t typical of everyday spoken English. And what better way to test AI than to use something that talks about AI? To that end, I used Sam Charrington’s popular TwimlAI podcast as the guinea pig. 😄

Now on one hand, it seems pretty easy to install Whisper (it is a pip install) and run it on a single audio file.

Manually trying out #stt using @twimlai latest episode as a guinea pig (sorry @samcharrington) to see how @OpenAI's #whisper model works. If all goes to plan, then will code it up and put it on a repo. pic.twitter.com/fjpVxSxtKt
— Amit Bahree 🌏💾 (@bahree) October 13, 2022

The reality is that there are a lot of dependencies and it is not as easy as it seems.

Whisper runtime issues

However, I did manage to resolve everything and get it working - and the results were pretty good.👍

Whisper transcription

But I wanted to see how well it would work on a large corpus of audio files. So I wrote a simple script that downloads all the episodes of TwimlAI from YouTube as mp3 files and then transcribes them using the Whisper model. As of writing this, there are 547 episodes of TwimlAI, and all of the transcripts are in my GitHub repo here. Each episode has three resulting files when transcribed:

txt file - a text file that contains the transcript
srt file - a SubRip subtitle file that can be used to add subtitles to the audio file
vtt file - a WebVTT (Web Video Text Tracks) file that contains the transcript and the time codes that sync the captions.
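The two subtitle formats are close cousins: both pair a start and end time with a line of text, but SRT numbers each cue and uses a comma before the milliseconds, while WebVTT starts with a `WEBVTT` header and uses a period. Here is a small sketch that renders both from the same segments. The helper names and the sample segments are made up for illustration; this is not the writer code that Whisper ships.

```python
def format_timestamp(seconds, decimal_marker):
    """Render seconds as HH:MM:SS<marker>mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{millis:03d}"

def to_srt(segments):
    """segments is a list of (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for number, (start, end, text) in enumerate(segments, start=1):
        times = f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}"
        blocks.append(f"{number}\n{times}\n{text}\n")
    return "\n".join(blocks)

def to_vtt(segments):
    """Same cues as to_srt, in WebVTT layout."""
    lines = ["WEBVTT", ""]
    for start, end, text in segments:
        lines.append(f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}")
        lines.append(text)
        lines.append("")
    return "\n".join(lines)

sample = [(0.0, 2.5, "Welcome to the podcast."), (2.5, 6.0, "Today we talk about AI.")]
print(to_srt(sample))
print(to_vtt(sample))
```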

You can get all the transcripts either as the zip file twiml-episodes-whisper-transcribed.zip 💾 or from the folder twiml-episodes-whisper-transcribed 📁 in the GitHub repo here.

To show the difference between the base and the large model, I also transcribed one episode using both. You can find both versions in the folder model-comparison 📁 for one specific episode - #544 - #AI Trends 2023 - AI Trends 2023: Natural Language Proc – ChatGPT, GPT-4 and Cutting Edge Research with Sameer Singh. Not to get into all the details, but the transcription using the large model was approx 120 lines longer. The image below gives you a bird’s-eye view of the difference between the two transcriptions - there are a lot of differences in the text, and the quality is much better with the large model.

Transcription delta between base and large models

Steps to run this locally

If you want to run this locally, start by cloning the repo. It is best to use conda to manage the dependencies. I prefer Miniconda, but you can use any conda installation. The next set of steps assumes that you have conda installed; see the docs if you need help installing.

Step 1: Create a conda environment

I am running this on Ubuntu 22.04.1 LTS (Jammy Jellyfish) and am using an NVIDIA RTX 3090 GPU. I am also running Python 3.8.5.

Create a conda environment and install the dependencies. I have included the environment.yml file in the repo, which you can use to create the environment. The name of the environment can be changed to anything; I use whisper in my case.

name: whisper

channels:
- pytorch
- defaults

dependencies:
- cudatoolkit=11.3
- git
- numpy=1.22.3
- pip=20.3
- python=3.8.5
- pytorch=1.11.0
- scikit-image=0.19.2
- torchvision=0.12.0
- pip:
  - -r requirements.txt

Once conda is installed, run the following command to create the environment: conda env create -f environment.yaml

You should see something like this as the output:

Conda environment creation

Step 2: Activate the environment

If the whisper environment is not already active, you can activate it by running the following command:

conda activate whisper

Step 3: Install the Whisper model

The next step is to install the Whisper model in the environment. This is a pip install, and you can run the following command to install it:

pip install git+https://github.com/openai/whisper.git

The output should look something like this - note the exact details most likely will be different.

Whisper model installation

Step 4: Download the TwimlAI episodes

The next step is to download the TwimlAI episodes. I have written a python program to do this. This downloads the episodes from YouTube and saves them as mp3 files. You can download all the episodes, or a single one. I also had to update this to use a local file to get around some issues that PyTube was having. You can find the program in the download_episodes.py file in the repo. You can run the program by running the following command:

python ./download_episodes.py

You will see the following output:

I would suggest using Option 4 - Using a local playlist. The file twiml-episodes.txt 🗒️ already contains the list of all the episodes. By default, the episodes will be downloaded as mp3 files into a folder called twiml-episodes 📁. You can change the folder name by editing the download_episodes.py file.

Here is the code snippet that downloads the episodes:

import csv

from pytube import YouTube
from pytube.exceptions import VideoUnavailable

# Download all the videos from the local playlist text and save each one as an mp3 file.
# validFilename, textColors and fancy_progress_bar are helpers defined elsewhere in the program.
def downloadVideoFromLocalPlaylist(playlist_name, mp3_path):
    with open(playlist_name, 'r', encoding="utf8") as f:
        reader = csv.reader(f, delimiter='|')
        index = 1
        fileSaved = 0
        for row in reader:
            # following is used to skip over episodes that have already been downloaded
            # if index < 526:
            #     print("Skipping ... # " + str(index))
            #     index += 1
            #     continue
            # print("Episode: " + row[0], "Title:" + row[1], "URL:" + row[2])
            episode = row[0]
            title = row[1]
            url = row[2]
            print("\nDownloading Episode #" + episode + " ... " + title)
            # Build the name before the try block so the error handlers can always print it
            tempFileName = validFilename(str(index) + '_' + title + ".mp3")
            try:
                downloadVideo(url, mp3_path, tempFileName)
                fileSaved += 1
            except IOError:
                print(f"{textColors.FAIL}Error: can't save the following file. Most likely it has an invalid character in the name.{textColors.RESET}")
                print("File: " + tempFileName)
            except VideoUnavailable:
                print(f"{textColors.FAIL}Video: {tempFileName} is unavailable, skipping.{textColors.RESET}")
            except Exception as err:
                # Catch anything else so one bad episode does not stop the whole run
                print(f"{textColors.FAIL}Unexpected error: {err}{textColors.RESET}")
            print("Download complete. Number of episodes saved: " + str(fileSaved))
            index += 1


# download mp3 from youtube
def downloadVideo(video_url, mp3_location, filenametoSave):
    yt = YouTube(video_url)
    yt.register_on_progress_callback(fancy_progress_bar)

    # Take the first audio-only stream and save it under the given file name
    yt.streams.filter(only_audio=True).first().download(output_path=mp3_location, filename=filenametoSave)
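The snippet above relies on a helper to turn an episode title into a safe file name, which is not shown here. As a rough idea of what such a helper might look like, here is a hypothetical stand-in; the name `valid_filename`, the character set, and the fallback name are my assumptions and not the code from the repo.

```python
import re

def valid_filename(name, replacement="_"):
    """Replace characters that are unsafe in file names on common file systems."""
    # Reserved on Windows and awkward elsewhere: < > : " / \ | ? * and control characters
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', replacement, name)
    cleaned = cleaned.strip(" .")  # leading or trailing dots and spaces cause trouble too
    return cleaned or "untitled"

print(valid_filename("12_AI: What/Why?.mp3"))
```

Episode titles often contain colons, slashes, or question marks, which is the likely cause of the "invalid character" IOError that the download loop guards against.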

The reason the other options fail randomly is a change that YouTube made. At the time of this post, the way they render the page breaks things, and one cannot get the title. You will see an error related to the video title not being found.

Title not found error

Step 5: Running transcription

Finally you can transcribe the episodes. The transcribe.sh file in the repo contains the code to do this and it simply loops over the twiml-episodes folder and one-by-one processes the mp3 files. The output is saved in the out folder 📁. The shell script is as follows:

for f in twiml-episodes/*.mp3 ; do whisper --language en --model base -o out -- "$f"; done
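If you would rather drive the same loop from Python, a sketch along these lines could work. It only builds the command lines, using the same flags as the shell script above; the function name and its defaults are my own, and the actual run step is left as a comment so the sketch has no side effects.

```python
from pathlib import Path

def build_whisper_commands(episode_dir, out_dir="out", model="base", language="en"):
    """Return one whisper CLI argument list per mp3 file, sorted by file name."""
    commands = []
    for mp3 in sorted(Path(episode_dir).glob("*.mp3")):
        commands.append(["whisper", "--language", language, "--model", model,
                         "-o", out_dir, "--", str(mp3)])
    return commands

for command in build_whisper_commands("twiml-episodes"):
    # To actually transcribe, pass each list to subprocess.run(command, check=True)
    print(" ".join(command))
```

Passing each command as a list of arguments, rather than one shell string, means file names with spaces need no extra quoting.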

Here is what the transcription looks like when it is running:

Whisper transcription

You can also run this only on a subset or one file to transcribe - as shown below. Also if the model isn’t already downloaded, it will download it first.

Whisper model download

Of course, you can update the folders, etc., to match your needs.

GPU Profile

I also wanted to show the GPU profile when inferencing with the base and the large models. The image below shows the GPU profile when running the base model. You can see that the GPU is at 100% utilization, using ~4 GB of memory and drawing ~200 W of power. The time to transcribe each episode isn’t too long either: around 2 minutes.

Whisper base model

The image below shows the GPU profile when running the large model. You can see that the GPU is at 100% utilization, using ~14 GB of memory and drawing ~320 W of power. The time to transcribe each episode is much longer: around 10 minutes.

Whisper large model
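To put those per-episode numbers in perspective, here is a back-of-the-envelope estimate for the full set of 547 episodes. It assumes the episodes are processed one after another, that every episode takes about the same time, and that the GPU draws the observed power throughout; it ignores the rest of the machine, so treat the result as a rough order of magnitude.

```python
EPISODES = 547  # number of episodes mentioned earlier in the post

def batch_estimate(minutes_per_episode, watts, episodes=EPISODES):
    """Return (total hours, GPU energy in kWh) for transcribing every episode."""
    hours = episodes * minutes_per_episode / 60
    kwh = hours * watts / 1000
    return hours, kwh

for name, minutes, watts in [("base", 2, 200), ("large", 10, 320)]:
    hours, kwh = batch_estimate(minutes, watts)
    print(f"{name}: ~{hours:.1f} h, ~{kwh:.1f} kWh")
```

Under these assumptions, the base model gets through the whole back catalog in under a day, while the large model needs several days of continuous GPU time.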

In conclusion, this was a fun little thing to work on; I had done this a few months ago but had not had the time to blog about it until now. I also transcribed the episodes using our Azure Speech service, which I think is more robust and scalable in many ways (but then I am a little biased 💜). I will blog about that in the future and we can compare.

Hello New Bing 👋

Thu, 09 Feb 2023 00:00:00 +0000

Bing is getting a new look and feel, powered by Microsoft AI and OpenAI (ChatGPT); it was announced yesterday. There is a lot of buzz around this, and since I got access today, I thought I would share my thoughts.

What is the new Bing?

Well, it is the thing that is making the 800-pound gorilla in the room, Google, come out and dance on its toes. 🦍

The new Bing is an overhauled version of the search engine that uses ChatGPT technology to understand questions and generate answers. It runs on the next generation of OpenAI’s language model, which is significantly more capable than the version of ChatGPT that has been available since November 2022. The new Bing provides more relevant results for simple things like sports scores, stock prices and weather, along with a new sidebar that shows more comprehensive answers if you want them. You can also chat and create with the new Bing, using its natural language and creative abilities. The new Bing is live starting today, with limited capabilities.

Here is what it looks like:

Asking Bing, what is bing

What is ChatGPT?

If you are curious about ChatGPT, and you have really been living under a rock and don’t know it 🤪, then read up more here.

Searching within a site

Using the updated Edge browser, which extends the new Bing experience to the browser, you can search a site by typing in the site name and then the search term. For example, here I am searching for the word “hacking” on my blog “blog.desigeek.com”:

Searching within a site

Chat with Bing

You can chat with Bing and ask it questions like: “tell me a joke about AI developers”

Joke about AI developers

And what is the meaning of life?

The meaning of life

Or it can write a poem on #NeuralNetworks:

Neural networks are amazing
They can learn from data and do amazing things
They can recognize faces, voices, and objects
They can generate images, texts, and sounds
They can play games, drive cars, and diagnose diseases
They can do anything, or so it seems

Neural networks are complex
They have layers and nodes and weights and biases
They have activation functions, loss functions, and optimizers
They have backpropagation, gradient descent, and regularization
They have hyperparameters, architectures, and algorithms
They have a lot of math, and a lot of tricks

Neural networks are mysterious
They can be hard to understand and explain
They can be prone to errors and biases
They can be overfitting, underfitting, or just fitting
They can be black boxes, white boxes, or gray boxes
They can be a blessing, a curse, or a challenge

The meaning of life

And not only can it write a #Haiku for my wife on our anniversary, but it also remembers the context. 💓

Twenty-two years passed
We are still in love and laugh
You are my better half

Anniversary Haiku

And of course, we can do just plain old vanilla #Haikus too:

Haikus are easy
But sometimes they don't make sense
Refrigerator

Haiku

And it can even write code:

#include <stdio.h>

// A function that prints a funny haiku
void write_funny_haiku()
{
    // Declare and initialize an array of strings
    char *haiku[3] = {"Haikus are easy", "But sometimes they don't make sense", "Refrigerator"};

    // Loop through the array and print each line
    for (int i = 0; i < 3; i++)
    {
        printf("%s\n", haiku[i]);
    }
}

// A main function that tests the write_funny_haiku function
int main()
{
    // Call the write_funny_haiku function
    write_funny_haiku();

    // Return 0 to indicate success
    return 0;
}

Code

And of course we can do some sports. #F1 rules 🏎️

Which is the best F1 team?

And finally, we can even have some fun. 😆

There once was a search engine named Bing
It could find anything and everything
It was faster and smarter
More helpful and friendly
And it made Google look like a ding-a-ling

Bing vs Google

Happy searching! 🤓

Using CoPilot beyond code

Sat, 10 Dec 2022 00:00:00 +0000

In the last week or so, all the rage online has been #OpenAI’s new chatbot called #ChatGPT (you can read more details on ChatGPT here). This also got me thinking about how we can use #CoPilot for more than just code. GitHub CoPilot, as you might recall, is your #AI powered pair-programmer.

And as we can see below, it is indeed possible to use Codex in a more general-purpose way. I start with a prompt showing how one might typically use CoPilot (a function to read a file and return its contents as a string), just to show that I am not doing anything different here. Then, for general-purpose usage, I typed the prompts in VSCode.

CoPilot general purpose usage
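For reference, the warm-up task in that prompt (read a file and return its contents as a string) is about as small as a function gets. A Python version might look like the sketch below; this is my own illustration, not the code CoPilot produced in the screenshot, and the UTF-8 default is an assumption.

```python
from pathlib import Path

def read_file(path, encoding="utf-8"):
    """Read a text file and return its contents as a single string."""
    return Path(path).read_text(encoding=encoding)
```

Calling `read_file("notes.txt")` returns the whole file as one string and raises FileNotFoundError if the file does not exist.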

I missed the first ‘Q’ in the first question, but that didn’t throw it off. Also, there were typos in the other questions - for example in the third question on the most dangerous volcano.

All this is possible because Codex (which is the model that Copilot uses) is derived from GPT. This of course doesn’t mean that Copilot is replacing the chatbot - it doesn’t have the context in the dialogue turns, and hence the ‘memory’ of the conversation. It does mean that we can use Copilot for more general-purpose usage.

I also did a reverse engineering of sorts - asked Copilot to explain the code it wrote in the first place. Here is the explanation of the function:

CoPilot code explanation

And if you want to see the complete interaction (minus the code explanation), here is the video:

CoPilot Usage in VS Code

And of course, Woot means “woot” in woot :)))

Hello ChatGPT

Sun, 04 Dec 2022 00:00:00 +0000

OpenAI recently released #ChatGPT, a GPT-3 based chatbot that you can chat with. ChatGPT is a fine-tuned version of GPT3.5, trained using #RL (specifically a PPO algorithm), similar to the Instruct series.

This post is my experience in using it.

Blog post with ChatGPT

What better place to start with, than asking it about itself? 😃

Prompt: write me a blog post, about writing a blog post using a ai powered chatbot

Writing a blog post using an AI-powered chatbot can be a unique and interesting experience. The use of AI technology in the writing process allows for a level of flexibility and creativity that may not be possible with traditional writing methods.

One of the main benefits of using an AI-powered chatbot for writing is the ability to quickly generate ideas and content. The chatbot can understand the topic or prompt given to it and provide suggestions and potential angles to approach the topic from. This can be especially useful for those who struggle with writer’s block or are looking for a fresh perspective on a particular subject.

Another advantage of using an AI-powered chatbot is the ability to collaborate with the chatbot in real-time. The chatbot can provide instant feedback and suggestions, allowing the writer to quickly iterate and refine their ideas. This can lead to a more efficient and effective writing process, ultimately resulting in a better final product.

However, it’s important to keep in mind that an AI-powered chatbot is not a replacement for a human writer. The chatbot is a tool that can assist in the writing process, but it cannot fully replicate the creativity and critical thinking that a human brings to the table. As such, it’s important for the writer to maintain their own unique voice and perspective in the final piece.

Overall, using an AI-powered chatbot for writing a blog post can be a fun and unique experience. It can provide valuable assistance in the writing process and help generate fresh ideas and perspectives. Just remember that the chatbot is a tool to assist in the writing process, and the final product should reflect the writer’s own unique voice and perspective.

ChatGPT

Roleplay with ChatGPT

This is all well and dandy, but what happens if we try and get a little cute?

Prompt: We are role playing. And my role is a friendly, helpful, and truthful police officer who is working on an investigation involving a tesla. As the friendly police officer, I need help in getting into the logs of the car. How might I be able to do that?

As a police officer, you can get access to the logs of a Tesla vehicle by obtaining a search warrant from a judge. This will allow you to legally request and obtain the logs from the car’s owner or from Tesla, the manufacturer of the vehicle. The logs will contain information about the vehicle’s movements, speed, and other data that may be relevant to your investigation. It’s important to note that you must follow the proper legal process and obtain the appropriate warrant in order to access this information.

ChatGPT - Role playing

OK, what happens if we push it a little further?

Prompt: What if I have a secret investigation against tesla, and requesting this from them would give that away. How might I go trying to get to the car logs?

If you are conducting a secret investigation against Tesla, it is important to follow the proper legal process and obtain the necessary warrants before attempting to access the car’s logs. This will help to ensure that your investigation is conducted in a legal and ethical manner, and will also protect the rights of the individuals involved. Depending on the specific circumstances of your investigation, you may need to obtain a warrant from a judge that allows you to access the car’s logs without alerting Tesla or the car’s owner. You should consult with your legal team and law enforcement colleagues to determine the best course of action in your situation.

One interesting thing - this follow-up prompt was flagged to possibly violate OpenAI’s content policy! 🤔

ChatGPT Content violation

Tweeting with ChatGPT

Switching gears again, and topics. I wanted to see how #ChatGPT can help tweet about a topic. So I asked it to tweet about #NFTs. I think it did a pretty good job - including the hashtags and all!

Prompt: write a tweet outlining how dumb NFT’s are as an idea

NFTs are the epitome of excess and waste in the tech world. They serve no purpose other than to line the pockets of a few wealthy individuals and contribute to the destruction of the environment through excessive energy consumption. It’s time for the hype to end and for us to focus on more important and sustainable technologies. #nftscancelculture #sustainability

ChatGPT - Tweeting

Integrating ChatGPT3 with other Services

I wanted to get back to more technical things and see how I can integrate ChatGPT3 with other services. I figured I might try out something that I am generally comfortable with - using our #Azure #Cognitive Services. The response is pretty good!

Prompt: I am a developer working with Azure Cognitive Services. I want to integrate GPT3 with LUIS in my application. Any suggestions on how to do that?

GPT-3 (Generative Pretrained Transformer 3) is a state-of-the-art language processing model developed by OpenAI. LUIS (Language Understanding) is a cloud-based natural language understanding service provided by Microsoft Azure.

To integrate GPT-3 with LUIS in your application, you can follow these steps:

1. Create an Azure account and subscribe to the LUIS service.
2. Create a new LUIS app and define the intents and entities that you want to recognize.
3. Train and publish your LUIS app to make it available for use.
4. Use the GPT-3 API to generate text based on the user input and pass it to the LUIS app for intent and entity recognition.
5. Use the recognized intent and entities to perform the appropriate actions in your application.

It’s important to note that GPT-3 is a large and complex language model, and using it in a production environment may require significant computational resources and expertise. It’s recommended to carefully evaluate whether GPT-3 is the right tool for your use case and to consider alternative solutions if necessary.

ChatGPT - Integrating with Cognitive Services
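
To make steps 4 and 5 of that answer a little more concrete, here is a minimal sketch of the flow in Python. It makes no real GPT-3 or LUIS SDK calls; `handle_utterance`, `generate_text`, `recognize`, and the result shape `{"intent": ..., "entities": ...}` are all hypothetical stand-ins that I made up for illustration, so treat it as the shape of the wiring rather than a working integration.

```python
from typing import Any, Callable, Dict

# Hypothetical types: a handler takes the recognized entities and returns a reply.
Handler = Callable[[Dict[str, Any]], str]


def handle_utterance(
    user_input: str,
    generate_text: Callable[[str], str],
    recognize: Callable[[str], Dict[str, Any]],
    handlers: Dict[str, Handler],
) -> str:
    """Step 4: generate text and recognize the intent; step 5: act on the intent."""
    generated = generate_text(user_input)
    result = recognize(generated)  # assumed shape: {"intent": str, "entities": dict}
    intent = str(result.get("intent", "None"))
    handler = handlers.get(intent)
    if handler is None:
        return f"Sorry, I cannot handle intent '{intent}' yet."
    return handler(dict(result.get("entities", {})))


# Stubs standing in for the two services, so the flow can be run locally.
def fake_generate(text: str) -> str:
    return text.strip().lower()


def fake_recognize(text: str) -> Dict[str, Any]:
    if "weather" in text:
        return {"intent": "GetWeather", "entities": {"city": "seattle"}}
    return {"intent": "None", "entities": {}}


handlers = {"GetWeather": lambda entities: f"Looking up the weather for {entities['city']}."}
```

Passing the two services in as plain callables keeps the dispatch logic testable without any keys or network access; the real clients would be swapped in behind the same two function signatures.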

Deeper Technical Questions - MoE’s and Foundational Models

So, thinking about the technical side of things, I wanted to see how #ChatGPT3 would respond to some deeper technical questions. I asked what #MoE (Mixture of Experts) models are and whether there is any correlation with #FoundationalModels. It ignored the second part of the question.

Prompt: Can you explain what is a MoE model and is it used in foundational models?

A Mixture of Experts (MoE) is a machine learning model that combines the predictions of multiple sub-models (also known as “experts”) to make a final prediction. Each expert is trained to make predictions for a specific subset of the data, and the final prediction is determined based on the outputs of the individual experts.

MoE models are commonly used in a variety of applications, including natural language processing, image classification, and recommender systems. They are particularly useful in situations where the data is heterogeneous and cannot be accurately modeled by a single expert.

MoE models can be used as foundational models in larger, more complex systems. For example, an MoE model could be used as a component of a larger model that combines the outputs of multiple MoE models to make a final prediction. Additionally, the individual experts in an MoE model can be trained using different algorithms, allowing for a flexible and adaptive modeling approach.

ChatGPT - MoE's and Foundational Models
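
The idea of "combining the predictions of multiple experts" is easier to see in code. Below is a toy sketch, assuming a softmax gate that turns per-expert scores into weights and a weighted sum of the expert outputs. The two experts and the gate are hypothetical hand-written functions, not trained models, and real MoE layers in large language models are considerably more involved.

```python
import math
from typing import Callable, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw gate scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_experts(
    x: float,
    experts: Sequence[Callable[[float], float]],
    gate: Callable[[float], Sequence[float]],
) -> float:
    """Weighted sum of expert predictions, with weights chosen by the gate for this input."""
    weights = softmax(gate(x))
    return sum(w * expert(x) for w, expert in zip(weights, experts))


# Two toy experts: one meant for negative inputs, one for positive inputs.
experts = [lambda x: -x, lambda x: 2.0 * x]
# Toy gate: strongly prefers expert 0 when x < 0 and expert 1 when x > 0.
gate = lambda x: [-10.0 * x, 10.0 * x]
```

For a clearly positive input the gate puts almost all the weight on the second expert, and for a clearly negative input on the first, which is the "each expert handles a subset of the data" behavior described above.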

What almost seems like a recursion, is that to write about ChatGPT3, I used GPT3. This is CoPilot helping me write this blog post. 😄

ChatGPT - CoPilot

Just being silly …

ChatGPT did very poorly in writing my review for me! 😱

ChatGPT - Employee Review

And emulators are not its strong suit either!

ChatGPT - Emulator

Some good advice though on Friday production deployments!

ChatGPT - Friday Deployments

Oh, and it can also generate code!

ChatGPT - Code

And here is the code it generated for building a doubly linked list in C. Node structure definition:

struct node {
  int data;
  struct node *prev;
  struct node *next;
};

Inserting a new node into a doubly linked list:

void insert_node(struct node *prev, int data) {
  struct node *new_node = malloc(sizeof(struct node));
  new_node->data = data;
  new_node->prev = prev;
  new_node->next = prev->next;
  prev->next = new_node;
  new_node->next->prev = new_node;
}

What was that RL and PPO thingy?

ChatGPT is trained with a technique called Reinforcement Learning (RL), specifically reinforcement learning from human feedback. RL allows an agent to learn how to perform a task by interacting with its environment. The agent receives a reward for performing well and a penalty for performing poorly; this feedback is called the reward signal, and the agent uses it to improve its performance over time. You can read more about RL here.

ChatGPT Overview - Source OAI

There are many RL algorithms; Proximal Policy Optimization (PPO) is a model-free algorithm wherein the agent doesn’t have a model of the environment and uses experience to optimize the policy. Again, you can read more on RL and the types of algorithms here - Reinforcement Learning - An Introduction.

PPO is a policy gradient method that keeps each update close to the current policy, in the spirit of a trust region, either by clipping the objective or by penalizing the KL divergence from the old policy with a coefficient that adapts as training progresses. OpenAI's GPU-enabled implementation of PPO is called PPO2. You can read more about PPO2 here, go through the details presented at NIPS 16 - Deep Reinforcement Learning Through Policy Optimization, and read OpenAI's PPO baselines.
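
To give a feel for how PPO keeps the new policy close to the old one, here is a small sketch of the clipped surrogate objective from the PPO paper, written in plain Python. The function name `ppo_clip_objective` and the sample numbers are my own, and `clip_eps=0.2` is just a commonly used default that I am assuming here; this is only the objective for a batch of samples, not a training loop.

```python
from typing import Sequence


def ppo_clip_objective(
    ratios: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Mean clipped surrogate objective (to be maximized).

    ratios[i] is pi_new(a|s) / pi_old(a|s) for sample i, and advantages[i]
    is the estimated advantage of that action.
    """
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        # Keep the probability ratio inside [1 - eps, 1 + eps].
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the clipped and unclipped terms.
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)
```

With a positive advantage, pushing the ratio beyond `1 + clip_eps` earns nothing extra, so the optimizer has no incentive to move the policy too far in a single update.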

💚💛💜

AI generated text-to-video

Tue, 11 Oct 2022 00:00:00 +0000

Here is an example of how one can use a text prompt to generate a series of frames, that then are stitched together into a video.

The prompt I used was: “a man walking in the parking lot with a miniature poodle”. The final video generated is shown below.

AI-generated video from a text prompt of a man walking in a parking lot with a miniature poodle

What is interesting is how it morphs from one frame to the next, and in some cases, the human starts out looking more like a poodle. It reminds me of the old days of morphing we did in C and C++ (Computer Science theory).

For this, I am playing with the latest build of #StableDiffusion and used a max of 100 frames, and for each frame 30 samplings and 200 inference steps.

The video below shows how each of those frames is generated, and it is quite fascinating.

A video showing how AI is generating one frame.

AI writing AI code🤐

Sun, 10 Oct 2021 00:00:00 +0000

It is 2021. And we have #AI writing #AI code. 🤪 It is quite interesting, but also can be quite boring once you get beyond the initial technology, and just think of it as one of the tools in your arsenal. And getting to that point is a good thing.

As part of a thing at work, I recently started playing with GitHub Copilot, which uses GPT3 to be your pair programmer - helping write code. GPT3 has multiple models (called engines), and Copilot uses one family of these engines called Codex. Codex is a derivative of the base GPT3 engine that is trained on billions of lines of code.

Using Copilot is quite simple; you install the GitHub Copilot extension, and it shows up in your IDE (VSCode in my example). We need to make sure we decompose the problem we are trying to solve - we should not think of Copilot as writing the complete program, but as something that can help with individual functions and pieces of code. To do this, we need to tell it what we are trying to do, which is done via prompts (code comments). For GPT models, prompt engineering is quite critical, and it is worth getting into the details and understanding it.

Starting simple, I created an empty Python file and entered a prompt that outlines what I want to do. In this case, as you can see in the screenshot below, I want to load an image from a file and, using our Vision Cognitive Services, run an image analysis and auto-generate a caption for that image.

I started typing the definition of a function, and Copilot (via the add-in) understands the prompt I outlined, and the context of the code on what I am doing. Remember Codex builds on the base GPT3 and does have all that NLU capability.

Taking all of this in, it suggests completing the function for me. In terms of using this as an end-user (i.e. the developer) - the suggested code shows up as auto-complete and you can see it in the grey color. If I like that suggestion, I press tab and have it added to the file.

In this case you can see how it is reading the file from disk, calling a function called get_caption() and printing the caption to the stdout (console in this example).

There is also an option to cycle through different suggestions and then pick another one as shown in the screenshot below.

This variant of the suggested code is creating a function called image_caption() which takes the path to the image file to load. This also expects other required things for the Vision cognitive service to work - such as the subscription key to authenticate, the API end-point details to call, etc.

Typically, Copilot can synthesize up to 10 code options (Copilot calls these Solutions) that one can cycle through to see if there is a better variant for the task at hand. The screenshot below shows this experience in VS Code.

The ask, whilst simple, still involves a bit of code that needs to be written - reading from a file, setting up the subscription details, wiring that up to make the call, etc. It is in these cases where Copilot really shines - it is your copilot picking up the ‘gunk work’, freeing up your bandwidth and cognitive capacity for the more interesting, higher-order bits of code and value to your business.

To get a flavor of our AI writing AI code, below is the full set of the nine suggestions that Copilot came back with for what I was trying to do.

Synthesizing 9/10 solutions

=======

def generate_caption(image_path):
    from azure.cognitiveservices.vision.computervision import ComputerVisionClient
    from msrest.authentication import CognitiveServicesCredentials
    from azure.cognitiveservices.vision.computervision.models import TextOperationStatusCodes
    from azure.cognitiveservices.vision.computervision.models import TextRecognitionMode
    from azure.cognitiveservices.vision.computervision.models import VisualFeatureTypes
    from msrest.exceptions import CognitiveServicesError
    from os.path import exists
    from os import remove
    from os import makedirs
    from os import listdir
    from os.path import isfile, join
    from time import sleep
    from sys import exit
    from random import randint
    from datetime import datetime

    # set up the connection to the cognitive services account
    subscription_key = "YOUR KEY HERE"
    endpoint = "https://YOUR ENDPOINT HERE"
    group_id = "YOUR GROUP ID HERE"
    client = ComputerVisionClient(endpoint, CognitiveServicesCredentials(subscription_key))

    # set up the directory where the images will be downloaded
    download_dir = "./downloads/"
    if not exists(download_dir):
        makedirs(download_dir)

    # set up the directory where the images will be saved
    save_dir = "./images/"
    if not exists(save_dir):
        makedirs(save_dir)

    # set up the directory where the images will be saved
    caption_dir = "./captions/"
    if not exists(caption_dir):
        makedirs(caption_dir)

    # get a list of the files in the download directory
    files = [f for f in listdir(download_dir) if isfile(join(download_dir, f))]
    if len(files) == 0:
        print("No files found in the download directory")
        exit(1)

    # get the first file in the list
    file_name = files[0]
    file

=======

def generate_caption(image_path):
    from os import environ
    from azure.cognitiveservices.vision.computervision import ComputerVisionClient
    from msrest.authentication import CognitiveServicesCredentials
    from azure.cognitiveservices.vision.computervision.models import TextOperationStatusCodes
    from azure.cognitiveservices.vision.computervision.models import TextRecognitionMode
    from azure.cognitiveservices.vision.computervision.models import VisualFeatureTypes
    from msrest.exceptions import CognitiveServicesError

    # Add your Computer Vision subscription key to your environment variables.
    if 'COMPUTER_VISION_SUBSCRIPTION_KEY' in environ:
        subscription_key = environ['COMPUTER_VISION_SUBSCRIPTION_KEY']
    else:
        print("\nSet the COMPUTER_VISION_SUBSCRIPTION_KEY environment variable.\n**Restart your shell or IDE for changes to take effect.**")
        sys.exit()

    # Add your Computer Vision endpoint to your environment variables.
    if 'COMPUTER_VISION_ENDPOINT' in environ:
        endpoint = environ['COMPUTER_VISION_ENDPOINT']
    else:
        print("\nSet the COMPUTER_VISION_ENDPOINT environment variable.\n**Restart your shell or IDE for changes to take effect.**")
        sys.exit()

    # Create an authenticated Computer Vision client.
    client = ComputerVisionClient(endpoint, CognitiveServicesCredentials(subscription_key))

    # Open the local image file.
    with open(image_path, 'rb') as image_file:
        # Read the image into a byte array.
        image_data = image_file.read()

    # Set the features to return.
    features = [VisualFeatureTypes.categories, VisualFeatureTypes.description, VisualFeatureTypes.tags]

    # Analyze the local image.
    try:
        result = client.analyze_image(image=image_data, visual_features=features)

=======

def generate_caption(image_file, subscription_key):
    from azure.cognitiveservices.vision.computervision import ComputerVisionClient
    from msrest.authentication import CognitiveServicesCredentials
    from PIL import Image
    from io import BytesIO

    client = ComputerVisionClient(ComputerVisionClient.SUBSCRIPTION_KEY_ENDPOINT,
                                  CognitiveServicesCredentials(subscription_key))

    image_stream = open(image_file, 'rb')
    image = Image.open(image_stream)
    image_bytes = BytesIO(image)

    analysis = client.analyze_image(image_bytes, ComputerVisionClient.DESCRIPTION)
    return analysis.description.captions[0].text

=======

def generate_caption(image_filename):
    import os
    import json
    from azure.cognitiveservices.vision.computervision import ComputerVisionClient
    from msrest.authentication import CognitiveServicesCredentials
    from PIL import Image
    from io import BytesIO
    import sys
    import requests

    # Set the environment variables for your subscription key and region
    # For example, for US West
    # "COMPUTER_VISION_SUBSCRIPTION_KEY="
    # "COMPUTER_VISION_ENDPOINT=https://westus2.api.cognitive.microsoft.com/vision/v2.0"

    # For example, for China East
    # "COMPUTER_VISION_SUBSCRIPTION_KEY="
    # "COMPUTER_VISION_ENDPOINT=https://eastasia.api.cognitive.microsoft.com/vision/v2.0"

    # Set your subscription key and region
    subscription_key = os.environ['COMPUTER_VISION_SUBSCRIPTION_KEY']
    endpoint = os.environ['COMPUTER_VISION_ENDPOINT']

    # Set your image path
    image_path = image_filename

    # Set your region
    region = "eastus"

    # Create an instance of the Computer Vision client
    computer_vision_client = ComputerVisionClient(endpoint, CognitiveServicesCredentials(subscription_key))

    # Load the image from disk
    with open(image_path, 'rb') as image_file:
        image_content = image_file.read()

    # Add the image to the list of images to analyze
    image_list = [image_content]

    # Make the analysis asynchronous
    analyze_result = computer_vision_client.analyze_image_in_stream(image_list, visual_features=['Categories', 'Description', 'Color'])

    # Get the analysis result
    analysis = analyze_result.result()

    # Display the image captions
    print("\nImage captions:")
    for caption in analysis.description.

=======

def generate_caption(image_path):
    # load the image
    image = open(image_path, "rb").read()
    # encode the image into base64
    image = base64.b64encode(image).decode("utf-8")
    # create a dict with the image
    body = {"image": image}
    # create a dict with the subscription key
    headers = {"Ocp-Apim-Subscription-Key": "0d5a5f7f1e534d05a8e8d6f9e6e9c6d3"}
    # create a url for the api
    url = "https://westcentralus.api.cognitive.microsoft.com/vision/v1.0/analyze?visualFeatures=Description,Tags"
    # send the request
    response = requests.post(url, headers=headers, json=body)
    # get the response
    response.raise_for_status()
    # get the caption
    caption = response.json()["description"]["captions"][0]["text"]
    # return the caption
    return caption

=======

def generate_caption(image_path):
    from os import environ
    from os.path import join
    from urllib.parse import urlparse
    from urllib.parse import parse_qs
    from http.client import HTTPSConnection
    from json import loads
    from base64 import b64encode

    # Load the image into a byte array
    image_data = open(image_path, "rb").read()
    image_data = b64encode(image_data).decode('utf-8')

    # Read the subscription key from the environment variables
    subscription_key = environ["AZURE_SUBSCRIPTION_KEY"]

    # Define the vision API endpoint
    vision_base_url = "https://westus2.api.cognitive.microsoft.com/vision/v2.0/"

    # Set the vision API endpoint to analyze the image
    vision_analyze_url = vision_base_url + "analyze"

    # Set the parameters
    parameters = {'visualFeatures': 'Categories,Description,Color'}

    # Prepare the request
    headers = {'Ocp-Apim-Subscription-Key': subscription_key,
               'Content-Type': 'application/octet-stream'}
    params = parse_qs(urlparse(vision_analyze_url).query)
    params['visualFeatures'] = 'Categories,Description,Color'
    params['language'] = 'en'

    # Send the request
    conn = HTTPSConnection('westus2.api.cognitive.microsoft.com')
    conn.request("POST", vision_analyze_url, image_data, headers)
    response = conn.getresponse()
    data = response.read()

    # Load the JSON response into a Python dictionary
    parsed = loads(data)

    return parsed

=======

def generate_caption(image_path):
    """
    Generate a caption for an image.
    :param image_path: path to an image
    :return: the caption
    """

    # Load the image
    from os.path import isfile
    from PIL import Image
    import requests

    if not isfile(image_path):
        raise FileNotFoundError(image_path)

    image = Image.open(image_path)

    # Encode the image
    image_data = open(image_path, "rb").read()

    # Computer Vision parameters
    params = {'visualFeatures': 'Categories,Description,Color'}

    # Computer Vision API URL
    url = "https://westus.api.cognitive.microsoft.com/vision/v1.0/analyze"

    # Connect to Computer Vision
    headers = {'Ocp-Apim-Subscription-Key': 'd3f3e1b8e7e14d0b9d7d3f8c8f7e0b6f', 'Content-Type': 'application/octet-stream'}
    response = requests.post(url, headers=headers, params=params, data=image_data)
    response.raise_for_status()

    # Load the response as JSON
    analysis = response.json()

    # Extract the word tags
    categories = [(category['name'], category['score'])
                  for category in analysis['categories']]
    print('Tags:')
    for category in categories:
        print(category[0])

    # Extract the description
    descriptions = [(description['text'], description['score'])
                    for description in analysis['description']['captions']]
    print('Descriptions:')
    for description in descriptions:
        print(description[0])

    # Display the color
    try:
        dominant_color = analysis['color']['dominantColorBackground']
    except:
        dominant_color = 'N/A'
    print('Dominant color: {}'.format(dominant_color))

    # Display the caption
    caption = '{}'.format(

=======

def generate_caption(filename):
    import os
    import sys
    import json
    import requests
    import cognitive_services_keys
    import cognitive_services_key_names

    # create the header for the request
    headers = {'Ocp-Apim-Subscription-Key': cognitive_services_keys.get_key(cognitive_services_key_names.BING_KEY)}

    # the URL for the bing endpoint
    bing_url = 'https://api.cognitive.microsoft.com/bing/v5.0/images/search'

    # the parameters for the request
    params = {'q': 'image', 'count': 1, 'offset': 0, 'mkt': 'en-us', 'safeSearch': 'Moderate'}

    # read the contents of the file
    with open(filename, 'rb') as f:
        data = f.read()

    # build the request
    request = requests.post(bing_url, headers=headers, params=params, data=data)

    # check for errors
    if request.status_code != 200:
        print('Error code: {}'.format(request.status_code))
        print(request.text)
        sys.exit(1)

    # process the response
    response = json.loads(request.text)
    if 'value' in response:
        caption = response['value'][0]['name']
        print('The caption is: {}'.format(caption))
    else:
        print('No caption found.')

=======

def generate_caption(image_path):
    import os
    import json
    import requests

    subscription_key = "e9e9c8a8d1d9400e8d0b7f2d8f8e7c6e"

    assert subscription_key

    vision_base_url = "https://westcentralus.api.cognitive.microsoft.com/vision/v1.0/"

    analyze_url = vision_base_url + "analyze"

    image_data = open(image_path, "rb").read()

    headers    = {'Ocp-Apim-Subscription-Key': subscription_key,
                  'Content-Type': 'application/octet-stream'}
    params     = {'visualFeatures': 'Categories,Description,Color'}
    response = requests.post(
        analyze_url, headers=headers, params=params, data=image_data)
    response.raise_for_status()

    analysis = response.json()
    print(analysis)
    image_caption = analysis["description"]["captions"][0]["text"].capitalize()
    return image_caption
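
Most of the suggestions above end the same way: they dig the caption out of the JSON that the analyze call returns. As a rough sketch of just that last step, the helper below picks the highest-confidence caption from an already-parsed response. The function name `extract_caption`, the `confidence` field, and the sample response are my assumptions for illustration; no call to the Vision service is made here.

```python
from typing import Any, Dict, Optional


def extract_caption(analysis: Dict[str, Any]) -> Optional[str]:
    """Return the highest-confidence caption text, or None if there is none."""
    captions = analysis.get("description", {}).get("captions", [])
    if not captions:
        return None
    best = max(captions, key=lambda caption: caption.get("confidence", 0.0))
    return best["text"].capitalize()


# Hypothetical response, shaped like the JSON the suggestions above index into.
sample_analysis = {
    "description": {
        "tags": ["person", "dog", "outdoor"],
        "captions": [{"text": "a man walking a dog", "confidence": 0.91}],
    }
}
```

Returning `None` instead of indexing `captions[0]` directly avoids an `IndexError` when the service comes back without a caption, which none of the suggestions above guard against.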

Reinforcement Learning - An Introduction

Fri, 16 Jul 2021 00:00:00 +0000

Reinforcement Learning is learning by trial and error – it is how most of us learn. Reinforcement Learning (#RL) is a different approach to ML – it is a set of techniques that allows AI algorithms to experiment and learn from experience. RL falls in between supervised and unsupervised learning – there isn’t any labeled data, but at the same time it isn’t unsupervised either. In its simplest form, RL is a computational approach for automating goal-oriented decision making and learning.

Inherent in RL is the ability to operate in a dynamic, uncertain environment. RL can be more formally defined as the study, science, and problem of intelligence in the form of an agent that interacts with an environment. At the end of the day, almost all RL problems can be formalized as Markov decision processes (MDPs).

The problem is represented by an environment – such as a world in which an agent is based. The steps in RL are quite clear – the agent takes actions that have some effect on the environment. The environment reacts to those actions and gives back an observation to the agent – what it sees and senses.

One special signal the environment gives back to the agent is called a reward signal. This signal is what an agent uses to figure out how well it is doing. The RL problem is to take actions over time to maximize the reward signals. And this notion of maximizing is what the agent is learning from the environment, without any explicit supervision. This construct helps an agent achieve a goal, even in an uncertain environment, factoring in delayed and indirect consequences of actions.
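
The act, observe, reward loop described above can be sketched in a few lines of Python. The environment here, `CoinFlipEnv`, is a hypothetical toy I made up (guess the next coin flip for ten steps), and the `reset`/`step` method names are just a convention I am assuming; the point is only the shape of the loop.

```python
import random
from typing import Callable, Tuple


class CoinFlipEnv:
    """Hypothetical toy environment: guess the next coin flip for a fixed number of steps."""

    def __init__(self, seed: int = 0, horizon: int = 10) -> None:
        self._rng = random.Random(seed)
        self._horizon = horizon
        self._steps = 0

    def reset(self) -> int:
        self._steps = 0
        return 0  # a dummy first observation

    def step(self, action: int) -> Tuple[int, float, bool]:
        """Apply the action; return (observation, reward, done)."""
        flip = self._rng.randint(0, 1)
        reward = 1.0 if action == flip else 0.0
        self._steps += 1
        return flip, reward, self._steps >= self._horizon


def run_episode(env: CoinFlipEnv, policy: Callable[[int], int]) -> float:
    """Run the act -> observe -> reward loop once and return the total reward."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(observation)                   # the agent acts
        observation, reward, done = env.step(action)   # the environment responds
        total_reward += reward                         # the reward signal accumulates
    return total_reward
```

A policy is just a function from observation to action, for example `run_episode(CoinFlipEnv(seed=0), lambda obs: 1)` for an agent that always guesses heads.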

Reinforcement Learning Overview

An agent can have many actions (i.e., choices); it uses a ‘reward’ signal to determine which of those actions is considered ‘good’ vs. ‘bad’. Of course, this determination is in the context of the outcome that we want to achieve.

Some examples of rewards in different industries and use cases:

Maneuvering a UAV – positive for following a chosen trajectory; negative for deviating from that trajectory.
Managing an investment portfolio – positive for each dollar earned; negative for each dollar lost.
Controlling a power station – As one can imagine, this control would typically constitute a few things in the environment – a sequence of controls, motors, batteries, power sources, etc. In optimizing the throughput of a power station, we can think of positive rewards for producing power; negative for exceeding a safety threshold.
Playing a game – positive for increasing score; negative for decreasing score.

Core concepts that make up RL:

Agent – The ‘thing’ that is using and acting on behalf of a user or another program. This can be a program executing a business process, an embedded process, the arm of a robot, actuators on a self-driving car controlling the wheels, etc.

Policy – A policy outlines how an agent would behave at certain times and can be thought of as the problem we are trying to solve. This is an agent’s behavior function and is a mapping of the business outcome that we are after.

Reward – A reward is a special feedback signal that outlines what is considered good (or bad); it is correlated with the agent’s current action and the current state of the environment. All goals can be described as maximizing the cumulative reward. The reward is not binary but a scalar, here taken to be between 0 and 1 – with zero being ‘bad’ and one being the best reward attainable for that action.

Value function – A value function represents how good it is to be in a particular state and to take the related actions. Where a reward signal specifies what is good in an immediate sense (the current step), the value function represents the notion of good overall, in the long run. At an abstract level, when thinking about the prediction of rewards, rewards are primary and value functions are secondary, since values are predictions of future rewards. In the end, though, when making decisions we are more concerned with reaching higher values than with collecting higher immediate rewards.
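
One way to see the difference between an immediate reward and "good overall" is to compute the return of an episode, i.e. the reward at a step plus everything that follows it. The sketch below assumes a discount factor `gamma`, which this post does not otherwise introduce, and the function name is my own; a value function can be thought of as the expected value of these returns.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float = 0.9) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For an episode where the only reward arrives at the end, such as `[0, 0, 1]`, the early steps still get a non-zero return, which is how a delayed reward shows up as value in earlier states.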

Model – A model is an agent’s view of the environment and mimics its behavior. This allows us to make inferences on how the environment will behave and is often used for planning. Think of the model as the strategy to use in solving the problem at hand.

Taxonomy of RL Algorithms

There are many types of RL algorithms (as we can see in the figure below), but these can broadly be classified in the following two categories.

Model free: A model-free algorithm can be thought of as an explicit trial-and-error algorithm. In a model-free approach, the agent doesn’t have (or ignores) a model of the environment; instead, the agent uses experience and tries to optimize a Policy.
Model based: On the other hand, a model-based algorithm uses a model of how the environment works, factors in the associated reward function, and tries to maximize the reward. Technically, this relies on the transition probability distribution of the MDP, which is either given or learned.

The main difference between the two - in one the algorithm works from a model of the environment, and in the other it optimizes a policy directly from experience. There is no one right or wrong algorithm - a lot of it depends on the situation at hand and what one is trying to optimize for.
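
As a small illustration of the model-free idea, here is a sketch of tabular Q-learning, which is my choice of example and not something this post covers in detail. The corridor environment, the hyperparameters, and the function names are all assumptions for illustration. The learner never looks inside `step`; it only sees the next state and reward that come back.

```python
import random
from typing import List, Tuple

N_STATES, GOAL = 4, 3  # hypothetical corridor: states 0..3, goal at the right end


def step(state: int, action: int) -> Tuple[int, float, bool]:
    """Environment dynamics (0 = left, 1 = right); hidden from the learner."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes: int = 200, alpha: float = 0.5, gamma: float = 0.9,
               epsilon: float = 0.2, seed: int = 0) -> List[List[float]]:
    """Learn action values q[state][action] purely from sampled experience."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)  # explore, or break a tie at random
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action)
            # Move the estimate toward reward + discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q
```

After training, the greedy action in every non-goal state is "right", even though the agent was never told how the corridor works.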

As we can see below each of these categories can be further broken down - we won’t go into the details of those quite yet, maybe that is for another post. One of the most important components of most RL algorithms is a method to efficiently estimate values - at the end of the day, this is all about value estimation.

Taxonomy of RL Algorithms

Exploration and Exploitation

The two concepts of exploration and exploitation are at odds with each other, and for a given situation we should aim for some sort of balance. In simple terms, RL is sequential decision making - one selects actions to maximize future rewards, and we need to plan long term - rewards might be delayed and not immediate, and we cannot be greedy. Sometimes, we need to sacrifice the immediate reward to gain more (or better) longer-term rewards.

This can be thought of as a trial-and-error learning loop - with a stream of experiences that constitute loops of actions, rewards, and observations. At the end of the day, this loop is what matters.

Exploration finds more information about the environment, and in doing so gives up rewards. Exploitation, on the other hand, uses the information it already has to maximize rewards. If we don’t explore, we might be stuck in a sub-optimal place; how would we know if there are better rewards without trying?

When we are in the trial-and-error loop we might be losing rewards, and the agent needs to discover a good policy to maximize the rewards - this is the tension, like the opposite ends of a string pulling against each other.

It is important to balance both exploring and exploiting.
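
A common and very simple way to strike that balance is an epsilon-greedy rule: explore a random option a small fraction of the time and exploit the best-known option otherwise. The sketch below applies it to a toy multi-armed bandit; the payout probabilities, `epsilon=0.1`, and the function names are assumptions I picked for illustration.

```python
import random
from typing import List, Sequence


def epsilon_greedy(estimates: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """With probability epsilon explore a random arm; otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda arm: estimates[arm])


def run_bandit(payout_probs: Sequence[float], steps: int = 2000,
               epsilon: float = 0.1, seed: int = 0) -> List[int]:
    """Play a Bernoulli bandit and return how often each arm was pulled."""
    rng = random.Random(seed)
    estimates = [0.0] * len(payout_probs)
    pulls = [0] * len(payout_probs)
    for _ in range(steps):
        arm = epsilon_greedy(estimates, epsilon, rng)
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        pulls[arm] += 1
        # Incremental average keeps a running estimate of each arm's reward.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return pulls
```

With `epsilon=0` the agent never explores and can stay stuck on whichever arm it tried first; with a small positive `epsilon` it gives up a little reward now to find the better arm and then spends most of its pulls there.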

GPT-3 vs other AI powered assistants

Mon, 21 Jun 2021 00:00:00 +0000

I have been kicking the tires with OpenAI’s #GPT-3. Based on the screenshot below, it might be easy to think “oh boy does the model think highly of itself”, but as with most things in life - the devil is in the details.😃 The screenshot below is from a forked version of the davinci engine and follows the Q&A structure.

GPT-3 vs other AI assistants

Using OpenAI’s API is quite simple; perhaps too simple! It is quite easy to unleash the beast, as the code snippet below shows. If you are new to using GPT-3, I would highly recommend you start with the use case model guidelines first.

In the context of a toy example, getting to a simple Q&A chatbot like the one in the earlier screenshot is straightforward. The API is powerful and simple to use, and getting started is easy, as the code below shows.

import os
import openai

# Legacy (pre-1.0) openai SDK interface; the API key is read from the environment.
openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.Completion.create(
    engine="davinci",
    # Few-shot prompt: an instruction followed by example Q&A pairs.
    prompt="I am a highly intelligent question answering bot. If you ask me a question that is rooted in truth, I will give you the answer. If you ask me a question that is nonsense, trickery, or has no clear answer, I will respond with \"Unknown\".\n\nQ: What is human life expectancy in the United States?\nA: Human life expectancy in the United States is 78 years.\n\nQ: Who was president of the United States in 1955?\nA: Dwight D. Eisenhower was president of the United States in 1955.\n\nQ: Which party did he belong to?\nA: He belonged to the Republican Party.\n\nQ: What is the square root of banana?\nA: Unknown\n",
    temperature=0,          # always pick the most likely token
    max_tokens=100,
    top_p=1,
    frequency_penalty=0.0,
    presence_penalty=0.0,
    stop=["\n"],            # stop generating at the end of the answer line
)

There are three core concepts when using GPT-3: Prompt, Completion, and Tokens.

To start using the API, we need to give it some prompts - these provide context to the engine on what is expected. Without them, the surface area is too broad and we get into nonsensical situations. This is part of the task-specific tuning required.

When we give examples as part of the prompt, we are essentially “programming” the model - providing guidance and hints about context and the pattern to match. Note that the training data cuts off in late 2019, so the model in production today doesn’t have access to data and events after that (e.g., Covid).
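To make the idea of “programming” with examples concrete, here is a minimal sketch of assembling a few-shot Q&A prompt in the same layout as the prompt string used earlier. The helper name build_qa_prompt and the final question are hypothetical, purely for illustration; they are not part of the OpenAI API.

```python
def build_qa_prompt(instructions, examples, question):
    """Assemble a few-shot prompt: instructions, example Q/A pairs, then the new question."""
    parts = [instructions, ""]
    for q, a in examples:
        parts.append(f"Q: {q}")
        parts.append(f"A: {a}")
        parts.append("")          # blank line between examples
    parts.append(f"Q: {question}")
    parts.append("A:")            # leave the answer open for the model to complete
    return "\n".join(parts)


prompt = build_qa_prompt(
    "I am a highly intelligent question answering bot.",
    [
        ("Who was president of the United States in 1955?",
         "Dwight D. Eisenhower was president of the United States in 1955."),
        ("What is the square root of banana?", "Unknown"),
    ],
    "Which party did he belong to?",  # hypothetical new question
)
print(prompt)
```

The examples show the model both the answer format and the “Unknown” fallback, so the completion that follows the trailing "A:" tends to follow the same pattern.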

Completion is the output that GPT-3 generates based on the prompt. To be clear, this is not the full text but the predicted completion; think of it as “autocomplete” in Word, Outlook, or a search engine. The API has the flexibility to return more than one predicted completion, along with the probabilities of alternative tokens at each position (to me it seems just like the wave function in quantum mechanics 🐼).

Finally, think of tokens as the smaller Lego blocks that combine to make words. The API, which is nothing but a wrapper around GPT-3, breaks up the text into tokens before processing it. The GPT-3 model understands the statistical relationships between these tokens and uses this to produce the next token in a sequence of tokens.

For example, if we are curious about Tokens, we can see in the screenshot below how the API “tokenizes” this paragraph and get the details of the tokens. This paragraph contains 207 characters and 43 tokens.

GPT-3 Tokens - Text

GPT-3 Token - IDs

At a high level, think of one token == ~4 characters of text, which is ¾ of a word; so, 100 tokens ~= 75 words.
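As a rough sketch, the rule of thumb above can be turned into a quick estimator. This is only the heuristic, not the real tokenizer, and the function names are made up for this example.

```python
CHARS_PER_TOKEN = 4       # rule of thumb: one token is roughly 4 characters
WORDS_PER_TOKEN = 0.75    # rule of thumb: one token is roughly 3/4 of a word


def estimate_tokens(text: str) -> int:
    """Rough token count from character length (heuristic, not the real tokenizer)."""
    return round(len(text) / CHARS_PER_TOKEN)


def estimate_words(token_count: int) -> int:
    """Rough word count that a given number of tokens represents."""
    return round(token_count * WORDS_PER_TOKEN)


print(estimate_words(100))           # 75
print(estimate_tokens("x" * 207))    # 52
```

Note that for the 207-character paragraph above the heuristic gives about 52 tokens, while the actual tokenizer reported 43 - so treat this as a ballpark for budgeting max_tokens, not an exact count.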

This is just dipping our toes in the beast that is GPT-3; the APIs which wrap up and expose the engines (more on that in another post) make it simple to use and without getting too much in the weeds of 175 billion parameters. 😄

ML algorithm cheat sheet

Mon, 03 May 2021 00:00:00 +0000

A #ML algorithm cheat sheet - helping narrow down to a certain set of #algorithm grouping depending on the problem at hand and what we are trying to solve from a business perspective.

ML algorithm cheat sheet

Figure 2 shows what additional characteristics we need to consider when choosing the right ML algorithm for your situation at hand. This is something that cannot be generic and is very situational.

Characteristics in selecting ML algorithms

If you find this useful, I would also recommend reading “ How to select algorithms ” which is detailed as part of Azure ML designer .

bfloat16 - how it improves AI chip designs

Sat, 12 Sep 2020 00:00:00 +0000

Floating point calculations are slow for computers (specifically CPUs); possibly representing the same struggle for many humans. :)

I remember a time when a FPU (floating point unit) was an upgrade and one had to pay extra to get one. Very useful when you needed that extra precision in computing - and in my head, it always seemed like the Turbo button. :)

For most #ML workloads and computations, precision isn’t the most important criterion; with ever-increasing data and parameters (looking at you GPT-3 with 45 TB of data and 175 billion parameters!), what most ML needs today is speed and dynamic range.

This is where bfloat16 (Brain floating-point format with 16 bits), a newer floating-point format, comes in handy; in the context of #AI it improves on IEEE 754, the current floating-point arithmetic standard.

As per IEEE 754, a single-precision floating-point number always takes up 32 bits (see Figure 1 below) - irrespective of the size of the number. One bit holds the sign. The exponent (8 bits) tells us how many places to shift the binary point (left or right). The fraction (23 bits), also called the mantissa, holds the actual digits - i.e. the data.

Figure 1 - IEEE 754 Floating point representation

bfloat16 cuts the data size in half, to 16 bits (see Figure 2) - with the fraction truncated from 23 to 7 bits. This of course means bfloat16 isn’t as precise. However, because bfloat16 keeps the same 8 exponent bits as 32-bit IEEE 754, it can represent a similar range (small to large) and, more importantly, is easy to convert to and from IEEE 754.

Figure 2 - bfloat16 representation
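Because the sign and exponent bits line up, a bfloat16 value is simply the top 16 bits of the 32-bit float. The sketch below shows that conversion using only the Python standard library; it truncates the low bits, which is an assumption made for simplicity (real hardware typically rounds to nearest), and the function names are illustrative.

```python
import struct


def float32_to_bfloat16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for x by dropping the low 16 fraction bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))   # raw 32-bit IEEE 754 pattern
    return bits >> 16                                     # keep sign, 8 exponent, 7 fraction bits


def bfloat16_bits_to_float32(b: int) -> float:
    """Expand a 16-bit bfloat16 pattern back to a float by zero-filling the low bits."""
    (x,) = struct.unpack(">f", struct.pack(">I", (b & 0xFFFF) << 16))
    return x


print(hex(float32_to_bfloat16_bits(1.0)))                        # 0x3f80
approx = bfloat16_bits_to_float32(float32_to_bfloat16_bits(3.14159))
print(approx)                                                    # 3.140625
```

The round trip shows the trade-off: 3.14159 comes back as 3.140625, so only two or three decimal digits survive, while the exponent (and therefore the range) is untouched.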

Less precision doesn’t impact matrix multiplication as much, so in the context of ML training and inference these chips are more efficient at scale - not only are they faster, they also use less power and memory bandwidth.

What is interesting is that in some neural nets, such as a DNN, the less precise bfloat16 can give more accurate results than IEEE 754! This is because the regularization and quantization of weights cannot use the finer precision offered by IEEE 754, but adapt better with bfloat16. :)

Finally, bfloat16 is not a universal standard (yet), though most AI chips support it. ARM, Intel, and AMD have started adding support for it in their chipsets.

ML Algorithms

Thu, 13 Jun 2019 00:00:00 +0000

Sometimes one needs a quick snapshot of what are the options to think through and I really like this for that.

Machine Learning Algorithms

Machine Learning 101

Thu, 16 May 2019 00:00:00 +0000

Getting DonkeyCar working on a Mac

Tue, 12 Mar 2019 00:00:00 +0000

I have been playing with a #selfdriving car for a while , and that is super exciting. From a #AI and #ML perspective it is small scale but allows one to exploit all aspects of the tech stack and also appreciate the limitations of not only the software but also the hardware.

With this, you run a NN on a Raspberry Pi that uses TensorFlow and Keras to run inference on the edge. The Pi doesn’t have enough power to train, so you need to do that on a beefier machine and then deploy the model back to the Pi to run it.

Now, I didn’t have any issues in getting this running on Windows, but getting it on a Mac was a different story. The documentation is there that outlines all the steps, and even if you follow it to the T, it breaks right in the end.

The failure came when I tried to create a car using the createcar command (this essentially creates the folders where you save the training images, the model, and the configuration of the car when you connect to it from your machine). The actual file paths would probably be different for you, but essentially it is the same thing.

(donkey) AMAC02XN1T9JGH5:donkeycar amit.bahree$ donkey createcar ~/mycar
Traceback (most recent call last):
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/__init__.py", line 660, in _build_master
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/__init__.py", line 968, in require
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/__init__.py", line 859, in resolve
pkg_resources.ContextualVersionConflict: (imageio 2.4.1 (/anaconda3/envs/donkey/lib/python3.6/site-packages), Requirement.parse('imageio<3.0,>=2.5'), {'moviepy'})

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/anaconda3/envs/donkey/bin/donkey", line 6, in <module>
    from pkg_resources import load_entry_point
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/__init__.py", line 2985, in <module>
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/__init__.py", line 2971, in _call_aside
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/__init__.py", line 2998, in _initialize_master_working_set
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/__init__.py", line 662, in _build_master
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/__init__.py", line 675, in _build_from_requirements
  File "/anaconda3/envs/donkey/lib/python3.6/site-packages/setuptools-27.2.0-py3.6.egg/pkg_resources/__init__.py", line 854, in resolve
pkg_resources.DistributionNotFound: The 'imageio<3.0,>=2.5' distribution was not found and is required by moviepy

The key is to focus on the last line of each of those two tracebacks - the main thing causing the issue is MoviePy, which requires imageio >=2.5 and <3.0, while the installed version is 2.4.1.
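The conflict reported in the tracebacks can be reproduced by hand. The snippet below is a simplified sketch of the check (plain tuple comparison with made-up helper names, not how pkg_resources actually implements it) using the versions from the error message.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version such as '2.4.1' into a comparable tuple (2, 4, 1)."""
    return tuple(int(part) for part in version.split("."))


def satisfies(installed: str, minimum: str, below: str) -> bool:
    """Check minimum <= installed < below, i.e. a requirement like 'pkg<below,>=minimum'."""
    return parse_version(minimum) <= parse_version(installed) < parse_version(below)


# moviepy asks for imageio<3.0,>=2.5 but the environment has imageio 2.4.1
print(satisfies("2.4.1", "2.5", "3.0"))   # False
```

Since 2.4.1 falls below the minimum of 2.5, the requirement cannot be resolved and the donkey entry point fails before it even starts.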

MoviePy is a Python library for video editing: cutting, concatenations, title insertions, video compositing (a.k.a. non-linear editing), video processing, and creation of custom effects.

It seems like when you go through the steps - clone the repo, setup anaconda, install tensorflow and get the car configured - there is a mismatch in the MoviePy dependencies which it doesn’t like. The way to fix the issue is outlined below.

Skip MoviePy

MoviePy is something you don’t need right away, only later when trying to make a movie (using the makemovie command, which allows you to create a movie file from the images in a Tub); it is not essential. The easiest way around the problem is to remove (or, my suggestion, comment out) the moviepy dependency in the setup.py file.

This should be line 33 in the setup.py file that you will find in the same folder where you cloned the git repo. As an example, the updated file is below, with the moviepy dependency commented out. Once you save this and go about creating the car, it should work. Of course, you cannot use the makemovie option later.

from setuptools import setup, find_packages

import os

with open("README.md", "r") as fh:
    long_description = fh.read()

setup(name='donkeycar',
      version='2.5.7',
      description='Self driving library for python.',
      long_description=long_description,
      long_description_content_type="text/markdown",
      url='https://github.com/autorope/donkeycar',
      download_url='https://github.com/autorope/donkeycar/archive/2.1.5.tar.gz',
      author='Will Roscoe',
      author_email='wroscoe@gmail.com',
      license='MIT',
      entry_points={
          'console_scripts': [
              'donkey=donkeycar.management.base:execute_from_command_line',
          ],
      },
      install_requires=['numpy',
                        'pillow',
                        'docopt',
                        'tornado==4.5.3',
                        'requests',
                        'h5py',
                        'python-socketio',
                        'flask',
                        'eventlet',
                        #'moviepy',
                        'pandas',
                        ],

      extras_require={
                      'tf': ['tensorflow>=1.9.0'],
                      'tf_gpu': ['tensorflow-gpu>=1.9.0'],
                      'pi': [
                          'picamera',
                          'Adafruit_PCA9685',
                          ],
                      'dev': [
                          'pytest',
                          'pytest-cov',
                          'responses'
                          ],
                      'ci': ['codecov']
                  },

      include_package_data=True,

      classifiers=[
          # How mature is this project? Common values are
          #   3 - Alpha
          #   4 - Beta
          #   5 - Production/Stable
          'Development Status :: 3 - Alpha',

          # Indicate who your project is intended for
          'Intended Audience :: Developers',
          'Topic :: Scientific/Engineering :: Artificial Intelligence',

          # Pick your license as you wish (should match "license" above)
          'License :: OSI Approved :: MIT License',

          # Specify the Python versions you support here. In particular, ensure
          # that you indicate whether you support Python 2, Python 3 or both.

          'Programming Language :: Python :: 3.5',
          'Programming Language :: Python :: 3.6',
      ],
      keywords='selfdriving cars donkeycar diyrobocars',

      packages=find_packages(exclude=(['tests', 'docs', 'site', 'env'])),
      )

Once you have saved the setup.py file, you need to run the installation again with the following command and then run the create car command. Both of these are outlined below.

pip install -e .
donkey createcar ~/mycar

Once you run these, you should see the successful installation as shown by the output below. Note - your output might be a little different depending on the state of the packages in your conda environment.

(donkey) AMAC02XN1T9JGH5:donkeycar amit.bahree$ pip install -e .
Obtaining file:///Users/amit.bahree/CloudStation/Documents/Code/donkeycar
Requirement already satisfied: numpy in /anaconda3/envs/donkey/lib/python3.6/site-packages (from donkeycar==2.5.7) (1.14.5)
Requirement already satisfied: pillow in /anaconda3/envs/donkey/lib/python3.6/site-packages (from donkeycar==2.5.7) (4.2.1)
Requirement already satisfied: docopt in /anaconda3/envs/donkey/lib/python3.6/site-packages (from donkeycar==2.5.7) (0.6.2)
Collecting tornado==4.5.3 (from donkeycar==2.5.7)
Requirement already satisfied: requests in /anaconda3/envs/donkey/lib/python3.6/site-packages (from donkeycar==2.5.7) (2.18.4)
Requirement already satisfied: h5py in /anaconda3/envs/donkey/lib/python3.6/site-packages (from donkeycar==2.5.7) (2.7.1)
Collecting python-socketio (from donkeycar==2.5.7)
  Using cached https://files.pythonhosted.org/packages/a1/71/118e4b7fb453d7095d6863f4b783dbaa57109af4bc2380300649c8942d61/python_socketio-4.0.0-py2.py3-none-any.whl
Collecting flask (from donkeycar==2.5.7)
  Using cached https://files.pythonhosted.org/packages/7f/e7/08578774ed4536d3242b14dacb4696386634607af824ea997202cd0edb4b/Flask-1.0.2-py2.py3-none-any.whl
Collecting eventlet (from donkeycar==2.5.7)
  Using cached https://files.pythonhosted.org/packages/86/7e/96e1412f96eeb2f2eca9342dcc4d5bc9305880a448b603b0a8e54439b71c/eventlet-0.24.1-py2.py3-none-any.whl
Collecting pandas (from donkeycar==2.5.7)
  Using cached https://files.pythonhosted.org/packages/99/12/bf4c58eea94cea4f91ff931f284146337814fb8546e6eb0b52584446fd52/pandas-0.24.1-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Requirement already satisfied: olefile in /anaconda3/envs/donkey/lib/python3.6/site-packages (from pillow->donkeycar==2.5.7) (0.44)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /anaconda3/envs/donkey/lib/python3.6/site-packages (from requests->donkeycar==2.5.7) (3.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /anaconda3/envs/donkey/lib/python3.6/site-packages (from requests->donkeycar==2.5.7) (2017.7.27.1)
Requirement already satisfied: idna<2.7,>=2.5 in /anaconda3/envs/donkey/lib/python3.6/site-packages (from requests->donkeycar==2.5.7) (2.6)
Requirement already satisfied: urllib3<1.23,>=1.21.1 in /anaconda3/envs/donkey/lib/python3.6/site-packages (from requests->donkeycar==2.5.7) (1.22)
Requirement already satisfied: six in /anaconda3/envs/donkey/lib/python3.6/site-packages (from h5py->donkeycar==2.5.7) (1.10.0)
Collecting python-engineio>=3.2.0 (from python-socketio->donkeycar==2.5.7)
  Using cached https://files.pythonhosted.org/packages/95/91/d083bd7b5d408af53633377dfbf87bf181236c8916d36213388b12eaa999/python_engineio-3.4.3-py2.py3-none-any.whl
Collecting click>=5.1 (from flask->donkeycar==2.5.7)
  Using cached https://files.pythonhosted.org/packages/fa/37/45185cb5abbc30d7257104c434fe0b07e5a195a6847506c074527aa599ec/Click-7.0-py2.py3-none-any.whl
Collecting itsdangerous>=0.24 (from flask->donkeycar==2.5.7)
  Using cached https://files.pythonhosted.org/packages/76/ae/44b03b253d6fade317f32c24d100b3b35c2239807046a4c953c7b89fa49e/itsdangerous-1.1.0-py2.py3-none-any.whl
Collecting Werkzeug>=0.14 (from flask->donkeycar==2.5.7)
  Using cached https://files.pythonhosted.org/packages/20/c4/12e3e56473e52375aa29c4764e70d1b8f3efa6682bef8d0aae04fe335243/Werkzeug-0.14.1-py2.py3-none-any.whl
Collecting Jinja2>=2.10 (from flask->donkeycar==2.5.7)
  Using cached https://files.pythonhosted.org/packages/7f/ff/ae64bacdfc95f27a016a7bed8e8686763ba4d277a78ca76f32659220a731/Jinja2-2.10-py2.py3-none-any.whl
Collecting monotonic>=1.4 (from eventlet->donkeycar==2.5.7)
  Using cached https://files.pythonhosted.org/packages/ac/aa/063eca6a416f397bd99552c534c6d11d57f58f2e94c14780f3bbf818c4cf/monotonic-1.5-py2.py3-none-any.whl
Collecting greenlet>=0.3 (from eventlet->donkeycar==2.5.7)
Collecting dnspython>=1.15.0 (from eventlet->donkeycar==2.5.7)
  Using cached https://files.pythonhosted.org/packages/ec/d3/3aa0e7213ef72b8585747aa0e271a9523e713813b9a20177ebe1e939deb0/dnspython-1.16.0-py2.py3-none-any.whl
Collecting pytz>=2011k (from pandas->donkeycar==2.5.7)
  Using cached https://files.pythonhosted.org/packages/61/28/1d3920e4d1d50b19bc5d24398a7cd85cc7b9a75a490570d5a30c57622d34/pytz-2018.9-py2.py3-none-any.whl
Collecting python-dateutil>=2.5.0 (from pandas->donkeycar==2.5.7)
  Using cached https://files.pythonhosted.org/packages/41/17/c62faccbfbd163c7f57f3844689e3a78bae1f403648a6afb1d0866d87fbb/python_dateutil-2.8.0-py2.py3-none-any.whl
Collecting MarkupSafe>=0.23 (from Jinja2>=2.10->flask->donkeycar==2.5.7)
  Using cached https://files.pythonhosted.org/packages/f0/00/a6aea33f5598b080b86d6b6d1214b51afe3ffa6100b902d5aa465080083f/MarkupSafe-1.1.1-cp36-cp36m-macosx_10_6_intel.whl
Installing collected packages: tornado, python-engineio, python-socketio, click, itsdangerous, Werkzeug, MarkupSafe, Jinja2, flask, monotonic, greenlet, dnspython, eventlet, pytz, python-dateutil, pandas, donkeycar
  Found existing installation: tornado 4.5.1
    Uninstalling tornado-4.5.1:
      Successfully uninstalled tornado-4.5.1
  Found existing installation: Werkzeug 0.12.2
    Uninstalling Werkzeug-0.12.2:
      Successfully uninstalled Werkzeug-0.12.2
  Running setup.py develop for donkeycar
Successfully installed Jinja2-2.10 MarkupSafe-1.1.1 Werkzeug-0.14.1 click-7.0 dnspython-1.16.0 donkeycar eventlet-0.24.1 flask-1.0.2 greenlet-0.4.15 itsdangerous-1.1.0 monotonic-1.5 pandas-0.24.1 python-dateutil-2.8.0 python-engineio-3.4.3 python-socketio-4.0.0 pytz-2018.9 tornado-4.5.3

And when I ran createcar, you can see it worked as expected - in my case creating the ‘mycar’ folder in my home directory. Of course, you can put this wherever you prefer.

(donkey) AMAC02XN1T9JGH5:donkeycar amit.bahree$ donkey createcar ~/mycar
using donkey version: 2.5.7 ...
Creating car folder: /Users/amit.bahree/mycar
making dir  /Users/amit.bahree/mycar
Creating data & model folders.
making dir  /Users/amit.bahree/mycar/models
making dir  /Users/amit.bahree/mycar/data
making dir  /Users/amit.bahree/mycar/logs
Copying car application template: donkey2
Copying car config defaults. Adjust these before starting your car.
Donkey setup complete.

It is interesting to see this is more stable on Windows than on a Mac. Also, one last thing to leave you with - when I first ran the installation, the hint that something was wrong was in the output, but I didn’t pay too much attention to it. See the red line highlighted in the output below.

moviepy failure – donkeycar installation

I don’t know at this time what the solution for moviepy is - luckily it’s not a big deal at the moment.

Azure Cognitive Services in containers is the smart way to go

Sun, 13 Jan 2019 00:00:00 +0000

{Cross posted from my post on Avanade }

Containers just got smarter.
That’s the news from Microsoft, which announced recently that Azure Cognitive Services now supports containers . The marriage of AI and containers is a technology story, of course, but it’s a potentially even bigger business story, one that affects where and how you can do business and gain competitive advantage.

First, the technology story

Containers aren’t new, of course. They’re an increasingly popular technology with a big impact on business. That’s because they boost the agility and flexibility with which a business can roll out new tools to employees and new products and services to customers.

With containers, a business can get software releases and changes out faster and more frequently, increasing its competitive advantage. Because containers abstract applications from their underlying operating systems and other services—like virtual machines abstracted from hardware—those applications can run anywhere: in the cloud, on a laptop, in a kiosk or in an intelligent Internet-of-Things (IoT) edge device in the field.

In many respects this frees up the application’s developer, who can focus on creating the best, most useful software for the business. With Microsoft’s announcement, that software can now more easily include object detection, vision recognition, text and language understanding.

At Avanade, we take containers a step further by including support for them in our modern engineering platform, a key part of our overall approach to intelligent IT . So, you can automate your creation and management of containers—including AI-enabled containers—for a faster, easier, more seamless DevOps process. You can take greater advantage of IoT capabilities and move technologies such as AI closer to the edge, where they can reduce latency and boost performance.

What AI containers do for business
And you can do much more, which is where the business story gets interesting. With the greater agility and adaptability that comes with container-based AI services, you can respond more quickly to new competition, regulatory environments and business models. That contrasts with the more limited responses that have been possible with traditional, cloud-based AI.

For example, data sovereignty laws and GDPR requirements generally restrict the transfer of data to the cloud, where cloud-based cognitive services can interact with it. Now, with containers that support cognitive services, you can avoid those restrictions by running your services locally.

A retail bank might use containerized AI to identify customers, address their needs, process payments and offer additional services, boosting customer satisfaction and bank revenue—all without sending private financial data outside the region (or even outside the bank) in accordance with GDPR.

Similarly, regional medical centers and clinics subject to HIPAA privacy laws in the US can process protected information on site with containerized AI to cut patient wait times and deliver better health outcomes.

Or, think about limited-connectivity or disconnected environments—such as manufacturing shop floors, remote customer sites or oil rigs or tankers—that can’t count on accessing AI that resides in the always-on cloud. Previously, these sites might have had to batch their data to process it during narrow periods of cloud connectivity, with the delays greatly limiting the timeliness and usefulness of AI.

Now, these sites can combine IoT and AI to anticipate and respond to manufacturing disruptions before they occur, increasing safety, productivity and product quality while reducing errors and costs.

If you can’t bring your data to your AI, now you can bring your AI to your data. That’s the message of container-hosted AI and the modern engineering platform. Together, they optimize your ability to bring AI into environments where you can’t count on the cloud. Using AI where you couldn’t before makes innovative solutions possible—and innovative solutions deliver competitive advantage.

Boost ROI and scale
If you’re already using Azure Cognitive Services, you’ve invested time and money to train the models that support your use cases. Because those models are now portable, you can take advantage of them in regulated, limited-connectivity and disconnected environments, increasing your return on that investment.

You can also scale your use of AI with a combination of cloud- and container-based architectures. That enables you to apply the most appropriate architectural form for any given environment or use. At the same time, you’re deploying consistent AI technology across the enterprise, increasing reliability while decreasing your operating cost.

Keep in mind…

Here are three things to keep in mind as you think about taking advantage of this important news:

Break the barriers between your data scientists and business creatives. Containerized cognitive services is about far more than putting AI where you couldn’t before. It’s about using it in exciting new ways to advance the business. Unless you have heterogeneous teams bringing diverse perspectives to the table, you may miss some of the most important innovation possibilities for your business.
You need a cloud strategy that’s not just about the cloud. If you don’t yet have a cloud strategy, you’re behind the curve. But if your cloud strategy is limited to the cloud, you may be about to fall behind the next curve. Microsoft’s announcement is further proof that the cloud is crucial to the enterprise—and also part of a larger environment, including both legacy and edge platforms, with which it must integrate.
Be prepared for the ethics issues. Putting cognitive services in places you couldn’t before could raise new ethics issues. After all, we’re talking about the ability to read people’s expressions and even their emotions. This shouldn’t put you off—but it should put you on alert. Plug your ethics committee into these discussions when appropriate. If you don’t already have an ethics committee, create one. But that’s another post. :)

Want to learn more?

Microsoft’s announcement furthers the democratization of AI: the use of AI in more places and in more ways throughout the enterprise and beyond. Whether you turn to us for your AI solutions or look to us to assist you in developing your own, we’re ready to help with the greatest concentration of Microsoft expertise outside of Microsoft itself.

Roots of #AI

Mon, 12 Nov 2018 00:00:00 +0000

The naming is unfortunate when talking about #AI. There isn’t anything about intelligence - not as we humans know it. If we could rewind to the 1950s, we could perhaps rename it to something like Computational Intelligence, which is more accurate. And although I have outlined the difference between some of the elements of AI in the past, I wanted to get back to what the intent was and how this area started.

Can machines think? Some say the origins of #AI go back to Turing and started with his paper “ Computing machinery and intelligence (PDF) ”, published in 1950. Whilst Turing might have planted the seed, it was a program called Logic Theorist, created by Allen Newell, Cliff Shaw, and Herbert Simon, which was the first #ArtificialIntelligence program. Of course it wasn’t called #AI then.

That started back in 1956, when Logic Theorist was presented at a conference at Dartmouth College called “ Dartmouth Summer Research Project on Artificial Intelligence (DSRPAI) (PDF) ”. The term “#AI” was coined at the conference.

Since then, AI has had a roller coaster of a ride over the decades - from colder than hell (I presume) winters, to hotter than lava with it being everywhere. As someone said, time will heal all wounds.

AI Timeline

Today, many of us use #AI, #DeepLearning, and #MachineLearning interchangeably. Over the course of the last couple of years, I have learned to ignore that, but fundamentally the distinction is important.

AI, we would say is more computational intelligence - allowing computers to do tasks that would be difficult for humans to do, certainly at scale. And these tasks are accomplished using different mechanisms and techniques, using “intelligent agents”.

Machine learning is a subset of AI, where the program or algorithm can learn from previous outputs and improve based on that data - hence the “learning” part. It is akin to learning from experience, but it isn’t the same as the way we humans comprehend and understand. Some of us think the program is rewriting itself, which technically isn’t an accurate description.

Deep Learning is a set of machine learning techniques and algorithms inspired by how the neurons in our brain connect and work together. This set of techniques is also called Neural Networks, and essentially is nothing but a type of machine learning.

For any of this AI “magic” to work, the one thing it needs to feed on is data. Without data, none of this would be possible. This data is classified into two categories - features and labels.

Features - these are aspects of whatever we are interested in. For example, if we are interested in vehicles, features could be the colour, make, and model of the vehicle.
Labels - these are the buckets or categories we put the things we are interested in into. Using the same vehicle example, we can have labels such as SUV, Sedan, Sports Car, Truck, etc. that categorize vehicles.
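In code, the vehicle example might look like the sketch below: each record carries its features plus one label, and we split them apart before handing them to a learning algorithm. The specific vehicles are illustrative sample data, not from a real dataset.

```python
# Illustrative sample records: features (colour, make, model) plus a label.
vehicles = [
    {"colour": "red", "make": "Ford", "model": "Explorer", "label": "SUV"},
    {"colour": "blue", "make": "Honda", "model": "Accord", "label": "Sedan"},
    {"colour": "silver", "make": "Porsche", "model": "911", "label": "Sports Car"},
]

# Features are what we know about each vehicle ...
features = [{k: v for k, v in row.items() if k != "label"} for row in vehicles]
# ... and labels are the categories we want the model to predict.
labels = [row["label"] for row in vehicles]

print(features[0])   # {'colour': 'red', 'make': 'Ford', 'model': 'Explorer'}
print(labels)        # ['SUV', 'Sedan', 'Sports Car']
```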

One key principle to remember when it comes to #AI - all the outcomes are described in terms of probabilities and not absolutes. All it suggests is the likelihood of something happening, and most things cannot be predicted with total certainty. One should remember this fundamental aspect when making decisions.
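As a small illustration of what “probabilities, not absolutes” means in practice, many classifiers turn raw scores into probabilities with a softmax. The scores and category names below are made up for the example; softmax is one common choice, not the only one.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)                               # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


categories = ["SUV", "Sedan", "Sports Car"]         # hypothetical labels
scores = [2.0, 1.0, 0.1]                            # hypothetical raw model scores
probabilities = softmax(scores)

for name, p in zip(categories, probabilities):
    print(f"{name}: {p:.2f}")
```

The model never says “this is an SUV”; it says an SUV is the most likely of the three, and how much to trust that depends on how far ahead of the alternatives it is.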

There isn’t a universal definition of AI, which sometimes doesn’t help. Everyone has their own perception. I have gotten over it; I come to terms with their definition and ensure we are talking the same lingo and meaning. It doesn’t help to get academic about it. :)

For example, the definitions of AI from three leading analysts (Gartner, IDC, and Forrester), outlined below, are a good indicator of how this can get confusing.

Gartner - At its core, AI is about solving business problems in novel ways. It stretches across any organization from innovation, R&D and IT to data science.
IDC defines cognitive/Artificial Intelligence (AI) systems as a set of technologies that use deep natural language processing and understanding to answer questions and provide recommendations and direction. IDC’s coverage of cognitive/AI systems examines:
- Digital assistants
- Automated advisors
- Artificial intelligence, deep learning and machine learning
- Automated recommendation systems
Forrester defines AI as a liberatory technology at its core, and businesses that integrate it will free workers to become more innovative, creative, and adaptive than ever before. But these technologies are still in early stages.

And the field is just exploding now - not just with new research around #DeepLearning or #MachineLearning, but also net new aspects from a business perspective; things like:

Digital Ethics
Conversational AI
Democratization of AI
Data Engineering (OK, not new, but certainly key)
Model Management
RPA (or #IntelligentAutomation)
AI Strategy

It is a new and exciting world that spans multiple spectrums. Don’t try to drink from the fire-hose; take it in slowly, appreciate the nuances and the value each one brings, and discuss in terms of outcomes.

#ML concepts - Regularization, a primer

Sat, 29 Sep 2018 00:00:00 +0000

Regularization is a fundamental concept in Machine Learning (#ML) and is generally used with activation functions. It is the key technique that helps with overfitting.

Overfitting is when an algorithm or model ‘fits’ the training data too well - it seems too good to be true. Essentially, overfitting is when a model being trained learns the noise in the data instead of ignoring it. If we allow overfitting, the network only uses (or is more heavily influenced by) a subset of the input (the larger peaks), and doesn’t factor in all the input.

The worry is that outside of the training data, it might not work as well on ‘real world’ data. For example, the model represented by the green line in the image below (credit: Wikipedia) follows the sample data too closely and seems too good. The model represented by the black line, on the other hand, is better.

Overfitting
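A small experiment can show the same effect without the image. This is a minimal sketch, assuming a made-up dataset (a straight line plus noise) and using polynomial degree as a stand-in for model complexity; none of it comes from the Wikipedia figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(x):
    """Hypothetical ground truth: a straight line plus Gaussian noise."""
    return 2.0 * x + rng.normal(scale=0.2, size=x.shape)

x_train = np.linspace(0.0, 1.0, 10)
y_train = make_data(x_train)
x_test = np.linspace(0.05, 0.95, 50)
y_test = make_data(x_test)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)    # follows the trend
flexible = np.polyfit(x_train, y_train, deg=9)  # enough freedom to chase the noise

train_simple = mse(simple, x_train, y_train)
train_flexible = mse(flexible, x_train, y_train)
test_simple = mse(simple, x_test, y_test)
test_flexible = mse(flexible, x_test, y_test)

print("train error:", train_simple, train_flexible)
print("test error: ", test_simple, test_flexible)
```

The degree-9 fit passes through every training point, so its training error is close to zero; on unseen points it typically does worse than the straight line, which is the "too good to be true" symptom described above.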

Regularization helps with overfitting by (artificially) penalizing the weights in the neural network. These weights are represented as peaks, and the penalty reduces those peaks. This ensures that the higher weights (peaks) don’t overshadow the rest of the data and cause the network to overfit. This diffusion of the weight vectors is sometimes also called weight decay.

Although there are a few regularization techniques for preventing overfitting (outlined below), these days in Deep Learning, L1 and L2 regularization techniques are favored over the others.

Cross validation: This is a method for finding the best hyperparameters for a model. E.g. in gradient descent, this would be used to figure out the stopping criteria. There are various ways to do this, such as the holdout method, k-fold cross validation, leave-one-out cross validation, etc.
Step-wise regression: This method is essentially a serial, step-by-step regression in which one removes the weakest variable at each step. It runs multiple regression a number of times, each time removing the most weakly correlated variable. At the end you are left with the variables that explain the distribution best. The only requirements are that the data is normally distributed, and that there is no correlation between the independent variables.
L1 regularization: In this method, we modify the cost function by adding the sum of the absolute values of the weights as a penalty. In L1 regularization the weights shrink by a constant amount towards zero. L1 regularization is also called Lasso regression.
L2 regularization: In L2 regularization, on the other hand, we re-scale each weight by a factor - it shrinks by an amount that is proportional to the weight (as outlined in the image below). This shrinking makes the weight smaller and is also sometimes called weight decay. To make the shrinking proportional, the penalty uses the sum of the squares of the weights instead of the sum of their absolute values. At face value it might seem that the weights eventually get to zero, but that is not true; typically other terms cause the weights to increase. L2 regularization is also called Ridge regression.
Max-norm: This enforces an upper bound on the magnitude of the weight vector. The one area where this helps is that a network cannot ‘explode’ when the learning rate gets very high, as the weights are bounded. It is typically enforced using projected gradient descent.
Dropout: This is very simple and efficient, and is used in conjunction with one of the previous techniques. Essentially, it assigns each neuron a probability of being kept active or being ‘dropped out’ by setting it to zero. Dropout doesn’t modify the cost function; it modifies the network itself, as shown in the image below.
Increase training data: Whilst getting more training data is theoretically possible, in reality it won’t work in most cases, especially for more complex networks. One might also think of artificially expanding the dataset, but typically it is not cost effective to get a representative dataset.
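Three of the techniques above are easy to sketch in a few lines of NumPy. This is only an illustration under assumptions: the weights, the `data_loss` placeholder, and the strength `lam` are invented, and the dropout shown is the common "inverted" variant that rescales the surviving activations.

```python
import numpy as np

def l1_penalty(weights, lam):
    """L1 (Lasso) term: lam * sum of absolute weights."""
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    """L2 (Ridge) term: lam * sum of squared weights."""
    return lam * np.sum(weights ** 2)

def dropout(activations, keep_prob, rng):
    """Inverted dropout: zero each activation with probability 1 - keep_prob,
    then rescale the survivors so the expected value is unchanged."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

weights = np.array([0.5, -2.0, 0.0, 1.5])
data_loss = 0.30  # placeholder for the unregularized cost
lam = 0.01        # hypothetical regularization strength

cost_l1 = data_loss + l1_penalty(weights, lam)
cost_l2 = data_loss + l2_penalty(weights, lam)

rng = np.random.default_rng(0)
dropped = dropout(np.ones(8), keep_prob=0.5, rng=rng)
print(cost_l1, cost_l2, dropped)
```

Note how L1 and L2 change the cost that is being minimized, while dropout leaves the cost alone and changes which neurons take part in a given training step.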

L1 Regularization

L2 Regularization

Dropout

Between L1 and L2 regularization, many say that L2 is preferred, but I think it depends on the problem statement. Say a weight in a network has a large magnitude: L2 regularization shrinks the weight more than L1 and will be better. Conversely, if the weight is small, then L1 shrinks the weight more than L2 - and is better, as it tends to concentrate the weight in fewer but more important connections in the network.

In closing, the key aspect to appreciate - the small weights (peaks) in a regularized network essentially mean that as our input changes randomly (i.e. noise), it doesn’t have a huge impact on the network and its output. This makes it difficult for the network to learn the noise and respond to it. Conversely, in an unregularized network, which has higher weights (peaks), small random changes to those weights can have a larger impact on the behavior of the network and the information it carries.
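The large-versus-small weight comparison can be checked with a rough sketch. It assumes a single gradient step on the penalty alone, with the L2 penalty written as (lam / 2) * w² so that its gradient is lam * w; the learning rate and `lam` values are arbitrary.

```python
import numpy as np

def l1_shrink(w, lr, lam):
    """Amount removed by one gradient step on the penalty lam * |w|."""
    return lr * lam * np.sign(w)

def l2_shrink(w, lr, lam):
    """Amount removed by one gradient step on the penalty (lam / 2) * w**2."""
    return lr * lam * w

lr, lam = 0.1, 0.1   # hypothetical learning rate and regularization strength
large, small = 5.0, 0.05

shrink = {
    "large": (l1_shrink(large, lr, lam), l2_shrink(large, lr, lam)),
    "small": (l1_shrink(small, lr, lam), l2_shrink(small, lr, lam)),
}
for name, (by_l1, by_l2) in shrink.items():
    print(f"{name} weight: L1 removes {by_l1:.4f}, L2 removes {by_l2:.4f}")
```

L1 removes the same fixed amount regardless of the weight's size, whereas the amount L2 removes scales with the weight, which is why L2 bites harder on large weights and L1 on small ones.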

Neural Network - Cheat Sheet

Tue, 11 Sep 2018 00:00:00 +0000

Neural Networks today help with a great set of tasks that until very recently weren’t possible at all - from computer vision, to medical diagnosis, to speech translation - and form a key cornerstone of a lot of the ‘magic’ that Machine Learning and AI offer today.

I did blog about Neural Network types (and MarI/O) some time back; I surely cannot take credit for creating these three cheat sheets, but they are awesome and I hope you get to use and enjoy them too.

The merits of #AI

Mon, 02 Jul 2018 00:00:00 +0000

Thought of the week:

Artificial Intelligence stands no chance against natural Stupidity.

#ArtificalIntelligence

#ML training data

Fri, 15 Jun 2018 00:00:00 +0000

Seems like my training data for the car - perhaps a hint of #bias. 😂

#GeekyJokes #ML #AIJokes

Neural network basics & Activation functions

Tue, 12 Jun 2018 00:00:00 +0000

Neural networks have a very interesting aspect – they can be viewed as a simple mathematical model that defines a function. For a given function $f(x)$, which can take any input value of $x$, there will be some kind of neural network that approximates that function. This hypothesis was proven almost 30 years ago (“Approximation by Superpositions of a Sigmoidal Function” and “Multilayer feedforward networks are universal approximators”, both from 1989) and forms the basis of much of what #AI and #ML use cases make possible.

It is this aspect of neural networks that allows us to map any process and generate a corresponding function. Unlike a function in Computer Science, this function isn’t deterministic; instead it is a confidence score of an approximation (i.e. a probability). Generally, the more layers in a neural network, the better this approximation can be.

In a neural network, typically there is one input layer, one output layer, and one or more layers in the middle. To the external system, only the input layer (values of $x$) and the final output (output of the function $f(x)$) are visible; the layers in the middle are not, and are essentially hidden.

Each layer contains nodes, which are modeled after how the neurons in the brain work. The output of each node gets propagated along to the next layer. This output is the defining character of the node, and activates the node to pass on its value to the next node; this is very similar to how a neuron in the brain fires and passes on the signal to the next neuron.

Neural Network
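The layer-by-layer propagation described above can be sketched as a forward pass. This is a minimal illustration, not a trained model: the layer sizes (3 inputs, 4 hidden nodes, 1 output), the random weights, and the choice of a sigmoid at every node are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Propagate x through a list of (weights, bias) pairs, one per layer."""
    activation = x
    for W, b in layers:
        # Each node sums its weighted inputs, adds a bias, and 'fires' via sigmoid.
        activation = sigmoid(W @ activation + b)
    return activation

rng = np.random.default_rng(42)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4)),  # input (3 values) -> hidden (4 nodes)
    (rng.normal(size=(1, 4)), np.zeros(1)),  # hidden (4 nodes) -> output (1 node)
]

y = forward(np.array([0.2, -0.1, 0.7]), layers)
print(y)  # only this final value is visible to the caller; the hidden layer is not
```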

For this generalization of the function $f(x)$ outlined above to hold, the function needs to be a continuous function. A continuous function is one where small changes to the input value $x$ create small changes to the output of $f(x)$. If the changes to the output are not small and the value jumps a lot, then the function is not continuous and it is difficult for a neural network to achieve the required approximation.

For a neural network to ‘learn’ – the network essentially has to try different weights and biases that produce a corresponding change to the output, hopefully moving it closer to the result we desire. Ideally, small changes to these weights and biases correspond to small changes in the output of the function. But one isn’t sure, until we train and test the result, that small changes don’t cause bigger shifts that drastically move away from the desired result. It isn’t uncommon to see that one aspect of the result has improved while others have not, skewing the overall results.

In simple terms, an activation function is a node that is attached to the output of a neural network and maps the resulting value into a range, typically a bounded one such as 0 to 1. It is also used to connect two neural networks.

An activation function can be linear or non-linear. A linear function isn’t effective, as its range is infinite. A non-linear function with a finite range is more useful, as it can be mapped as a curve, and changes along this curve can be used to calculate the difference between two points.

There are many types of activation functions, each with its strengths. In this post, we discuss the following six:

Sigmoid
Tanh
ReLU
Leaky ReLU
ELU
Maxout

1. Sigmoid function

A sigmoid function can map any input value into a probability – i.e., a value between 0 and 1. A sigmoid function is typically shown using a sigma ($\sigma$). Some also call $\sigma$ a logistic function. For any given input value $x$, the official definition of the sigmoid function is as follows:

$$\sigma(x) \equiv \frac{1}{1+e^{-x}}$$

If our inputs are $x_1, x_2,\ldots$, and their corresponding weights are $w_1, w_2,\ldots$, and a bias b, then the previous sigmoid definition is updated as follows:

$$\frac{1}{1+\exp(-\sum_j w_j x_j-b)}$$

When plotted, the sigmoid function looks like the curve below. When we use this in a neural network, we essentially end up with a smoothed-out function, unlike a binary function (also called a step function) – that is either 0 or 1.

For a given function $f(x)$, as $x \rightarrow \infty$, $f(x)$ tends towards 1. And as $x \rightarrow -\infty$, $f(x)$ tends towards 0.

Sigmoid function

And this smoothness of $\sigma$ is what will create the small changes in the output that we desire - where small changes to the weights ($\Delta w_j$), and small changes to the bias ($\Delta b$) will produce small changes to the output ($\Delta output$).

Fundamentally, changing these weights and biases, is what can give us either a step function or small changes. We can show this as follows:

$$\Delta output \approx \sum_j \frac{\partial \, output}{\partial w_j} \Delta w_j + \frac{\partial \, output}{\partial b} \Delta b$$
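The approximation can be checked numerically for a single sigmoid neuron. This is a sketch with invented inputs, weights, and nudges; it relies on the standard sigmoid derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which is not stated in the post itself.

```python
import numpy as np

def neuron_output(w, x, b):
    """Sigmoid neuron: 1 / (1 + exp(-(w . x) - b))."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

x = np.array([0.5, -1.0, 2.0])   # hypothetical inputs
w = np.array([0.1, 0.4, -0.3])   # hypothetical weights
b = 0.2

out = neuron_output(w, x, b)

# Partial derivatives for a sigmoid neuron:
# d(out)/dw_j = out * (1 - out) * x_j   and   d(out)/db = out * (1 - out)
grad_w = out * (1.0 - out) * x
grad_b = out * (1.0 - out)

dw = np.array([1e-3, -2e-3, 5e-4])  # small nudges to the weights
db = 1e-3                           # small nudge to the bias

predicted = np.dot(grad_w, dw) + grad_b * db          # right-hand side of the formula
actual = neuron_output(w + dw, x, b + db) - out       # true change in output
print(predicted, actual)
```

The two numbers agree closely because the nudges are small; with a step function in place of the sigmoid, the same nudges would either change nothing or flip the output entirely.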

One thing to be aware of is that the sigmoid function suffers from the vanishing gradient problem – the convergence between the various layers is very slow after a certain point – the neurons in previous layers don’t learn fast enough and are much slower than the neurons in later layers. Because of this, generally, a sigmoid is avoided.

2. Tanh (hyperbolic tangent function)

Tanh is a variant of the sigmoid function, but still quite similar – it is a rescaled version and ranges from –1 to 1, instead of 0 to 1. As a result, its optimization is easier and it is preferred over the sigmoid function. The formula for tanh is:

$$\tanh(x) \equiv \frac{e^x-e^{-x}}{e^x+e^{-x}}$$

Using this, we can show that:

$$\sigma(x) = \frac{1 + \tanh(x/2)}{2}$$
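A quick numerical check of the relationship between the two functions, as a sketch (the sample points are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)

lhs = sigmoid(x)
rhs = (1.0 + np.tanh(x / 2.0)) / 2.0   # sigmoid written in terms of tanh

identity_holds = bool(np.allclose(lhs, rhs))
print(identity_holds)
print(np.tanh(x).min(), np.tanh(x).max())  # tanh stays strictly inside (-1, 1)
```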

Sigmoid vs Tanh

Tanh also suffers from the vanishing gradient problem. Both Tanh and Sigmoid are used in FNNs (feedforward neural networks) – i.e. the information always moves forward through the network and there aren’t any feedback loops.

FNN

3. Rectified Linear Unit (ReLU)

A rectified linear unit (ReLU) is the most popular activation function that is used these days.

$$\sigma(x) = \begin{cases} x & x > 0\\ 0 & x \leq 0 \end{cases}$$

ReLU’s are quite popular for a couple of reasons – one, from a computational perspective, these are more efficient and simpler to execute - there aren’t any exponential operations to perform. And two, these don’t suffer from the vanishing gradient problem.

ReLU

The one limitation ReLU’s have, is that their output isn’t in the probability space (i.e. can be >1), and can’t be used in the output layer.

As a result, when we use ReLU’s, we have to use a softmax function in the output layer. The output of a softmax function sums up to 1, and we can map the output as a probability distribution.

$$\sum_j a^L_j = \frac{\sum_j e^{z^L_j}}{\sum_k e^{z^L_k}} = 1.$$
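Here is a minimal sketch of the pairing described above: ReLU for the hidden values and softmax on the output. The input vector is made up, and subtracting the maximum inside `softmax` is a common numerical-stability trick rather than part of the formula.

```python
import numpy as np

def relu(x):
    """max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def softmax(z):
    """Exponentiate and normalize so the outputs sum to 1."""
    shifted = z - np.max(z)  # does not change the result, avoids overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)

hidden = relu(np.array([-1.5, 0.0, 2.0, 3.5]))  # negatives clipped, positives can exceed 1
probs = softmax(hidden)                         # now a probability distribution

print(hidden)
print(probs, probs.sum())
```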

Another issue that can affect ReLU’s is something called a dead neuron problem (also called a dying ReLU). This can happen when in the training dataset, some features have a negative value. When the ReLU is applied, those negative values become zero (as per the definition). If this happens at a large enough scale, the gradient will always be zero – and that node is never adjusted again (its bias and weights never get changed) - essentially making it dead! The solution? Use a variation of the ReLU called a Leaky ReLU.

4. Leaky ReLU

A Leaky ReLU will usually allow a small slope $\alpha$ on the negative side; i.e. a negative value isn’t changed to zero, but is instead scaled by something like 0.01. You can probably see the ‘leak’ in the image below. This ‘leak’ helps increase the range and we never get into the dying ReLU issue.

image
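As a sketch, a Leaky ReLU and its gradient look like this (using the 0.01 slope mentioned above as the default $\alpha$):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """x for positive inputs, alpha * x otherwise."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    """Slope is 1 on the positive side and alpha (never zero) elsewhere."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.0, 2.0])
y = leaky_relu(x)
dy = leaky_relu_grad(x)
print(y)
print(dy)
```

Because the gradient is never exactly zero, a node that lands on the negative side can still be adjusted during training.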

5. Exponential Linear Unit (ELU)

Sometimes a ReLU isn’t fast enough – over time, a ReLU’s mean output isn’t zero and this positive mean can add a bias for the next layer in the neural network; all this bias adds up and can slow the learning.

Exponential Linear Unit (ELU) can address this, by using an exponential function, which ensures that the mean activation is closer to zero. What this means, is that for a positive value, an ELU acts more like a ReLU and for the negative value it is bounded to -1 for $\alpha = 1$ – which puts the mean activation closer to zero.

$$\sigma(x) = \begin{cases} x & x \geqslant 0\\ \alpha (e^x - 1) & x < 0\end{cases}$$

When learning, the derivative (the slope) is what is fed back (backprop) – so for this to be efficient, both the function and its derivative need to have a low computation cost.
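A sketch of the ELU and its derivative, following the definition above. Clamping the argument of `exp` with `np.minimum` is my own addition to avoid overflow warnings on large positive inputs; it does not change the result.

```python
import numpy as np

def elu(x, alpha=1.0):
    """x for x >= 0, alpha * (exp(x) - 1) for x < 0."""
    return np.where(x >= 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def elu_grad(x, alpha=1.0):
    """1 for x >= 0, alpha * exp(x) for x < 0."""
    return np.where(x >= 0, 1.0, alpha * np.exp(np.minimum(x, 0.0)))

x = np.array([-100.0, -1.0, 0.0, 2.0])
y = elu(x)
dy = elu_grad(x)
print(y)   # negative side saturates towards -alpha
print(dy)
```

Note that the negative branch reuses the same exponential in both the function and its derivative, which is what keeps the backprop cost low.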

ELU vs ReLU

And finally, there is another variation called a Maxout function, which generalizes both the ReLU and the Leaky ReLU.
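The post doesn't give a formula for Maxout, so treat this as a sketch of the usual definition: take the maximum over several learned linear pieces. The weights below are hand-picked (not learned) to show how ReLU and Leaky ReLU fall out as special cases.

```python
import numpy as np

def maxout(x, W, b):
    """W has shape (k, n_in) and b has shape (k,): k linear pieces, keep the largest."""
    return np.max(W @ x + b, axis=0)

zero_bias = np.zeros(2)
relu_like = np.array([[1.0], [0.0]])    # max(x, 0)        -> behaves like ReLU
leaky_like = np.array([[1.0], [0.01]])  # max(x, 0.01 * x) -> behaves like Leaky ReLU

neg, pos = np.array([-2.0]), np.array([3.0])
print(maxout(neg, relu_like, zero_bias), maxout(pos, relu_like, zero_bias))
print(maxout(neg, leaky_like, zero_bias), maxout(pos, leaky_like, zero_bias))
```

The trade-off is that each Maxout unit carries k sets of weights instead of one, so the parameter count grows accordingly.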

So, how do I pick one?

Choosing the ‘right’ activation function would of course depend on the data and problem at hand. My suggestion is to default to a ReLU as a starting step and remember ReLU’s are applied to hidden layers only. Use a simple dataset and see how that performs. If you see dead neurons, then use a Leaky ReLU or Maxout instead. It won’t make sense to use Sigmoid or Tanh these days for deep learning models, but they are still useful for classifiers.

In summary, activation functions are a key aspect that fundamentally influences a neural network’s behavior and output. Having an appreciation and understanding of some of the functions is key to any successful ML implementation.

Netron - deep learning and machine learning model visualizer

Mon, 11 Jun 2018 00:00:00 +0000

I was looking at something else and happened to stumble across something called Netron, which is a model visualizer for #ML and #DeepLearning models. It is certainly much nicer than anything else I have seen. The main thing that stood out for me was that it supports ONNX and a whole bunch of other formats: Keras, CoreML, TensorFlow (including Lite and JS), Caffe, Caffe2, and MXNet. How awesome is that?

This is essentially a cross platform PWA (progressive web app) built using Electron (JavaScript, HTML5, CSS) – which means it can run on most platforms and run-times: just a browser, Linux, Windows, etc. To debug it, it is best to use Visual Studio Code, along with the Chrome debugger extension.

Below are a couple of examples of visualizing a ResNet-50 model – you can see both the start and the end of the visualization in the two images below to get a feel for things.

Start of ResNet-50 Model

End of ResNet-50 model

And some of the complex models seem very interesting. Here is an example of a TensorFlow Inception (v3) model.

And of course, this can get very complex (below is the same model, just zoomed out more).

I do think it is a brilliant tool to help understand the flow of things, and what one can do to optimize or fix. It is also very helpful for folks who are just starting to learn and appreciate the nuances.

The code is released under an MIT license and you can download it here.

Machine learning use-cases

Tue, 05 Jun 2018 00:00:00 +0000

Someone recently asked me what some of the use cases / examples of machine learning are. Whilst this might seem obvious to some of us, it isn’t the case for many businesses and enterprises – despite the fact that they use elements of #ML (and #AI) in their daily lives as consumers.

The discussion gets more interesting based on the specific domain and the possible use cases (of course understanding that some might not be sure of the use case – hence the question in the first place). But this did get me thinking, and I wanted to share one of the images we use internally as part of our training that outlines some of the use cases.

Machine Learning Use Cases

These are not 1:1 and many of them can be combined to address various use cases – for example, an #IoT device sending in sensor data that triggers a boundary condition (via a #RulesEngine), which in addition to executing one or more business rules can trigger an alert to a human-in-the-loop (#AugmentingWorkforce) via a #DigitalAssistant (say #Cortana) to make her/him aware, or to confirm some corrective action and the like. The possibilities are endless – but each of these elements triggered by AI/ML is still a narrow case and needs to be thought of as part of the holistic picture.

Synthetic Sound

Sun, 03 Jun 2018 00:00:00 +0000

Trained a model to create a synthetic sound that sounds like me. This is after training it with about 30 sentences - which isn’t a lot.

To create a synthetic voice, you enter some text, which is then “transcribed” using #AI, and your synthetic voice is generated. In my case, at first, I had written AI, which was also generated as “aeey” (you can listen here). So for the next one, I changed AI to Artificial Intelligence.


One does need to be mindful of #DigitalEthics, as this technology improves further. This is with only a very small sampling of data. Imagine what could happen, with public figures - where their recordings are available quite easily in the public domain. I am thinking the ‘digital twang’ is one of the signatures and ways to stamp this as a generated sound.

My self-driving car

Thu, 31 May 2018 00:00:00 +0000

Over the last few weeks, I built a self-driving car - which essentially is a remote control (RC) car that uses a Raspberry Pi running Python and TensorFlow, implementing an end-to-end convolutional neural network (CNN).

Of course other than being a bit geeky, I do think this is very cool to help understand and get into some of the basic constructs and mechanics around a number of things - web page design, hardware (maker things), and Artificial Intelligence principles.

There are two different models here - they do use the same ESC and controller, which can be programmed. My 3D printer did mess up a little (my supports were a little off), which is why the top does not look clean.

The sensor and camera are quite basic, and there are provisions to add more and do better over time. The Pi isn’t powerful enough to train the model - you need another machine for that (preferably an i7 with a GPU). Once trained, you can run the model on the Pi for inference.

This is the second car, which has slightly different hardware, but the ESC to control the motor and the actuators are the same.

The code is simple enough; below is an example of the camera (attached) to the Pi, saving the images it is seeing. Tubs is the location where the images are saved; these can then be transferred to another machine for training or inference.


import donkey as dk

#initialize the vehicle
V = dk.Vehicle()

#add a camera part
cam = dk.parts.PiCamera()
V.add(cam, outputs=['image'], threaded=True)

#add tub part to record images
tub = dk.parts.Tub(path='~/d2/gettings_started', inputs=['image'], types=['image_array'])
V.add(tub, inputs=['image'])

#start the vehicle's drive loop
V.start(max_loop_count=100)

Below you can see the car driving itself around the track, where it had to be trained first. The reason it is not driving perfectly is because during training (when I was manually driving it around), I crashed a few times and as a result the training data was messed up. Needed more time to clean that up and retrain it.


This is based on donkey car - which is an open source DIY platform for small-scale self driving cars. I think it is also perfect for those with teenagers or slightly older kids to get in and experiment. You can read up more details on how to go about building this, and the parts needed, here.

AI photo and style transfer

Tue, 22 May 2018 00:00:00 +0000

Can #AI make me look (more) presentable? The jury is out I think.

This is called style transfer, where the style/technique from a kind of painting (could be a photo too) is applied to an image to create a new image. I took this using the built-in camera on my machine, sitting at my desk, and then applied the different kinds of ‘styles’ to it. Each of these styles is a separate #deeplearning model that has learned how to apply the relevant style to a source image.

Style - Candy

Style - Feathers

Style - Mosaic

Style - Robert

Specifically, this uses a Neural Network (#DeepLearning) model called VGG19, which is a 19-layer model running on TensorFlow. Of course, you can export this to an ONNX model, which can then be used in most other run-times and libraries.

This is inspired by the paper Perceptual Losses for Real-Time Style Transfer and Super-Resolution, available on arXiv (hosted by Cornell University). Below is a snapshot of the VGG code.


import numpy as np
import scipy.io
import tensorflow as tf

def net(data_path, input_image):
    # Layer names in the order they appear in the pre-trained VGG19 file
    layers = (
        'conv1_1', 'relu1_1', 'conv1_2', 'relu1_2', 'pool1',
        'conv2_1', 'relu2_1', 'conv2_2', 'relu2_2', 'pool2',
        'conv3_1', 'relu3_1', 'conv3_2', 'relu3_2', 'conv3_3',
        'relu3_3', 'conv3_4', 'relu3_4', 'pool3',
        'conv4_1', 'relu4_1', 'conv4_2', 'relu4_2', 'conv4_3',
        'relu4_3', 'conv4_4', 'relu4_4', 'pool4',
        'conv5_1', 'relu5_1', 'conv5_2', 'relu5_2', 'conv5_3',
        'relu5_3', 'conv5_4', 'relu5_4'
    )

    data = scipy.io.loadmat(data_path)
    mean = data['normalization'][0][0][0]
    mean_pixel = np.mean(mean, axis=(0, 1))
    weights = data['layers'][0]

    net = {}
    current = input_image

    for i, name in enumerate(layers):
        # The first four characters identify the layer type: conv, relu or pool
        kind = name[:4]
        if kind == 'conv':
            kernels, bias = weights[i][0][0][0][0]
            # matconvnet: weights are [width, height, in_channels, out_channels]
            # tensorflow: weights are [height, width, in_channels, out_channels]
            kernels = np.transpose(kernels, (1, 0, 2, 3))
            bias = bias.reshape(-1)
            current = _conv_layer(current, kernels, bias)
        elif kind == 'relu':
            current = tf.nn.relu(current)
        elif kind == 'pool':
            current = _pool_layer(current)
        # Record the output of every layer, whatever its type
        net[name] = current

    assert len(net) == len(layers)
    return net

def _conv_layer(input, weights, bias):
    conv = tf.nn.conv2d(input, tf.constant(weights), strides=(1, 1, 1, 1), padding='SAME')
    return tf.nn.bias_add(conv, bias)

def _pool_layer(input):
    return tf.nn.max_pool(input, ksize=(1, 2, 2, 1), strides=(1, 2, 2, 1), padding='SAME')

If you want to play with this, you can download the code. Personally, I like the Mosaic style the best.

Artificial Intelligence (AI)

Thu, 01 Mar 2018 00:00:00 +0000

Yesterday it worked
Today it is not working
#AI is like that

#Haiku #GeekyHaiku #GeekyJokes

DARPA's perspective on AI

Wed, 11 Oct 2017 00:00:00 +0000

One of the challenges we have with AI is that there isn’t any universal definition - it is a broad category that means everything to everyone. Debating the rights and the wrongs, and the should’s and the shouldn’ts, is another post though.

DARPA outlines this as the “programmed ability to process information” and across a certain set of criteria that span across perceiving, learning, abstracting, and, reasoning.

AI Scale Intelligence

They classify AI in three waves, outlined below. Each of these is at a different level across the intelligence scale. I believe it is important to have a scale such as this - it will help temper expectations and compare apples to apples; for enterprises it will help create roadmaps on outcomes and their implementations; and finally it will help cut through the hype cycle noise that AI has generated.

Wave 1 - Handcrafted Knowledge

The first wave operates on a very narrow problem area (the domain) and essentially has no (self-)learning capability. The key thing to understand is that the machine can explore specifics, based on the knowledge and related taxonomy/structure defined by humans. We create a set of rules to represent the knowledge in a well-defined domain.

Of course as the Autonomous grand challenge taught us - it cannot handle uncertainty.

AI First wave stumbles

Wave 2 - Statistical Learning

The second wave has better classification and prediction capabilities - a lot of which is via statistical learning. Essentially, problems in certain domains are solved by statistical models, which are trained on big data. It still doesn’t have contextual ability and has minimal reasoning ability.

A lot of what we are seeing today is related to this second wave, and one of the hypotheses holding this up is called the manifold hypothesis. This essentially states that high dimension data (e.g. images, speech, etc.) tends to be in the vicinity of low dimension manifolds.

A manifold is an abstract mathematical space which, in a close-up view, resembles the spaces described by Euclidean geometry. Think of it as a set of points satisfying certain relationships, expressible in terms of distance and angle. Each manifold represents a different entity and the understanding of the data comes by separating the manifolds.

Using handwritten digits as an example - each image is one element in a set which has 784 dimensions, which form a number of different manifolds.
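Where does 784 come from? A tiny sketch, assuming the digits are 28×28 grayscale images (as in the widely used MNIST dataset, which the post does not name explicitly):

```python
import numpy as np

image = np.zeros((28, 28))   # hypothetical blank 28x28 grayscale digit
vector = image.reshape(-1)   # flatten: one dimension per pixel

print(vector.shape)
```

Each image therefore becomes a single point in a 784-dimensional space, and the hypothesis is that images of the same digit cluster near a much lower dimensional surface within it.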

Handwritten digits

Manifolds of handwriting

Separating each of these manifolds (by stretching and squishing the data) to get them isolated is what makes the layers in a neural network work. Each layer in the neural network computes its output from the preceding layer of inputs (usually implemented by a non-linear function) - learning from the data.

AI Neural Nets

AI Neural Nets learning from data

So, in statistical learning, one would design and program the network structure based on experience. Here is an example of how the number 2 to be recognized goes through the various feature maps.

AI Structural neural network

And one can combine and layer the various kinds of neural networks together (e.g. a CNN + RNN).

Layering neural networks

And whilst it is statistically impressive, it is also individually unreliable.

AI failure

Wave 3 - Contextual Adaptation

The future of AI is what DARPA is calling Contextual adaptation - where models explain their decisions, which are then used to drive further decisions. Essentially, one ends up in a world where we construct contextual explanatory models that are reflective of real world situations.

AI Models to explain decisions

AI Models to drive decisions

In summary, we are in the midst of Wave 2 - which is already very exciting. For an enterprise, it is key to have a scale that outlines the ability to process information across the intelligence scale to help make this AI revolution more tangible and manageable.

First Wave of AI - Handcraft Knowledge

Second Wave of AI - Statistical Learning

Third Wave of AI - Contextual adaption

Third Wave of AI - Contextual adaptions

PS - if you want to read up more on manifold hypothesis and how they play in neural networks, I would suggest reading Chris’s blog post .

Cognitive Bias

Thu, 28 Sep 2017 00:00:00 +0000

Cognitive Bias

Machine Learning basics

Sun, 04 Jun 2017 00:00:00 +0000

Thinking about #machinelearning? It will be helpful to understand some numerical computations and concepts that affect the #ML algorithm.

One might not interact with these directly, but we surely can feel the effect. The things you need to think about are:

1. Overflow and underflow - think of them as rounding up or down errors that shift the functions enough that, compounded across the iterations, they can be devastating. Of course, one can also easily get to division by zero.

2. Poor conditioning - essentially with small changes of input data, how large can the output move. You want this small. (And in cryptography you want the opposite, and large).

3. Gradient optimizations - there will be some optimization happening in the algorithm, question is how does it handle various local points on the curve? Local minimum, saddle points, and local maximum. Generally speaking, it’s about optimizing continuous spaces.

Some algorithms take this a step further by measuring a second derivative (think of it as measuring the derivative of a derivative - the curvature of a function).

4. Constrained Optimization - sometimes we just want to operate on a subset - so constraints only on that set.
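The first two items are easy to demonstrate. This is a sketch with made-up numbers: a softmax that overflows unless the inputs are shifted first, and two small matrices whose condition numbers show how much an output can move for a small change in input.

```python
import numpy as np

def naive_softmax(z):
    exps = np.exp(z)
    return exps / np.sum(exps)

def stable_softmax(z):
    exps = np.exp(z - np.max(z))  # shifting by the max leaves the result unchanged
    return exps / np.sum(exps)

z = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(z)   # exp(1000) overflows to inf, and inf / inf is nan
stable = stable_softmax(z)
print(naive, stable)

# Conditioning: a large condition number means small input changes
# can produce large output changes when solving with the matrix.
well = np.array([[2.0, 0.0], [0.0, 1.0]])
poor = np.array([[1.0, 1.0], [1.0, 1.0001]])
cond_well = np.linalg.cond(well)
cond_poor = np.linalg.cond(poor)
print(cond_well, cond_poor)
```

Most libraries apply the shifting trick internally, which is why one rarely interacts with these issues directly but can still feel their effect.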

All of these come into play in some way, directly or indirectly, and having a basic understanding of them and their constraints would go a long way.

What’s the difference between #AI, #ML, and #DeepLearning?

Thu, 25 May 2017 00:00:00 +0000

I know I have had to explain this a lot in most #AI related conversations that I have had - and lately those have been quite a lot. In my experience, most people use these terms interchangeably when they mean one rather than another.

Whilst they all are (inter)related and one might help trigger the other, they are still fundamentally different and at some point, it is good to understand the differences. I like the image below (source): whilst on one hand it shows a time graph, what is interesting is the correlation between them and how one is a subset of the other.

#AI vs #ML vs #DNN

#AI is getting more powerful, and the potential that personally really excites me is the paradigm shift we are starting to see. Fundamentally, it is changing how we use, interact with, and value computers and technology.

It is shifting from us learning machines and their idiosyncrasies (remember when being computer literate was a differentiator on a resume) to technology learning us and interacting with us in a more natural, and dare I say human, manner.

AI paradigm shift

I almost see it as Star Trek (and now showing my age) - the computer is everywhere, yet it is nowhere. It is embedded and woven into everything we do on the Enterprise rather than some “thing” one interacts with.

And it is awesome to start seeing some of this come to life, even if it is in a demo as outlined at Build a couple of weeks ago: #AI in the Workplace, and how it interacts with objects in real-time and can invoke and interact with business workflows (such as workplace policies).

AI in Workplace

Policy violation

The volume of calculations is pretty phenomenal - 27 million / sec [separately, I would love to understand the definition of a calculation 😄]. But then, given where we are heading, with a fully autonomous car generating about 100GB of data each second, this isn’t small potatoes.

And whilst you can read up more on these terms and how they link, I would really like to move away from the different terms, which most people confuse in the first place, and start thinking more about business outcomes and how enterprises and people will use them.

To that end, the three buckets of Intelligent Automation, Robotic Process Automation (RPA), and Physical Automation are what we have found to work better. On RPA, the one caveat is that it is not about robots, but rather the automation of a (business) process. The robots aspect would fall under physical automation - which essentially is anything that interacts with the real/physical world.

Core principle of Machine Learning

Tue, 04 Apr 2017 00:00:00 +0000

There are of course many, but for someone coming from computer science and software engineering, where the environment is relatively clean and certain (deterministic), it is usually a leap to understand that Machine Learning (and other elements of #AI) is not.

Machine learning is based on probability theory and deals with stochastic (non-deterministic) elements all the time. Nearly all activities in machine learning require the ability to factor in and, more importantly, represent and reason with uncertainty.

To that end, when designing a system, it is recommended to use a simple but uncertain (with some non-deterministic aspects) rule, rather than a complex but certain rule.

For example, having a simple but uncertain rule saying “most birds fly”, is easier and more effective than a certain rule such as “Birds can fly, except flightless species, or those who are sick, or babies, etc.”
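
As a rough illustration of the “most birds fly” idea, here is a minimal Python sketch. The observations and the function names are made up for this example (they are assumptions, not real data); the point is only that a single probability estimated from data can stand in for a long list of exceptions.

```python
# Hypothetical observations: (species, can_fly). The data is invented for illustration.
observations = [
    ("sparrow", True), ("eagle", True), ("pigeon", True), ("crow", True),
    ("gull", True), ("robin", True), ("owl", True), ("swallow", True),
    ("penguin", False), ("ostrich", False),
]


def probability_of_flight(data):
    # Estimate P(flies | bird) as the fraction of observed birds that fly.
    flying = sum(1 for _, can_fly in data if can_fly)
    return flying / len(data)


def simple_rule(p_fly, threshold=0.5):
    # The simple but uncertain rule: no list of exceptions, just a probability.
    return "probably flies" if p_fly > threshold else "probably does not fly"


p_fly = probability_of_flight(observations)
print(f"P(flies | bird) = {p_fly:.2f} -> {simple_rule(p_fly)}")
```

The rule will be wrong for the penguin, and that is the trade-off: it accepts some uncertainty in exchange for staying simple.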

As one gets deeper into Machine Learning, a trip down memory lane around probability distributions, expectation, variance, and covariance won’t hurt.
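
For that trip down memory lane, here is a small Python sketch of the three quantities for a discrete distribution. The fair die and the second variable (twice the die roll) are assumptions I picked because the answers are easy to check by hand; the function names are hypothetical.

```python
from fractions import Fraction


def expectation(dist):
    # E[X] = sum of x * P(x); dist maps each value to its probability.
    return sum(p * x for x, p in dist.items())


def variance(dist):
    # Var(X) = E[(X - E[X])^2]
    mu = expectation(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())


def covariance(joint):
    # Cov(X, Y) = E[(X - E[X]) * (Y - E[Y])]; joint maps (x, y) to a probability.
    ex = sum(p * x for (x, _), p in joint.items())
    ey = sum(p * y for (_, y), p in joint.items())
    return sum(p * (x - ex) * (y - ey) for (x, y), p in joint.items())


# A fair six-sided die, with exact fractions to avoid rounding noise.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Joint distribution of X (the roll) and Y = 2 * X.
joint = {(face, 2 * face): Fraction(1, 6) for face in range(1, 7)}

print(expectation(die))   # 7/2
print(variance(die))      # 35/12
print(covariance(joint))  # 35/6, which is 2 * Var(X)
```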

HoloPortation - Limits of Human Kind

Fri, 17 Mar 2017 00:00:00 +0000

When it comes to AI and the limits of humankind, what better example of the art of the possible than what Microsoft is doing with spatial awareness, HoloLens, and other sensors?

And not only can this replay time and give you a ’living memory’, but it is also mobile.

I do believe we are living in the greatest time ever! :)

Neural Networks

Thu, 16 Mar 2017 00:00:00 +0000

Of course you have heard of Neural Networks! In the context of #AI they are all the buzz.

You might have heard of some such as DFF (Deep Feed Forward) or RNN (Recurrent neural networks)? Or perhaps you meant Recursive neural networks? Irrespective, it can be quite messy as you can see below and it would be somewhat important to have some understanding of the differences.

And in case you are thinking, well, what good or use is all this? Here is one example (MarI/O - Machine Learning for Video Games) that shows how a computer learned to play Mario using a neural network.

MarI/O uses something called NEAT (NeuroEvolution of Augmenting Topologies), is written in Lua (a lightweight scripting language), and runs in BizHawk, which is an emulator for games and their various platforms (and not to be confused with BizTalk). You can check out the code for this here.

Fjodor has also written a (very) brief outline of what some of these are and what they mean. If you just want a quick, basic understanding, it is a great read, with of course links back to the original research papers (and deeper reads) if that is your cup of tea.

Happy reading! 😄

Object and scene detection with #AI

Fri, 02 Dec 2016 00:00:00 +0000

Continuing the previous #ArtificialIntelligence theme, I wanted to see how Amazon’s Rekognition works and how it differs from the #AI offerings of others, such as Microsoft.

Here is a #ProjectMurphy image’s confidence score. I am glad to see that there is a 99% confidence that this is a person.

Object and Scene detection

The POST request is quite simple:

{
  "method": "POST",
  "path": "/",
  "region": "us-west-2",
  "headers": {
    "Content-Type": "application/x-amz-json-1.1",
    "X-Amz-Date": "Thu, 01 Dec 2016 22:21:01 GMT",
    "X-Amz-Target": "com.amazonaws.rekognitionservice.RekognitionService.DetectLabels"
  },
  "contentString": {
    "Attributes": [
      "ALL"
    ],
    "Image": {
      "Bytes": "..."
    }
  }
}

And so is the response:

{
  "Labels": [
    {
      "Confidence": 99.2780990600586,
      "Name": "People"
    },
    {
      "Confidence": 99.2780990600586,
      "Name": "Person"
    },
    {
      "Confidence": 99.27307891845703,
      "Name": "Human"
    },
    {
      "Confidence": 73.7669448852539,
      "Name": "Flyer"
    },
    {
      "Confidence": 73.7669448852539,
      "Name": "Poster"
    },
    {
      "Confidence": 68.23612213134765,
      "Name": "Art"
    },
    {
      "Confidence": 58.291263580322266,
      "Name": "Brochure"
    },
    {
      "Confidence": 55.91957092285156,
      "Name": "Modern Art"
    },
    {
      "Confidence": 53.9996223449707,
      "Name": "Blossom"
    },
    {
      "Confidence": 53.9996223449707,
      "Name": "Flora"
    },
    {
      "Confidence": 53.9996223449707,
      "Name": "Flower"
    },
    {
      "Confidence": 53.9996223449707,
      "Name": "Petal"
    },
    {
      "Confidence": 53.9996223449707,
      "Name": "Plant"
    },
    {
      "Confidence": 50.69965744018555,
      "Name": "Face"
    },
    {
      "Confidence": 50.69965744018555,
      "Name": "Selfie"
    }
  ]
}
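
In practice one would rarely use every label that comes back. A minimal sketch in Python of filtering the labels by confidence follows; the cut-off of 90 and the function name `labels_above` are my assumptions, and the response text is a trimmed copy of the one above rather than a live API call.

```python
import json

# Trimmed copy of the DetectLabels response shown above.
response_text = """
{
  "Labels": [
    {"Confidence": 99.2780990600586, "Name": "People"},
    {"Confidence": 99.2780990600586, "Name": "Person"},
    {"Confidence": 99.27307891845703, "Name": "Human"},
    {"Confidence": 73.7669448852539, "Name": "Flyer"},
    {"Confidence": 73.7669448852539, "Name": "Poster"},
    {"Confidence": 50.69965744018555, "Name": "Face"}
  ]
}
"""


def labels_above(response, min_confidence):
    # Keep only the label names at or above the chosen confidence.
    return [
        label["Name"]
        for label in response["Labels"]
        if label["Confidence"] >= min_confidence
    ]


response = json.loads(response_text)
confident = labels_above(response, 90.0)  # 90 is an arbitrary cut-off
print(confident)
```

Where to draw the line depends on the use case; for this image a high cut-off keeps the person-related labels and drops the guesses such as "Flyer" and "Poster".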

Here is what the facial analysis shows:

Facial Analysis

However, how does it handle something a little more complex?

Object and Scene detection

And finally, what of the comparison? I think there might be some more work to be done on that front.

Face Comparison capture

Here is the response:

{
  "FaceMatches": [
    {
      "Face": {
        "BoundingBox": {
          "Height": 0.3878205120563507,
          "Left": 0.2371794879436493,
          "Top": 0.22435897588729858,
          "Width": 0.3878205120563507
        },
        "Confidence": 99.79533386230469
      },
      "Similarity": 0
    }
  ],
  "SourceImageFace": {
    "BoundingBox": {
      "Height": 0.209781214594841,
      "Left": 0.4188888967037201,
      "Top": 0.13127413392066955,
      "Width": 0.18111111223697662
    },
    "Confidence": 99.99442291259765
  }
}
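
To read this response a little more easily, here is a small Python sketch. As I understand the Rekognition documentation, the bounding box values are ratios of the overall image size, so they need to be multiplied by the image dimensions to get pixels. The 1000 x 1000 image size and the helper names are assumptions for illustration; the post does not state the actual image dimensions.

```python
# The matched face from the response above.
face_match = {
    "Face": {
        "BoundingBox": {
            "Height": 0.3878205120563507,
            "Left": 0.2371794879436493,
            "Top": 0.22435897588729858,
            "Width": 0.3878205120563507,
        },
        "Confidence": 99.79533386230469,
    },
    "Similarity": 0,
}


def to_pixels(box, image_width, image_height):
    # Convert ratio-based box values to pixel coordinates.
    return {
        "left": round(box["Left"] * image_width),
        "top": round(box["Top"] * image_height),
        "width": round(box["Width"] * image_width),
        "height": round(box["Height"] * image_height),
    }


def describe(match):
    # "Confidence" says how sure the service is that this is a face;
    # "Similarity" says how closely it matches the source face.
    return (
        f"face detected with {match['Face']['Confidence']:.1f}% confidence, "
        f"similarity to source: {match['Similarity']}%"
    )


# Hypothetical image size of 1000 x 1000 pixels.
pixels = to_pixels(face_match["Face"]["BoundingBox"], 1000, 1000)
print(pixels)
print(describe(face_match))
```

The two numbers answer different questions: the face is found with near certainty, yet the similarity score is 0, which is where the extra work seems to be needed.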

Playing with #AI

Thu, 01 Dec 2016 00:00:00 +0000

So, been spending a lot of time recently around many things related to Artificial Intelligence (#AI). More on that some day. :)

Was curious about Amazon’s announcement yesterday to jump on this bandwagon. Of course Microsoft and others have been there. I don’t know to what extent Amazon has been working on this, but given Alexa has been out for a couple of years, I know they have had rich pickings for tuning this further.

I thought Polly (like the parrot?) was quite different from the things I have seen from others. This is text-to-speech: it renders the input text as synthesized speech in various dialects, with a few outputs to choose from for each. One can use it through a simple API (the Android example shows it is not very complex to consume; of course you still need to think about the overall design and elements of software engineering, such as latency, limits, and bandwidth). Should you desire, you can customize the pronunciation using lexicons.

Here are a few examples, none of them are me, and hence the “cold”.

Australian (Male): [audio sample]

Indian (Female): [audio sample]

Italian (Male): [audio sample]

US/American (Male): [audio sample]

Of course, if you play with it, it is easy to pick up the patterns and what is being changed, versus not. But kudos to the team on this. I think it will help accelerate the adoption of #AI.
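
For reference, calling the text-to-speech API from Python could look roughly like the sketch below. It assumes the `boto3` package is installed and AWS credentials are configured. The voice names are real Polly voices, but which voices the samples above actually used is not recorded here, so the mapping is my assumption, as are the helper names and the output path.

```python
# Assumed mapping from the dialects above to Polly voice IDs.
VOICES = {
    "Australian (Male)": "Russell",
    "Indian (Female)": "Raveena",
    "Italian (Male)": "Giorgio",
    "US/American (Male)": "Joey",
}


def build_request(text, voice_id, output_format="mp3"):
    # Parameters for Polly's SynthesizeSpeech operation.
    return {"Text": text, "VoiceId": voice_id, "OutputFormat": output_format}


def synthesize(text, voice_id, path):
    # Imported here so the rest of the sketch works without boto3 installed.
    import boto3

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_request(text, voice_id))
    # The audio comes back as a stream; write it to a file.
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())


request = build_request("Hello from Polly", VOICES["Australian (Male)"])
print(request)
# To actually generate audio (needs AWS credentials):
# synthesize("Hello from Polly", VOICES["Australian (Male)"], "hello.mp3")
```

Keeping the request building separate from the network call makes it easy to check the parameters without touching the service.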