Pankaj Verma

Full Stack Developer

December 14, 2023 · 3 min read

Unleashing the Power of Multimodal Intelligence: How to Use Gemini-Pro-Vision with LangChain


Google's AI arms race just got hotter! On December 13th, just a week after unveiling its AI behemoth Gemini, the tech giant unleashed its enterprise-ready sibling, Gemini Pro. This powerful language model empowers businesses and developers globally to build the future of AI applications.

In this blog, we will take a deep dive into using LangChain together with Google's Gemini models.

What is Gemini-Pro-Vision and LangChain?

  • Gemini-Pro-Vision: A multimodal monster, seamlessly processing text and images to understand the world like never before. Think of it as a super-powered AI detective, deciphering clues from both written and visual sources.
  • LangChain: The ultimate AI playground, fostering collaboration and building context-aware applications. Imagine a vibrant community of AI models, working together to learn and reason from diverse data sources.

Why use them together?

Combining these two titans creates a symphony of intelligence:

  • Go beyond text: Break free from the limitations of purely text-based AI. With Gemini-Pro-Vision, LangChain can now analyze images, graphs, and other visual data, enriching its understanding of the world.
  • Reason across modalities: Unravel complex connections between text and visuals. LangChain can now use Gemini-Pro-Vision's insights to make inferences and draw conclusions based on both written and visual information.
  • Unlock new applications: The possibilities are endless! Build applications that answer questions based on images and text, generate creative content inspired by visuals, or even develop AI assistants that understand the world around them.
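Before wiring the two together, you need the LangChain packages and a Google AI API key. A minimal setup sketch (the `@langchain/google-genai` integration reads the key from the `GOOGLE_API_KEY` environment variable):

```shell
# Install LangChain's core package and the Google GenAI integration
npm install @langchain/core @langchain/google-genai

# The integration picks up your API key from this environment variable
export GOOGLE_API_KEY="your-api-key-here"
```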

Now, let's walk through a code snippet that shows LangChain and Gemini-Pro-Vision in action:

import { HumanMessage } from "@langchain/core/messages";
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import fs from "fs";

// Initialize the multimodal model
const model = new ChatGoogleGenerativeAI({
  modelName: "gemini-pro-vision",
  maxOutputTokens: 2048,
});

async function run() {
  try {
    // Read the image from disk and encode it as base64
    const image = fs.readFileSync("./Cricket_Stadium.jpg").toString("base64");
    const input2 = [
      new HumanMessage({
        content: [
          {
            type: "text",
            text: "Describe the following image.",
          },
          {
            type: "image_url",
            image_url: `data:image/jpeg;base64,${image}`,
          },
        ],
      }),
    ];
    // Send the text prompt and the image to Gemini-Pro-Vision
    const res = await model.invoke(input2);
    console.log(res.content);
  } catch (err) {
    console.error(err);
  }
}

run();
  • Importing the Tools: We import necessary libraries from LangChain's core and Google-GenAI modules, setting the stage for multimodal interactions.
  • Initializing the Model: We create a ChatGoogleGenerativeAI object, specifying the "gemini-pro-vision" model and desired output length.
  • Reading the Image: We use fs to read a cricket stadium image and convert it to base64 encoding, preparing it for Gemini-Pro-Vision's consumption.
  • Crafting the Input: We create a HumanMessage object containing two content elements: a text part requesting an image description, and an image_url part holding the base64-encoded image data.
  • Invoking the Model: We call the model's invoke method with the HumanMessage array, sending our image and description request. 
  • Capturing the Results: The model analyzes the image and text, returning its insights as a response object. This response holds the key to unlocking the image's secrets!
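The image-reading and encoding step above can be pulled into a small reusable helper. A minimal sketch (toDataUrl is a hypothetical name, not part of LangChain's API):

```javascript
import fs from "fs";

// Hypothetical helper: convert a local image file into the
// data-URL string that the image_url content element expects.
function toDataUrl(path, mimeType = "image/jpeg") {
  const base64 = fs.readFileSync(path).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}
```

You could then build the message with `image_url: toDataUrl("./Cricket_Stadium.jpg")`, keeping the encoding details out of the request-building code.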

Embrace the Multimodal Future

By combining the power of Gemini-Pro-Vision and LangChain, developers can unlock a new era of AI interaction. This is a call to action, an invitation to explore the boundless possibilities of multimodal intelligence. So, grab your curiosity, embrace the code, and start building the future with Gemini-Pro-Vision and LangChain! 

Remember, this is just the beginning. As you dive deeper into these tools, you'll uncover endless possibilities and contribute to the evolution of intelligent systems that understand and interact with the world around them, both visually and textually.

AI + webdev = 🪄