Pankaj Verma

Full Stack Developer

December 14, 2023 | 3 min read

Unleashing the Power of Multimodal Intelligence: How to Use Gemini-Pro-Vision with LangChain

Google's AI arms race just got hotter! On December 13th, just a week after unveiling its AI behemoth Gemini, the tech giant released its enterprise-ready sibling, Gemini Pro. Developers and businesses around the world can now call this model from their own code and build AI applications on top of it.

In this post, we will use LangChain with Google's Gemini models and take a deep dive into how they work together.

What is Gemini-Pro-Vision and LangChain?

Gemini-Pro-Vision: A multimodal model that processes text and images together. Think of it as a super-powered AI detective, deciphering clues from both written and visual sources.
LangChain: A framework for building context-aware applications powered by language models. It provides shared building blocks, such as chat model wrappers and message types, so that models and data sources can be combined in one application.

Why use them together?

Combining these two titans creates a symphony of intelligence:

Go beyond text: Break free from the limitations of purely text-based AI. With Gemini-Pro-Vision, a LangChain application can analyze images, graphs, and other visual data alongside text.
Reason across modalities: Unravel complex connections between text and visuals. A LangChain application can use Gemini-Pro-Vision's output to make inferences and draw conclusions based on both written and visual information.
Unlock new applications: Build applications that answer questions based on images and text, generate creative content inspired by visuals, or power AI assistants that respond to what they see.

Now, let's delve into the code, where we'll see LangChain and Gemini-Pro-Vision in action:

import { HumanMessage } from "@langchain/core/messages";
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import fs from "fs";

// Configure the multimodal Gemini model.
const model = new ChatGoogleGenerativeAI({
  modelName: "gemini-pro-vision",
  maxOutputTokens: 2048,
  // Replace the placeholder with your own key; avoid committing real keys.
  apiKey: "XXXXXXXXXXXXXXXXX",
});

async function run() {
  try {
    // Read the image from disk and encode it as base64.
    const image = fs.readFileSync("./Cricket_Stadium.jpg").toString("base64");
    const input2 = [
      new HumanMessage({
        content: [
          {
            type: "text",
            text: "Describe the following image.",
          },
          {
            type: "image_url",
            // The file is a JPEG, so the data URL declares image/jpeg.
            image_url: `data:image/jpeg;base64,${image}`,
          },
        ],
      }),
    ];

    // Send the text and image to the model and print the response.
    const res = await model.invoke(input2);
    console.log(res);
  } catch (err) {
    console.error(err);
  }
}

// Start the example.
run();

Importing the Tools: We import HumanMessage from LangChain's core module, ChatGoogleGenerativeAI from the Google GenAI integration, and Node's fs module for reading files.
Initializing the Model: We create a ChatGoogleGenerativeAI object, specifying the "gemini-pro-vision" model, a maximum output length of 2048 tokens, and the API key.
Reading the Image: We use fs to read a cricket stadium image and convert it to a base64 string so it can be embedded in the request.
Crafting the Input: We create a HumanMessage whose content has two elements: a text part asking for a description of the image, and an image_url part holding the base64-encoded image as a data URL.
Invoking the Model: We call the model's invoke method with the array containing the HumanMessage, sending our image and description request.
Capturing the Results: The model analyzes the image and text and returns a response object, which we log to the console.
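
The reading, crafting, and capturing steps above can be wrapped in a few small helpers. What follows is a minimal sketch, not part of LangChain: toDataUrl, buildImagePrompt, extractText, and the MIME_TYPES table are hypothetical names introduced here for illustration. It assumes the MIME type in the data URL should match the image's file extension, and that the response exposes its output on a content field that may be either a plain string or an array of typed parts.

```javascript
// Hypothetical helpers for illustration; they are not part of LangChain.

// Assumed, non-exhaustive mapping from file extension to MIME type.
const MIME_TYPES = {
  jpg: "image/jpeg",
  jpeg: "image/jpeg",
  png: "image/png",
  webp: "image/webp",
};

// Build a base64 data URL whose MIME type matches the file extension.
function toDataUrl(fileName, bytes) {
  const ext = fileName.split(".").pop().toLowerCase();
  const mime = MIME_TYPES[ext];
  if (!mime) {
    throw new Error(`Unsupported image extension: ${ext}`);
  }
  return `data:${mime};base64,${Buffer.from(bytes).toString("base64")}`;
}

// Build the two-part content array (text + image) for a HumanMessage.
function buildImagePrompt(question, fileName, bytes) {
  return [
    { type: "text", text: question },
    { type: "image_url", image_url: toDataUrl(fileName, bytes) },
  ];
}

// Return the text of a response's content, which is assumed to be
// either a string or an array of parts such as { type: "text", text }.
function extractText(content) {
  if (typeof content === "string") return content;
  return content
    .filter((part) => part.type === "text")
    .map((part) => part.text)
    .join("");
}
```

With these in place, the content passed to HumanMessage could be built as buildImagePrompt("Describe the following image.", "./Cricket_Stadium.jpg", fs.readFileSync("./Cricket_Stadium.jpg")), and the description could be printed with console.log(extractText(res.content)) instead of logging the whole response object.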

Embrace the Multimodal

By combining the power of Gemini-Pro-Vision and LangChain, developers can unlock a new era of AI interaction. This is a call to action, an invitation to explore the boundless possibilities of multimodal intelligence. So, grab your curiosity, embrace the code, and start building the future with Gemini-Pro-Vision and LangChain!

Remember, this is just the beginning. As you dive deeper into these tools, you'll uncover endless possibilities and contribute to the evolution of intelligent systems that understand and interact with the world around them, both visually and textually.

AI + webdev = 🪄

***