How to Use Google Gemini Models for Computer Vision Tasks?

Translator | 李睿 (Li Rui)

Reviewer | 重楼 (Chong Lou)

Since the rise of AI chatbots, Google Gemini has stood out as one of the major players driving the evolution of intelligent systems. Beyond its strong conversational abilities, Gemini also unlocks practical computer vision applications, allowing machines to see, interpret, and describe the world around them.

This article walks step by step through using Google Gemini for computer vision tasks: setting up the environment, sending images together with instructions, and interpreting the model's output for object detection, caption generation, and OCR. It also looks at data annotation tools (such as those used with YOLO) to give context for custom training scenarios.

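Introduction to Google Gemini

Google Gemini is a family of AI models built to handle multiple data types, such as text, images, audio, and code. This means it can handle tasks that involve understanding both pictures and text.

Key Features of Gemini 2.5 Pro

• Multimodal input: accepts a combination of text and images in a single request.

• Reasoning: the model can analyze the input to perform tasks such as identifying objects or describing a scene.

• Instruction following: responds to text instructions (prompts) that guide how it analyzes the image.

These features let developers use Google Gemini for vision-related tasks through an API, without training a separate model for each task.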

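The Role of Data Annotation: YOLO Annotators

Although Gemini models have strong zero-shot and few-shot capabilities on computer vision tasks, building a highly specialized computer vision model requires training on a dataset tailored to the specific problem. This is where data annotation becomes essential, especially for supervised learning tasks such as training a custom object detector.

A YOLO annotator (usually meaning a tool compatible with the YOLO format, such as LabelImg, CVAT, or Roboflow) is designed for creating labeled datasets.

What Is Data Annotation?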


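For object detection, annotation involves drawing a bounding box around each object of interest in an image and assigning a class label (for example, "car", "person", "dog"). This annotated data tells the model what to look for, and where, during training.

Key Features of Annotation Tools (Such as YOLO Annotators)

• User interface: they provide a graphical interface for loading images, drawing boxes (or polygons, keypoints, and so on), and assigning labels efficiently.

• Format compatibility: tools designed for YOLO models save annotations in the specific text file format that YOLO training scripts expect (typically one .txt file per image, containing class indices and normalized bounding box coordinates).

• Efficiency features: many tools include hotkeys, auto-save, and model-assisted labeling to speed up the often time-consuming annotation process. Batch processing makes it easier to work through large image sets.

• Integration: using a standard format such as YOLO ensures that the annotated data can be used directly with popular training frameworks, including Ultralytics YOLO.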
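To make the format described above concrete, here is a minimal sketch of how one annotation line might be produced from a pixel-space box. It assumes the common YOLO detection layout of class index followed by the normalized box center, width, and height; the function name `to_yolo_line` and the sample numbers are hypothetical and not taken from any particular annotation tool.

```python
def to_yolo_line(class_index, x1, y1, x2, y2, img_w, img_h):
    """Convert a pixel-space corner box to one YOLO-format annotation line."""
    # YOLO stores the box center and size, each divided by the image dimensions
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    bw = (x2 - x1) / img_w
    bh = (y2 - y1) / img_h
    return f"{class_index} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}"


# Hypothetical example: class 0 with a 200x100 px box in a 640x480 image
line = to_yolo_line(0, 100, 200, 300, 300, 640, 480)
print(line)  # 0 0.312500 0.520833 0.312500 0.208333
```

Each image's .txt file would then contain one such line per annotated object.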

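Although Google Gemini for computer vision can detect objects without prior annotation, if you need a model that detects specific custom objects (for example, a unique type of industrial equipment or a particular product defect), you will likely need to collect images and annotate them with a tool such as a YOLO annotator in order to train a dedicated YOLO model.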

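Code Implementation: Google Gemini for Computer Vision

First, install the required libraries.

Step 1: Install the Prerequisites

(1) Install the libraries

Run the following command in a notebook cell (the leading ! passes it to the shell; drop it when running directly in a terminal):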

!uv pip install -U -q google-genai ultralytics

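This command installs the google-genai library for communicating with the Gemini API, and the ultralytics library, which includes useful functions for handling images and drawing on them.

(2) Import the modules

Add these lines to your Python notebook: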

import json
import cv2
import ultralytics
from google import genai
from google.genai import types
from PIL import Image
from ultralytics.utils.downloads import safe_download
from ultralytics.utils.plotting import Annotator, colors
ultralytics.checks()

This code imports the libraries used for reading images (cv2, PIL), handling JSON data (json), interacting with the API (google.genai), and utility functions (ultralytics).

(3) Configure the API key

Initialize the client with your Google AI API key.

# Replace "your_api_key" with your actual key
# Initialize the Gemini client with your API key
client = genai.Client(api_key="your_api_key")

This step prepares the script to send authenticated requests.
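Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. As one possible alternative, the sketch below reads the key from an environment variable. The variable name `GEMINI_API_KEY` and the helper `get_api_key` are assumptions chosen for illustration, not something the original article prescribes.

```python
import os


def get_api_key(env_var="GEMINI_API_KEY"):
    """Return the API key stored in an environment variable, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key


# Usage in the notebook (assumes the variable has been set beforehand):
# client = genai.Client(api_key=get_api_key())
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the first API call.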

Step 2: Interact with Gemini

Create a function that sends a request to the model. It takes an image and a text prompt and returns the model's text output.

def inference(image, prompt, temp=0.5):
    """
    Performs inference using the Google Gemini 2.5 Pro Experimental model.

    Args:
        image (PIL.Image.Image): The input image, e.g. as returned by read_image().
        prompt (str): A text prompt to guide the model's response.
        temp (float, optional): Sampling temperature for response randomness. Default is 0.5.

    Returns:
        str: The text response generated by the Gemini model based on the prompt and image.
    """
    response = client.models.generate_content(
        model="gemini-2.5-pro-exp-03-25",
        contents=[prompt, image],  # Provide both the text prompt and image as input
        config=types.GenerateContentConfig(
            temperature=temp,  # Controls creativity vs. determinism in output
        ),
    )
    return response.text  # Return the generated textual response

Explanation

(1) The function sends the image and the text instruction (prompt) to the Gemini model named in the model argument.

(2) The temperature setting (temp) affects the randomness of the output; the lower the value, the more predictable the result.

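Step 3: Prepare the Image Data

The image must be loaded correctly before it is sent to the model. The function below downloads the image if necessary, reads it, converts the color format, and returns a PIL Image object together with its dimensions.

def read_image(filename):
    # Download the image if a URL is given and get its file name
    image_name = safe_download(filename)
    # Read the image with OpenCV and convert BGR to RGB
    # (the /content/ prefix assumes a Google Colab working directory)
    image = cv2.cvtColor(cv2.imread(f"/content/{image_name}"), cv2.COLOR_BGR2RGB)
    # Extract height and width (OpenCV arrays are shaped height x width x channels)
    h, w = image.shape[:2]
    # Convert the array into the PIL format and return it with its dimensions
    return Image.fromarray(image), w, h

Explanation

(1) The function reads the image file with OpenCV (cv2).

(2) It converts the image from OpenCV's BGR channel order to the standard RGB order.

(3) It returns the image as a PIL object, which is suitable for the inference function, along with its width and height.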
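For images that are already on disk, the OpenCV round trip is not strictly necessary. The following is a minimal sketch of a local-file variant that uses only PIL; the name `read_local_image` is hypothetical, and it returns the same (image, width, height) triple as the function above.

```python
from PIL import Image


def read_local_image(path):
    """Open a local image file and return (PIL image in RGB, width, height)."""
    image = Image.open(path).convert("RGB")  # PIL already uses RGB channel order
    w, h = image.size  # PIL reports size as (width, height)
    return image, w, h
```

Note the difference in ordering: a NumPy array's shape starts with the height, while PIL's size starts with the width.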

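Step 4: Format the Results

def clean_results(results):
    """Clean the results for visualization."""
    return results.strip().removeprefix("```json").removesuffix("```").strip()

This function strips the Markdown code fence that the model wraps around its reply, leaving a plain JSON string that json.loads can parse. Note that str.removeprefix and str.removesuffix require Python 3.9 or later.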
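The cleaning step can be sketched in a slightly more tolerant form that also copes with a fence that has no language tag, or with no fence at all. The helper name `parse_model_json` and the sample reply are hypothetical; the reply only mimics the shape of output that the prompts in this article ask for.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker


def parse_model_json(text):
    """Strip an optional Markdown code fence and parse the remainder as JSON."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (with or without a language tag)
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # Drop the closing fence
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    return json.loads(cleaned.strip())


# Hypothetical model reply wrapped in a json code fence
reply = FENCE + 'json\n[{"box_2d": [100, 200, 500, 600], "label": "laptop"}]\n' + FENCE
detections = parse_model_json(reply)
print(detections[0]["label"])  # laptop
```

If the reply is not valid JSON, json.loads raises json.JSONDecodeError, which the caller can catch to retry the request.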

Task 1: Object Detection

Gemini can find objects in an image and report their locations (bounding boxes) based on a text instruction.

# Define the text prompt
prompt = """
Detect the 2d bounding boxes of objects in image.
"""
# Fixed, plotting function depends on this.
output_prompt = "Return just box_2d and labels, no additional text."
image, w, h = read_image("https://media-cldnry.s-nbcnews.com/image/upload/t_fit-1000w,f_auto,q_auto:best/newscms/2019_02/2706861/190107-messy-desk-stock-cs-910a.jpg")  # Read image, extract width and height
results = inference(image, prompt + output_prompt)  # Perform inference
cln_results = json.loads(clean_results(results))  # Strip the code fence and parse the JSON into a list
annotator = Annotator(image)  # Initialize the Ultralytics annotator
for idx, item in enumerate(cln_results):
    # By default, the Gemini model returns the y coordinates first.
    # Scale normalized box coordinates (0-1000) to image dimensions
    y1, x1, y2, x2 = item["box_2d"]
    y1 = y1 / 1000 * h
    x1 = x1 / 1000 * w
    y2 = y2 / 1000 * h
    x2 = x2 / 1000 * w
    if x1 > x2:
        x1, x2 = x2, x1  # Swap x-coordinates if needed
    if y1 > y2:
        y1, y2 = y2, y1  # Swap y-coordinates if needed
    annotator.box_label([x1, y1, x2, y2], label=item["label"], color=colors(idx, True))
Image.fromarray(annotator.result())  # Display the output

Output

[Figure: the input image with the detected bounding boxes and labels drawn on it]

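Explanation

(1) The prompt tells the model what to look for and how to format the output (JSON).

(2) The code uses the image width (w) and height (h) to convert the normalized bounding box coordinates (0-1000) into pixel coordinates.

(3) The Annotator tool draws the boxes and labels on a copy of the image.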
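The coordinate conversion is repeated in every task, so it can be factored into a small helper. This is only a sketch of the same arithmetic under the same assumption (boxes arrive as [y1, x1, y2, x2] on a 0-1000 scale); the name `to_pixel_box` is hypothetical.

```python
def to_pixel_box(box_2d, w, h):
    """Convert a [y1, x1, y2, x2] box on a 0-1000 scale to pixel (x1, y1, x2, y2)."""
    y1, x1, y2, x2 = box_2d
    # Scale each coordinate by the matching image dimension
    x1, x2 = x1 / 1000 * w, x2 / 1000 * w
    y1, y2 = y1 / 1000 * h, y2 / 1000 * h
    # Order the corners so (x1, y1) is top-left and (x2, y2) is bottom-right
    return min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)


# Worked example on a 640x480 image
box = to_pixel_box([250, 500, 750, 1000], w=640, h=480)
print(box)  # (320.0, 120.0, 640.0, 360.0)
```

Inside the drawing loop, the result could be passed on as annotator.box_label(list(box), ...).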

Task 2: Testing Reasoning Capabilities

With the Gemini model, you can tackle complex tasks using advanced reasoning that understands context and gives more precise results.

# Define the text prompt
prompt = """
Detect the 2d bounding box around:
highlight the area of morning light +
PC on table
potted plant
coffee cup on table
"""
# Fixed, plotting function depends on this.
output_prompt = "Return just box_2d and labels, no additional text."
image, w, h = read_image("https://thumbs.dreamstime.com/b/modern-office-workspace-laptop-coffee-cup-cityscape-sunrise-sleek-desk-featuring-stationery-organized-neatly-city-345762953.jpg")  # Read image and extract width, height
results = inference(image, prompt + output_prompt)
# Clean the results and load them as a list
cln_results = json.loads(clean_results(results))
annotator = Annotator(image)  # Initialize the Ultralytics annotator
for idx, item in enumerate(cln_results):
    # By default, the Gemini model returns the y coordinates first.
    # Scale normalized box coordinates (0-1000) to image dimensions
    y1, x1, y2, x2 = item["box_2d"]
    y1 = y1 / 1000 * h
    x1 = x1 / 1000 * w
    y2 = y2 / 1000 * h
    x2 = x2 / 1000 * w
    if x1 > x2:
        x1, x2 = x2, x1  # Swap x-coordinates if needed
    if y1 > y2:
        y1, y2 = y2, y1  # Swap y-coordinates if needed
    annotator.box_label([x1, y1, x2, y2], label=item["label"], color=colors(idx, True))
Image.fromarray(annotator.result())  # Display the output

Output

[Figure: the input image with the requested regions boxed and labeled]

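Explanation

(1) This code block uses a more complex prompt to test the model's reasoning capabilities.

(2) The code uses the image width (w) and height (h) to convert the normalized bounding box coordinates (0-1000) into pixel coordinates.

(3) The Annotator tool draws the boxes and labels on a copy of the image.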
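With open-ended prompts like this one, the reply may occasionally contain entries that the drawing loop cannot handle, such as a box with the wrong number of values. A defensive filter along the following lines is one way to guard against that. The function name `valid_detections` and the sample data are hypothetical, and the checks assume the box_2d/label structure that the drawing loop relies on.

```python
def valid_detections(items):
    """Keep only entries that have a 4-number box_2d (0-1000) and a non-empty text label."""
    kept = []
    for item in items:
        if not isinstance(item, dict):
            continue
        box = item.get("box_2d")
        label = item.get("label")
        box_ok = (
            isinstance(box, (list, tuple))
            and len(box) == 4
            and all(isinstance(v, (int, float)) and 0 <= v <= 1000 for v in box)
        )
        if box_ok and isinstance(label, str) and label:
            kept.append(item)
    return kept


# Hypothetical parsed reply with one well-formed and two malformed entries
sample = [
    {"box_2d": [10, 20, 300, 400], "label": "potted plant"},
    {"box_2d": [10, 20, 300], "label": "coffee cup"},  # only three coordinates
    {"label": "PC on table"},  # box missing entirely
]
print(len(valid_detections(sample)))  # 1
```

Running the loop over valid_detections(cln_results) instead of cln_results would skip malformed entries rather than raise an exception halfway through drawing.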

Task 3: Image Captioning

Gemini can create text descriptions for images.

import matplotlib.pyplot as plt  # Needed to display the image below

# Define the text prompt
prompt = """
What's inside the image, generate a detailed captioning in the form of short
story, Make 4-5 lines and start each sentence on a new line.
"""
image, _, _ = read_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")  # Read the image; width and height are not needed here
plt.imshow(image)
plt.axis("off")  # Hide axes
plt.show()
print(inference(image, prompt))  # Display the results

Output

[Figure: the input image, displayed above the generated caption text]

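Explanation

(1) The prompt asks the model for a description in a specific style (a short narrative of 4-5 lines, with each sentence on a new line).

(2) The provided image is displayed in the output.

(3) The function returns the generated text. This is useful for creating alt text or summaries.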

Task 4: Optical Character Recognition (OCR)

Gemini can read the text in an image and report where it found that text.

# Define the text prompt
prompt = """
Extract the text from the image
"""
# Fixed, plotting function depends on this.
output_prompt = """
Return just box_2d which will be location of detected text areas + label"""
image, w, h = read_image("https://cdn.mos.cms.futurecdn.net/4sUeciYBZHaLoMa5KiYw7h-1200-80.jpg")  # Read image and extract width, height
results = inference(image, prompt + output_prompt)
# Clean the results and load them as a list
cln_results = json.loads(clean_results(results))
print(cln_results)  # Show the extracted text and box coordinates
annotator = Annotator(image)  # Initialize the Ultralytics annotator
for idx, item in enumerate(cln_results):
    # By default, the Gemini model returns the y coordinates first.
    # Scale normalized box coordinates (0-1000) to image dimensions
    y1, x1, y2, x2 = item["box_2d"]
    y1 = y1 / 1000 * h
    x1 = x1 / 1000 * w
    y2 = y2 / 1000 * h
    x2 = x2 / 1000 * w
    if x1 > x2:
        x1, x2 = x2, x1  # Swap x-coordinates if needed
    if y1 > y2:
        y1, y2 = y2, y1  # Swap y-coordinates if needed
    annotator.box_label([x1, y1, x2, y2], label=item["label"], color=colors(idx, True))
Image.fromarray(annotator.result())  # Display the output

Output

[Figure: the input image with each detected text region boxed and labeled with its text]

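Explanation

(1) It uses a prompt similar to the one for object detection, but asks for the text itself (as the label) instead of an object name.

(2) The code extracts the text and its location, prints the text content, and draws the corresponding bounding boxes on the image.

(3) This is useful for digitizing documents or reading text from signs or labels in photos.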
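When digitizing a document, the detected regions usually need to be stitched back into readable text. The sketch below shows one simple way to do that, assuming each item carries a box_2d in [y1, x1, y2, x2] order and the recognized text in label. The function name `text_in_reading_order`, the `line_tolerance` parameter, and the sample data are hypothetical, and the grouping rule is a heuristic rather than a general layout algorithm.

```python
def text_in_reading_order(items, line_tolerance=20):
    """Join OCR labels top-to-bottom, then left-to-right within a line."""
    # Sort by top edge, then left edge; box_2d is [y1, x1, y2, x2] on a 0-1000 scale
    ordered = sorted(items, key=lambda it: (it["box_2d"][0], it["box_2d"][1]))
    lines, current, current_y = [], [], None
    for item in ordered:
        y1 = item["box_2d"][0]
        # Start a new line when the top edge moves down by more than the tolerance
        if current and y1 - current_y > line_tolerance:
            lines.append(current)
            current = []
        if not current:
            current_y = y1
        current.append(item)
    if current:
        lines.append(current)
    # Within each line, order the boxes by their left edge
    return "\n".join(
        " ".join(it["label"] for it in sorted(line, key=lambda it: it["box_2d"][1]))
        for line in lines
    )


# Hypothetical OCR result with two words on the first line and one phrase below
sample = [
    {"box_2d": [105, 400, 150, 700], "label": "WORLD"},
    {"box_2d": [100, 50, 150, 350], "label": "HELLO"},
    {"box_2d": [300, 50, 350, 500], "label": "second line"},
]
print(text_in_reading_order(sample))
```

For this sample, the two boxes whose top edges differ by only 5 units are treated as one line, so the result is "HELLO WORLD" followed by "second line" on the next line.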

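Conclusion

With simple API calls, Google Gemini for computer vision makes it easy to handle tasks such as object detection, image captioning, and OCR. By sending an image together with clear text instructions, you can guide the model's understanding and get usable, real-time results.

That said, while Gemini is a great fit for general-purpose tasks or quick experiments, it is not always the best choice for highly specialized use cases. For example, when you need to recognize niche objects or have stricter accuracy requirements, the traditional approach still has the edge: collect a dataset, annotate it with a tool such as a YOLO labeler, and train a custom model for your needs.

Original title: How to Use Google Gemini Models for Computer Vision Tasks?, Author: Harsh Mishra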
