vision-control-skill
Install via machina-cli:
npx machina-cli add skill askiichan/vision-control-skill/vision-control-skill --openclaw
Vision Control Skill
Control the user's computer through visual understanding with mouse and keyboard simulation.
Core Workflow
- Capture screen with grid overlay - Take screenshot with coordinate grid
- Analyze visual content - Determine what needs to be clicked
- Identify target coordinates - Find grid cell containing the target
- Execute click - Perform mouse click at the specified location
Scripts
capture_grid.py
Captures a screenshot with grid overlay. By default, saves to a temporary file with automatic timestamping. Optionally accepts a custom output path.
python scripts/capture_grid.py [output_path]
Examples:
python scripts/capture_grid.py # Auto-saves to temp directory
python scripts/capture_grid.py screen_grid.png # Save to specific path
Use this to get a visual representation of the screen with coordinate labels.
click_location.py
Performs mouse clicks at the specified grid coordinate. Supports left-click, right-click, and double-click. Automatically captures a screenshot, performs the action, and cleans up the temporary screenshot file.
python scripts/click_location.py <grid_command> [--action ACTION] [--keep]
Examples:
python scripts/click_location.py S3 # Left click
python scripts/click_location.py S3/2 --action right # Right click at quadrant
python scripts/click_location.py A1 --action double # Double click
python scripts/click_location.py M5 --keep # Click and keep screenshot
Grid format: [COLUMN][ROW] or [COLUMN][ROW]/[QUADRANT]
- Columns: A-Z (left to right)
- Rows: 1-15 (top to bottom)
- Quadrants: 1=top-left, 2=top-right, 3=bottom-left, 4=bottom-right
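The mapping from a grid command to a pixel coordinate is simple arithmetic. The sketch below is an illustration, not the scripts' actual implementation: it assumes the grid divides the screen evenly into 26 columns by 15 rows, and the function name and default resolution are made up for the example.

```python
import re

COLS, ROWS = 26, 15  # A-Z columns, rows 1-15, per the grid format above

# Quadrant -> (x, y) offset as a fraction of the cell, from the cell center
QUADRANT_OFFSETS = {1: (-0.25, -0.25), 2: (0.25, -0.25),
                    3: (-0.25, 0.25), 4: (0.25, 0.25)}

def grid_to_pixel(command, screen_w=1920, screen_h=1080):
    """Convert a grid command like 'S3' or 'S3/2' to a pixel coordinate.

    Without a quadrant, the center of the cell is returned; with one,
    the center of that quarter of the cell.
    """
    m = re.fullmatch(r"([A-Z])(\d{1,2})(?:/([1-4]))?", command.upper())
    if not m:
        raise ValueError(f"invalid grid command: {command!r}")
    col = ord(m.group(1)) - ord("A")   # 0-based column index
    row = int(m.group(2)) - 1          # 0-based row index
    if not (0 <= col < COLS and 0 <= row < ROWS):
        raise ValueError(f"out of range: {command!r}")
    cell_w, cell_h = screen_w / COLS, screen_h / ROWS
    x = (col + 0.5) * cell_w
    y = (row + 0.5) * cell_h
    if m.group(3):                     # optional quadrant refinement
        dx, dy = QUADRANT_OFFSETS[int(m.group(3))]
        x += dx * cell_w
        y += dy * cell_h
    return round(x), round(y)
```

Quadrants exist because a cell can be larger than the element you want to hit; "S3/2" lands a quarter-cell up and to the right of "S3".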
Options:
--action click|right|double: Mouse action type (default: click)
--keep: Preserve the screenshot file instead of auto-deleting (useful for debugging)
mouse_action.py
Comprehensive mouse control script supporting all mouse actions: click, right-click, double-click, move, drag, and scroll.
python scripts/mouse_action.py <action> <args...> [--keep]
Actions:
- click <grid> - Left click at grid location
- right <grid> - Right click at grid location
- double <grid> - Double click at grid location
- move <grid> - Move cursor to grid location (no click)
- drag <from> <to> - Drag from one grid to another
- scroll <grid> <amount> - Scroll at grid location (positive=up, negative=down)
Examples:
python scripts/mouse_action.py click S3 # Left click
python scripts/mouse_action.py right S3/2 # Right click at quadrant
python scripts/mouse_action.py double A1 # Double click
python scripts/mouse_action.py move M8 # Move cursor without clicking
python scripts/mouse_action.py drag S3 T5 # Drag from S3 to T5
python scripts/mouse_action.py scroll M8 5 # Scroll up 5 units
python scripts/mouse_action.py scroll M8 -3 # Scroll down 3 units
keyboard_input.py
Comprehensive keyboard control script supporting text typing, key presses, and keyboard shortcuts.
python scripts/keyboard_input.py <action> <args...>
Actions:
- type <text> - Type text at current cursor position
- press <key> - Press a single key (enter, tab, esc, etc.)
- hotkey <key1+key2+...> - Press keyboard shortcut
- click-type <grid> <text> - Click at grid location then type text
Examples:
python scripts/keyboard_input.py type "Hello World"
python scripts/keyboard_input.py press enter
python scripts/keyboard_input.py press tab
python scripts/keyboard_input.py hotkey ctrl+c
python scripts/keyboard_input.py hotkey ctrl+shift+v
python scripts/keyboard_input.py click-type M8 "search query"
Common keys: enter, tab, esc, space, backspace, delete, up, down, left, right, home, end, pageup, pagedown, f1-f12
vision_control_core.py
Core library providing the VisionControl class with comprehensive mouse and keyboard control methods:
Mouse Methods:
- click_at_grid(command, button='left', clicks=1) - Click at grid location
- right_click_at_grid(command) - Right click at grid location
- double_click_at_grid(command) - Double click at grid location
- move_to_grid(command, duration=0.5) - Move cursor to grid location
- drag_to_grid(from_command, to_command, duration=1.0) - Drag between locations
- scroll_at_grid(command, amount) - Scroll at grid location
Keyboard Methods:
- type_text(text, interval=0.0) - Type text at current position
- press_key(key) - Press a single key
- press_keys(keys) - Press keyboard shortcut (hotkey)
- click_and_type(command, text, interval=0.0) - Click then type
Utility Methods:
- capture_with_grid(output_path=None, auto_temp=True) - Capture screenshot with grid
- cleanup_screenshot() - Clean up temporary screenshot
Can be imported for custom workflows.
Typical Usage Patterns
Basic Click
When a user asks to click something on their screen:
- Run capture_grid.py to get a screenshot with grid overlay (saved to temp directory)
- Present the image to analyze where the target element is located
- Identify the grid cell (e.g., "S3") or cell with quadrant (e.g., "S3/2")
- Run click_location.py or mouse_action.py with the identified coordinate
- Confirm the action was successful
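The steps above can be sketched as a small driver. This is not part of the skill; the runner is injected so the flow can be exercised without a real display, and by default it shells out to the documented scripts.

```python
import subprocess
import sys

def basic_click(grid_command, runner=None):
    """Sketch of the capture -> analyze -> click loop described above.

    `runner` receives the script arguments; the default runs them with
    the current Python interpreter.
    """
    run = runner or (lambda args: subprocess.run(
        [sys.executable] + args, check=True))
    run(["scripts/capture_grid.py"])                   # 1. capture with grid
    # 2-3. visual analysis happens here: inspect the image, pick a cell
    run(["scripts/click_location.py", grid_command])   # 4. click the target
```

Steps 2 and 3 are the model's job, not code: the screenshot is examined and a grid cell is chosen before the second command runs.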
Right-Click Menu
When a user needs to open a context menu:
- Capture screen with capture_grid.py
- Identify the target location
- Run mouse_action.py right <grid> to right-click
Drag and Drop
When a user needs to drag something:
- Capture screen with capture_grid.py
- Identify source and target locations
- Run mouse_action.py drag <from> <to>
Scrolling
When a user needs to scroll:
- Capture screen with capture_grid.py
- Identify where to scroll
- Run mouse_action.py scroll <grid> <amount> (positive=up, negative=down)
Typing Text
When a user needs to type or enter text:
- For typing at the current cursor position: run keyboard_input.py type "text"
- For filling a form field:
  - Capture screen with capture_grid.py
  - Identify the input field location
  - Run keyboard_input.py click-type <grid> "text"
Keyboard Shortcuts
When a user needs to use keyboard shortcuts:
- Run keyboard_input.py hotkey <keys> (e.g., ctrl+c, ctrl+v, alt+tab)
- Common shortcuts: Copy (ctrl+c), Paste (ctrl+v), Save (ctrl+s), Undo (ctrl+z)
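A hotkey argument like ctrl+shift+v is just a '+'-separated key sequence. A minimal parser might look like the following; the MODIFIERS set and the modifiers-before-final-key rule are assumptions for the sketch, not necessarily what keyboard_input.py enforces.

```python
# Modifier names accepted by this sketch; the real key list is longer
# (see keyboard_actions.md).
MODIFIERS = {"ctrl", "shift", "alt", "cmd", "win"}

def parse_hotkey(combo):
    """Split a combo like 'ctrl+shift+v' into the key sequence to press.

    Modifiers must come before the final key, mirroring how shortcuts
    are normally written.
    """
    keys = [k.strip().lower() for k in combo.split("+") if k.strip()]
    if not keys:
        raise ValueError("empty hotkey")
    for k in keys[:-1]:
        if k not in MODIFIERS:
            raise ValueError(f"non-modifier before final key: {k!r}")
    return keys
```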
Screenshot Management
The skill uses automatic temporary file management:
- Auto-temp storage: Screenshots are saved to the system temp directory with timestamps
- Auto-cleanup: click_location.py automatically deletes screenshots after clicking
- Privacy-friendly: No screenshot clutter on disk
- Debug mode: Use the --keep flag to preserve screenshots when troubleshooting
Temp file format: vision_control_YYYYMMDD_HHMMSS_microseconds.png
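The documented naming scheme is easy to reproduce; this sketch (the function name is illustrative) shows how such a path could be built with the standard library.

```python
import datetime
import os
import tempfile

def temp_screenshot_path():
    """Build a timestamped path matching the documented naming scheme:
    vision_control_YYYYMMDD_HHMMSS_microseconds.png in the system temp dir.
    """
    now = datetime.datetime.now()
    name = f"vision_control_{now:%Y%m%d_%H%M%S}_{now.microsecond}.png"
    return os.path.join(tempfile.gettempdir(), name)
```

Because the microsecond component is included, two captures in the same second still get distinct filenames.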
Reference Documentation
- Grid System: See grid_system.md for coordinate system and quadrant layout
- Keyboard Actions: See keyboard_actions.md for comprehensive keyboard reference, key names, and shortcuts
Dependencies
Required Python packages:
- pyautogui - Mouse control
- mss - Fast screenshot capture
- Pillow (PIL) - Image processing
Install with:
pip install pyautogui mss Pillow
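Before running the scripts you can check that the dependencies are importable; note that Pillow's import name is PIL, not Pillow. This preflight helper is a hypothetical convenience, not part of the skill.

```python
import importlib.util

# Module names as imported, which can differ from the pip package name.
REQUIRED = {"pyautogui": "pyautogui", "mss": "mss", "Pillow": "PIL"}

def missing_packages():
    """Return the pip names of required packages that are not importable."""
    return [pip for pip, mod in REQUIRED.items()
            if importlib.util.find_spec(mod) is None]
```

If missing_packages() returns a non-empty list, install the listed names with pip before using the scripts.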
Supported Actions
Mouse Actions
- ✅ Left Click - Single click with left mouse button
- ✅ Right Click - Open context menus
- ✅ Double Click - Open files or select text
- ✅ Move - Position cursor without clicking
- ✅ Drag - Drag and drop between locations
- ✅ Scroll - Scroll up or down at a location
Keyboard Actions
- ✅ Type Text - Enter text at cursor position
- ✅ Press Key - Press single keys (enter, tab, esc, etc.)
- ✅ Keyboard Shortcuts - Execute hotkeys (ctrl+c, alt+tab, etc.)
- ✅ Click and Type - Click location then type text
Safety Considerations
- This skill performs actual mouse actions on the user's system
- Always confirm the target location before performing actions
- Be aware that actions may trigger immediate effects in applications
- Consider asking for user confirmation before critical or destructive actions
- Drag operations may take 1+ seconds to complete
Source
git clone https://github.com/askiichan/vision-control-skill
The skill definition lives at skills/vision-control-skill/SKILL.md in the repository.
Overview
Vision Control Skill lets AI operate your computer by analyzing screen content and simulating mouse and keyboard actions. It captures screenshots with a coordinate grid, identifies targets, and executes precise clicks, typing, dragging, scrolling, and shortcuts to emulate real user interactions.
How This Skill Works
Core workflow: capture the screen with a grid, analyze the visual content to find what to click, determine the target grid coordinates, and execute the requested mouse or keyboard action. It provides scripts like capture_grid.py, click_location.py, mouse_action.py, and keyboard_input.py to perform these steps and clean up temporary screenshots after actions.
When to Use It
- User requests the AI to control the computer directly and perform GUI interactions.
- Need to click on on-screen elements such as buttons, links, or menus based on their visual position.
- Require typing text or sending keyboard input into forms, fields, or editors.
- Execute keyboard shortcuts or combined actions to operate applications or scripts.
- Automate visible UI tasks that involve dragging, scrolling, or multi-step GUI flows.
Quick Start
- Step 1: Run python scripts/capture_grid.py to capture the screen with a coordinate grid.
- Step 2: Use python scripts/click_location.py <grid> to click a target, or python scripts/mouse_action.py to perform complex actions.
- Step 3: If you need text input or shortcuts, run python scripts/keyboard_input.py <action> <args...> to type or press keys.
Best Practices
- Always start with capture_grid.py to generate a grid-backed screenshot for reliable targeting.
- Target absolute grid coordinates rather than relative movements to reduce drift.
- Verify the target is visible and not obscured before clicking; consider re-capturing if UI changes.
- Let mouse_action.py handle complex actions (drag, scroll) and keep actions atomic when possible.
- Use keyboard_input.py for text entry and hotkeys to ensure consistent input across apps.
Example Use Cases
- Open a web app, locate and click the login button, then type username and password into fields.
- Navigate a desktop application by selecting menus, expanding options, and confirming dialogs via clicks and keyboard shortcuts.
- Fill out a form in a browser by typing text into fields and submitting with a hotkey combo.
- Drag a file from a folder view to another window and scroll through lists to reach the target item.
- Scroll a document or feed to load more content and interact with elements revealed by the scroll.