vision-control-skill
Install via machina-cli:
npx machina-cli add skill askiichan/vision-control-skill/vision-control-skill --openclaw
Vision Control Skill
Control the user's computer through visual understanding with mouse and keyboard simulation.
Core Workflow
- Capture screen with grid overlay - Take screenshot with coordinate grid
- Analyze visual content - Determine what needs to be clicked
- Identify target coordinates - Find grid cell containing the target
- Execute click - Perform mouse click at the specified location
Scripts
capture_grid.py
Captures a screenshot with grid overlay. By default, saves to a temporary file with automatic timestamping. Optionally accepts a custom output path.
python scripts/capture_grid.py [output_path]
Examples:
python scripts/capture_grid.py # Auto-saves to temp directory
python scripts/capture_grid.py screen_grid.png # Save to specific path
Use this to get a visual representation of the screen with coordinate labels.
click_location.py
Performs mouse clicks at the specified grid coordinate. Supports left-click, right-click, and double-click. Automatically captures a screenshot, performs the action, and cleans up the temporary screenshot file.
python scripts/click_location.py <grid_command> [--action ACTION] [--keep]
Examples:
python scripts/click_location.py S3 # Left click
python scripts/click_location.py S3/2 --action right # Right click at quadrant
python scripts/click_location.py A1 --action double # Double click
python scripts/click_location.py M5 --keep # Click and keep screenshot
Grid format: [COLUMN][ROW] or [COLUMN][ROW]/[QUADRANT]
- Columns: A-Z (left to right)
- Rows: 1-15 (top to bottom)
- Quadrants: 1=top-left, 2=top-right, 3=bottom-left, 4=bottom-right
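The mapping from a grid command to a pixel coordinate is simple arithmetic. The sketch below is an illustration, not the scripts' actual implementation: it assumes the grid divides the screen evenly into 26 columns by 15 rows, and the function name and default resolution are made up for the example.

```python
import re

COLS, ROWS = 26, 15  # A-Z columns, rows 1-15, per the grid format above

# Quadrant -> (x, y) offset as a fraction of the cell, from the cell center
QUADRANT_OFFSETS = {1: (-0.25, -0.25), 2: (0.25, -0.25),
                    3: (-0.25, 0.25), 4: (0.25, 0.25)}

def grid_to_pixel(command, screen_w=1920, screen_h=1080):
    """Convert a grid command like 'S3' or 'S3/2' to a pixel coordinate.

    Without a quadrant, the center of the cell is returned; with one,
    the center of that quarter of the cell.
    """
    m = re.fullmatch(r"([A-Z])(\d{1,2})(?:/([1-4]))?", command.upper())
    if not m:
        raise ValueError(f"invalid grid command: {command!r}")
    col = ord(m.group(1)) - ord("A")   # 0-based column index
    row = int(m.group(2)) - 1          # 0-based row index
    if not (0 <= col < COLS and 0 <= row < ROWS):
        raise ValueError(f"out of range: {command!r}")
    cell_w, cell_h = screen_w / COLS, screen_h / ROWS
    x = (col + 0.5) * cell_w
    y = (row + 0.5) * cell_h
    if m.group(3):                     # optional quadrant refinement
        dx, dy = QUADRANT_OFFSETS[int(m.group(3))]
        x += dx * cell_w
        y += dy * cell_h
    return round(x), round(y)
```

Quadrants exist because a cell can be larger than the element you want to hit; "S3/2" lands a quarter-cell up and to the right of "S3".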
Options:
--action click|right|double: Mouse action type (default: click)
--keep: Preserve the screenshot file instead of auto-deleting (useful for debugging)
mouse_action.py
Comprehensive mouse control script supporting all mouse actions: click, right-click, double-click, move, drag, and scroll.
python scripts/mouse_action.py <action> <args...> [--keep]
Actions:
- click <grid> - Left click at grid location
- right <grid> - Right click at grid location
- double <grid> - Double click at grid location
- move <grid> - Move cursor to grid location (no click)
- drag <from> <to> - Drag from one grid to another
- scroll <grid> <amount> - Scroll at grid location (positive=up, negative=down)
Examples:
python scripts/mouse_action.py click S3 # Left click
python scripts/mouse_action.py right S3/2 # Right click at quadrant
python scripts/mouse_action.py double A1 # Double click
python scripts/mouse_action.py move M8 # Move cursor without clicking
python scripts/mouse_action.py drag S3 T5 # Drag from S3 to T5
python scripts/mouse_action.py scroll M8 5 # Scroll up 5 units
python scripts/mouse_action.py scroll M8 -3 # Scroll down 3 units
keyboard_input.py
Comprehensive keyboard control script supporting text typing, key presses, and keyboard shortcuts.
python scripts/keyboard_input.py <action> <args...>
Actions:
- type <text> - Type text at current cursor position
- press <key> - Press a single key (enter, tab, esc, etc.)
- hotkey <key1+key2+...> - Press keyboard shortcut
- click-type <grid> <text> - Click at grid location then type text
Examples:
python scripts/keyboard_input.py type "Hello World"
python scripts/keyboard_input.py press enter
python scripts/keyboard_input.py press tab
python scripts/keyboard_input.py hotkey ctrl+c
python scripts/keyboard_input.py hotkey ctrl+shift+v
python scripts/keyboard_input.py click-type M8 "search query"
Common keys: enter, tab, esc, space, backspace, delete, up, down, left, right, home, end, pageup, pagedown, f1-f12
vision_control_core.py
Core library providing the VisionControl class with comprehensive mouse and keyboard control methods:
Mouse Methods:
- click_at_grid(command, button='left', clicks=1) - Click at grid location
- right_click_at_grid(command) - Right click at grid location
- double_click_at_grid(command) - Double click at grid location
- move_to_grid(command, duration=0.5) - Move cursor to grid location
- drag_to_grid(from_command, to_command, duration=1.0) - Drag between locations
- scroll_at_grid(command, amount) - Scroll at grid location
Keyboard Methods:
- type_text(text, interval=0.0) - Type text at current position
- press_key(key) - Press a single key
- press_keys(keys) - Press keyboard shortcut (hotkey)
- click_and_type(command, text, interval=0.0) - Click then type
Utility Methods:
- capture_with_grid(output_path=None, auto_temp=True) - Capture screenshot with grid
- cleanup_screenshot() - Clean up temporary screenshot
Can be imported for custom workflows.
Typical Usage Patterns
Basic Click
When a user asks to click something on their screen:
- Run capture_grid.py to get a screenshot with grid overlay (saved to temp directory)
- Present the image to analyze where the target element is located
- Identify the grid cell (e.g., "S3") or cell with quadrant (e.g., "S3/2")
- Run click_location.py or mouse_action.py with the identified coordinate
- Confirm the action was successful
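The steps above can be sketched as a small driver. This is not part of the skill; the runner is injected so the flow can be exercised without a real display, and by default it shells out to the documented scripts.

```python
import subprocess
import sys

def basic_click(grid_command, runner=None):
    """Sketch of the capture -> analyze -> click loop described above.

    `runner` receives the script arguments; the default runs them with
    the current Python interpreter.
    """
    run = runner or (lambda args: subprocess.run(
        [sys.executable] + args, check=True))
    run(["scripts/capture_grid.py"])                   # 1. capture with grid
    # 2-3. visual analysis happens here: inspect the image, pick a cell
    run(["scripts/click_location.py", grid_command])   # 4. click the target
```

Steps 2 and 3 are the model's job, not code: the screenshot is examined and a grid cell is chosen before the second command runs.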
Right-Click Menu
When a user needs to open a context menu:
- Capture screen with capture_grid.py
- Identify the target location
- Run mouse_action.py right <grid> to right-click
Drag and Drop
When a user needs to drag something:
- Capture screen with capture_grid.py
- Identify source and target locations
- Run mouse_action.py drag <from> <to>
Scrolling
When a user needs to scroll:
- Capture screen with capture_grid.py
- Identify where to scroll
- Run mouse_action.py scroll <grid> <amount> (positive=up, negative=down)
Typing Text
When a user needs to type or enter text:
- For typing at the current cursor position: run keyboard_input.py type "text"
- For filling a form field:
  - Capture screen with capture_grid.py
  - Identify the input field location
  - Run keyboard_input.py click-type <grid> "text"
Keyboard Shortcuts
When a user needs to use keyboard shortcuts:
- Run keyboard_input.py hotkey <keys> (e.g., ctrl+c, ctrl+v, alt+tab)
- Common shortcuts: Copy (ctrl+c), Paste (ctrl+v), Save (ctrl+s), Undo (ctrl+z)
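A hotkey argument like ctrl+shift+v is just a '+'-separated key sequence. A minimal parser might look like the following; the MODIFIERS set and the modifiers-before-final-key rule are assumptions for the sketch, not necessarily what keyboard_input.py enforces.

```python
# Modifier names accepted by this sketch; the real key list is longer
# (see keyboard_actions.md).
MODIFIERS = {"ctrl", "shift", "alt", "cmd", "win"}

def parse_hotkey(combo):
    """Split a combo like 'ctrl+shift+v' into the key sequence to press.

    Modifiers must come before the final key, mirroring how shortcuts
    are normally written.
    """
    keys = [k.strip().lower() for k in combo.split("+") if k.strip()]
    if not keys:
        raise ValueError("empty hotkey")
    for k in keys[:-1]:
        if k not in MODIFIERS:
            raise ValueError(f"non-modifier before final key: {k!r}")
    return keys
```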
Screenshot Management
The skill uses automatic temporary file management:
- Auto-temp storage: Screenshots are saved to the system temp directory with timestamps
- Auto-cleanup: click_location.py automatically deletes screenshots after clicking
- Privacy-friendly: No screenshot clutter on disk
- Debug mode: Use the --keep flag to preserve screenshots when troubleshooting
Temp file format: vision_control_YYYYMMDD_HHMMSS_microseconds.png
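The documented naming scheme is easy to reproduce; this sketch (the function name is illustrative) shows how such a path could be built with the standard library.

```python
import datetime
import os
import tempfile

def temp_screenshot_path():
    """Build a timestamped path matching the documented naming scheme:
    vision_control_YYYYMMDD_HHMMSS_microseconds.png in the system temp dir.
    """
    now = datetime.datetime.now()
    name = f"vision_control_{now:%Y%m%d_%H%M%S}_{now.microsecond}.png"
    return os.path.join(tempfile.gettempdir(), name)
```

Because the microsecond component is included, two captures in the same second still get distinct filenames.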
Reference Documentation
- Grid System: See grid_system.md for coordinate system and quadrant layout
- Keyboard Actions: See keyboard_actions.md for comprehensive keyboard reference, key names, and shortcuts
Dependencies
Required Python packages:
- pyautogui - Mouse control
- mss - Fast screenshot capture
- Pillow (PIL) - Image processing
Install with:
pip install pyautogui mss Pillow
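Before running the scripts you can check that the dependencies are importable; note that Pillow's import name is PIL, not Pillow. This preflight helper is a hypothetical convenience, not part of the skill.

```python
import importlib.util

# Module names as imported, which can differ from the pip package name.
REQUIRED = {"pyautogui": "pyautogui", "mss": "mss", "Pillow": "PIL"}

def missing_packages():
    """Return the pip names of required packages that are not importable."""
    return [pip for pip, mod in REQUIRED.items()
            if importlib.util.find_spec(mod) is None]
```

If missing_packages() returns a non-empty list, install the listed names with pip before using the scripts.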
Supported Actions
Mouse Actions
- ✅ Left Click - Single click with left mouse button
- ✅ Right Click - Open context menus
- ✅ Double Click - Open files or select text
- ✅ Move - Position cursor without clicking
- ✅ Drag - Drag and drop between locations
- ✅ Scroll - Scroll up or down at a location
Keyboard Actions
- ✅ Type Text - Enter text at cursor position
- ✅ Press Key - Press single keys (enter, tab, esc, etc.)
- ✅ Keyboard Shortcuts - Execute hotkeys (ctrl+c, alt+tab, etc.)
- ✅ Click and Type - Click location then type text
Safety Considerations
- This skill performs actual mouse actions on the user's system
- Always confirm the target location before performing actions
- Be aware that actions may trigger immediate effects in applications
- Consider asking for user confirmation before critical or destructive actions
- Drag operations may take 1+ seconds to complete
Source
git clone https://github.com/askiichan/vision-control-skill
The skill definition lives at skills/vision-control-skill/SKILL.md in the repository.
Overview
Vision Control Skill lets AI operate your computer by analyzing screen content and simulating mouse and keyboard actions. It captures screenshots with a coordinate grid, identifies targets, and executes precise clicks, typing, dragging, scrolling, and shortcuts to emulate real user interactions.
How This Skill Works
Core workflow: capture the screen with a grid, analyze the visual content to find what to click, determine the target grid coordinates, and execute the requested mouse or keyboard action. It provides scripts like capture_grid.py, click_location.py, mouse_action.py, and keyboard_input.py to perform these steps and clean up temporary screenshots after actions.
When to Use It
- User requests the AI to control the computer directly and perform GUI interactions.
- Need to click on on-screen elements such as buttons, links, or menus based on their visual position.
- Require typing text or sending keyboard input into forms, fields, or editors.
- Execute keyboard shortcuts or combined actions to operate applications or scripts.
- Automate visible UI tasks that involve dragging, scrolling, or multi-step GUI flows.
Quick Start
- Step 1: Run python scripts/capture_grid.py to capture the screen with a coordinate grid.
- Step 2: Use python scripts/click_location.py <grid> to click a target, or python scripts/mouse_action.py to perform complex actions.
- Step 3: If you need text input or shortcuts, run python scripts/keyboard_input.py <action> <args...> to type or press keys.
Best Practices
- Always start with capture_grid.py to generate a grid-backed screenshot for reliable targeting.
- Target absolute grid coordinates rather than relative movements to reduce drift.
- Verify the target is visible and not obscured before clicking; consider re-capturing if UI changes.
- Let mouse_action.py handle complex actions (drag, scroll) and keep actions atomic when possible.
- Use keyboard_input.py for text entry and hotkeys to ensure consistent input across apps.
Example Use Cases
- Open a web app, locate and click the login button, then type username and password into fields.
- Navigate a desktop application by selecting menus, expanding options, and confirming dialogs via clicks and keyboard shortcuts.
- Fill out a form in a browser by typing text into fields and submitting with a hotkey combo.
- Drag a file from a folder view to another window and scroll through lists to reach the target item.
- Scroll a document or feed to load more content and interact with elements revealed by the scroll.