How does the Computer Use Agent ensure safety?

It includes a FAILSAFE mode that aborts the run when you move the mouse to the upper-left corner, plus configurable action delays, step limits, optional confirmation before actions, and screenshot logging for auditing.

Can it automate any desktop application?

Largely, yes. Because it locates on-screen UI elements visually, it can control native apps, file explorers, and system dialogs that expose no API. Accuracy depends on how well the vision model can read the screen.

How do I configure limits and logging?

Set the action delay and the maximum step count, and enable screenshot logging, in the skill's configuration (action_delay, max_steps, screenshot_logging). The saved screenshots provide an audit trail for debugging and verification.

computer-use-agent

npx machina-cli add skill wrm3/ai_project_template/computer-use-agent --openclaw

Computer Use Agent Skill

AI-powered desktop automation that uses vision models to understand your screen and execute actions to accomplish tasks.

Overview

The Computer Use Agent Skill enables automated control of desktop applications through:

Screenshot Analysis: Captures and analyzes your screen using AI vision
Intelligent Actions: AI determines what to click, type, or interact with
Multi-Step Workflows: Executes complex tasks across multiple steps
Safety Controls: Built-in failsafes and rate limiting

Unlike browser automation, this skill can control any desktop application, including native apps, file explorers, and system dialogs.

When to Use This Skill

Activate this skill when the user needs to:

Automate repetitive desktop tasks
Fill out forms in native applications
Navigate complex desktop UIs
Test desktop software workflows
Control applications without APIs
Perform GUI automation tasks
Automate multi-application workflows

Capabilities

Desktop Control

Click Actions: Click buttons, links, menu items at precise coordinates
Text Input: Type text into forms, text fields, search boxes
Keyboard Control: Press Enter, Tab, Ctrl+C, and other key combinations
Mouse Movement: Move mouse cursor to specific locations
Scrolling: Scroll windows up, down, left, right
Drag and Drop: Drag elements from one location to another

Workflow Features

Multi-Step Tasks: Execute sequences of actions to complete complex tasks
Screen Analysis: AI vision understands what's on screen
Action Planning: AI determines best sequence of actions
Error Recovery: Detect and recover from unexpected states
Progress Tracking: Monitor task progress through multiple steps

Safety Features

FAILSAFE Mode: Move mouse to upper-left corner to immediately abort
Action Delays: Configurable pause between actions (default 0.5s)
Step Limits: Maximum steps to prevent infinite loops
Confirmation Mode: Optional user confirmation before executing actions
Screenshot Logging: Save screenshots for audit trail and debugging
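The FAILSAFE check and the action delay can be pictured with a minimal sketch. It assumes the abort corner is the screen origin (0, 0). The names FailsafeTriggered, guarded_action, and get_mouse_position are hypothetical and are not taken from the skill's API.

```python
import time
from typing import Callable, Tuple

Point = Tuple[int, int]


class FailsafeTriggered(Exception):
    """Raised when the pointer is found in the abort corner (hypothetical name)."""


def guarded_action(
    action: Callable[[], None],
    get_mouse_position: Callable[[], Point],
    failsafe_corner: Point = (0, 0),  # assumed: upper-left corner of the screen
    action_delay: float = 0.5,        # seconds, matching the documented default
    failsafe_enabled: bool = True,
) -> None:
    """Run one action, aborting first if the pointer sits in the failsafe corner."""
    if failsafe_enabled and get_mouse_position() == failsafe_corner:
        raise FailsafeTriggered("pointer in failsafe corner; aborting")
    action()
    time.sleep(action_delay)  # give the UI time to respond before the next step
```

Checking the pointer before every action is what makes the abort feel immediate: the run stops at the next step boundary rather than at the end of the task.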

How It Works

Basic Workflow

User describes the task to accomplish
Skill captures screenshot of current screen
Screenshot analyzed by AI vision model
AI recommends next action (click, type, scroll, etc.)
Action executed via desktop automation tools
Process repeats until task complete or max steps reached
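The steps above amount to a bounded loop. The sketch below is illustrative only: capture, recommend, and execute are hypothetical callables standing in for the screenshot tool, the vision model, and the desktop automation layer, and the "done" action kind is an assumed way for the model to signal completion.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str          # e.g. "click", "type", "scroll", or "done"
    detail: str = ""   # free-form description of the target or text


def run_task(
    task: str,
    capture: Callable[[], bytes],
    recommend: Callable[[str, bytes], Action],
    execute: Callable[[Action], None],
    max_steps: int = 50,
) -> List[Action]:
    """Repeat capture -> analyze -> act until "done" is reported or max_steps is hit."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()                 # capture the current screen
        action = recommend(task, screenshot)   # ask the model for the next action
        if action.kind == "done":
            break
        execute(action)                        # perform the click, keystroke, etc.
        history.append(action)
    return history
```

Because the loop is bounded by max_steps, a model that never reports completion still terminates, which is the purpose of the step limit.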

Vision-Guided Automation

The skill uses advanced AI vision to:

Identify UI elements (buttons, forms, menus)
Read text on screen
Understand application state
Determine appropriate actions
Verify task completion

MCP Tools Available

This skill integrates with the fstrent_mcp_computer_use MCP server, which provides:

computer_use_run_task: Execute a complete multi-step desktop automation task
computer_use_screenshot: Capture screenshot of current screen
computer_use_click: Click at specific screen coordinates
computer_use_type: Type text into focused element
computer_use_key_press: Press specific keyboard keys
computer_use_scroll: Scroll in specified direction
computer_use_mouse_move: Move mouse to coordinates

All tools include built-in safety features and error handling.
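MCP clients invoke tools with a JSON-RPC 2.0 tools/call request. As a rough illustration, the sketch below builds such a request for computer_use_click. The argument names x and y are assumptions made for the example; the server's published tool schema is the authority on the real parameter names.

```python
import json
from typing import Any, Dict


def build_tool_call(request_id: int, tool: str, arguments: Dict[str, Any]) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message)


# Argument names "x" and "y" are assumed for illustration only.
print(build_tool_call(1, "computer_use_click", {"x": 640, "y": 360}))
```

In practice an MCP client library normally builds and sends this message for you, so the sketch is mainly useful for reading logs or debugging a connection.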

Usage Examples

Example 1: Automate Form Filling

User: "Fill out the customer feedback form with a 5-star rating and the comment 'Great service!'"

Workflow:
1. Capture screenshot to locate form
2. Click on rating field
3. Click 5-star rating
4. Click on comment field
5. Type "Great service!"
6. Click submit button
7. Verify submission complete

Example 2: Navigate Application Menu

User: "Open the File menu and select 'Export as PDF'"

Workflow:
1. Capture screenshot to locate File menu
2. Click "File" menu
3. Wait for menu to open
4. Capture screenshot of open menu
5. Click "Export as PDF"
6. Verify dialog opened

Example 3: Desktop Cleanup

User: "Move all files from Downloads to Documents folder"

Workflow:
1. Open File Explorer
2. Navigate to Downloads folder
3. Select all files
4. Cut files (Ctrl+X)
5. Navigate to Documents
6. Paste files (Ctrl+V)
7. Verify files moved

Safety Guidelines

Critical Safety Considerations

This skill controls your actual computer. Always follow these safety guidelines:

Start Simple: Test with simple, low-risk tasks first
Watch Closely: Monitor the automation as it runs
Use FAILSAFE: Move mouse to upper-left corner to abort anytime
Avoid Sensitive Data: Don't automate tasks involving passwords or sensitive info
Test in Safe Environments: Use test accounts or sandboxed environments
Verify Actions: Use confirmation mode for critical tasks
Backup Important Data: Always backup before automating file operations

Recommended Settings

For First-Time Use:

Enable confirmation mode
Use slower action delays (1-2 seconds)
Set lower max steps (10-15)
Enable screenshot logging

For Production Use:

Disable confirmation for trusted workflows
Use standard delays (0.5 seconds)
Set appropriate max steps for task complexity
Enable screenshot logging for audit trails

Configuration

Safety Settings

# Recommended safety configuration
failsafe_enabled: true          # Enable FAILSAFE (move mouse to corner to abort)
action_delay: 0.5               # Seconds between actions (0.5-2.0)
max_steps: 50                   # Maximum automation steps
confirmation_mode: false        # Require confirmation before each action
screenshot_logging: true        # Save screenshots for audit trail
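One way to catch mistyped values early is to validate these settings when they are loaded. The sketch below mirrors the setting names and defaults listed above. Treating the 0.5-2.0 range as a hard bound is an assumption, and SafetySettings is a hypothetical class, not part of the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetySettings:
    failsafe_enabled: bool = True
    action_delay: float = 0.5        # seconds between actions
    max_steps: int = 50
    confirmation_mode: bool = False
    screenshot_logging: bool = True

    def __post_init__(self) -> None:
        # Assumed bounds, taken from the range noted in the configuration comment.
        if not 0.5 <= self.action_delay <= 2.0:
            raise ValueError("action_delay should be between 0.5 and 2.0 seconds")
        if self.max_steps < 1:
            raise ValueError("max_steps must be at least 1")


# A cautious profile in the spirit of the first-time-use recommendations.
first_run = SafetySettings(action_delay=2.0, max_steps=10, confirmation_mode=True)
```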

Performance Settings

# Performance tuning
screenshot_quality: 80          # JPEG quality (60-95)
vision_model: gpt-4o            # Vision model to use
retry_on_error: true            # Retry failed actions
max_retries: 3                  # Maximum retry attempts
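The retry settings can be read as a simple wrapper around each action. This is a sketch under one assumption: max_retries counts additional attempts after the first failure. The function name run_with_retries is hypothetical.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    action: Callable[[], T],
    retry_on_error: bool = True,
    max_retries: int = 3,
) -> T:
    """Call action once, then retry up to max_retries more times if it raises."""
    attempts = 1 + (max(0, max_retries) if retry_on_error else 0)
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the last error to the caller
    raise AssertionError("unreachable: the loop always returns or raises")
```

With the defaults shown, a failing action is attempted four times in total before the error is surfaced.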

Integration Requirements

MCP Server Setup

Ensure fstrent_mcp_computer_use is configured in .mcp.json:

{
  "mcpServers": {
    "fstrent_mcp_computer_use": {
      "command": "uv",
      "args": [
        "--directory",
        "C:\\.ai\\mcps\\fstrent_mcp_computer_use",
        "run",
        "fstrent_mcp_computer_use"
      ]
    }
  }
}
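If the server fails to start, a quick sanity check of the configuration file can rule out typos. The sketch below is a hypothetical helper (check_mcp_config is not part of the skill) that only inspects the structure shown above.

```python
import json
from typing import List

SERVER_NAME = "fstrent_mcp_computer_use"


def check_mcp_config(text: str) -> List[str]:
    """Return the problems found in .mcp.json text (an empty list means none found)."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(config, dict) or not isinstance(config.get("mcpServers"), dict):
        return [f"missing mcpServers entry for {SERVER_NAME}"]
    server = config["mcpServers"].get(SERVER_NAME)
    if not isinstance(server, dict):
        return [f"missing mcpServers entry for {SERVER_NAME}"]
    problems: List[str] = []
    if not server.get("command"):
        problems.append("server entry has no 'command'")
    if not isinstance(server.get("args"), list):
        problems.append("server entry 'args' should be a list")
    return problems
```

A typical use would be to read the file with pathlib.Path(".mcp.json").read_text() and print whatever the function returns.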

Required Dependencies

The MCP server handles its own dependencies. Beyond the uv command used in the configuration above and the API key described below, no additional setup is required.

API Keys

Requires an OpenAI API key with access to a vision model (gpt-4o or gpt-4-turbo).

Troubleshooting

Common Issues

Automation Not Starting

Verify MCP server is running
Check OpenAI API key is configured
Ensure pyautogui is not blocked by OS permissions (on macOS, for example, Accessibility and Screen Recording access)

Actions Missing Target

Increase action delay for slower UIs
Verify screen resolution matches coordinates
Check for UI scaling (125%, 150%, etc.)
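Scaling problems usually mean the screenshot and the click coordinates use different pixel grids. As a sketch of the arithmetic, assuming you know both the screenshot size and the screen size reported to the automation layer, a point can be rescaled like this (screenshot_to_screen is a hypothetical helper):

```python
from typing import Tuple


def screenshot_to_screen(
    point: Tuple[int, int],
    screenshot_size: Tuple[int, int],
    screen_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map a pixel in the screenshot to the coordinate space used for clicking."""
    x = point[0] * screen_size[0] / screenshot_size[0]
    y = point[1] * screen_size[1] / screenshot_size[1]
    return round(x), round(y)


# A 2880x1620 capture on a display that reports 1920x1080 (150% scaling).
print(screenshot_to_screen((1440, 810), (2880, 1620), (1920, 1080)))  # (960, 540)
```

If clicks land consistently offset by a fixed ratio such as 1.25 or 1.5, a mismatch of this kind is the likely cause.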

FAILSAFE Triggering Accidentally

Adjust FAILSAFE corner location
Increase action delay
Move mouse away from corners during setup

Vision Model Not Understanding Screen

Use higher screenshot quality
Ensure text is legible in screenshot
Verify adequate screen contrast
Try describing task in more detail

Comparison to Other Automation

Computer Use Agent vs Browser Automation

Computer Use: Controls entire desktop, any application
Browser Automation: Only controls web browsers and web pages
Use Computer Use For: Native apps, system dialogs, file operations
Use Browser Automation For: Web scraping, web app testing, form submission

Computer Use Agent vs MCP Tools

Computer Use: AI-guided vision-based automation
MCP Tools: Direct programmatic control
Use Computer Use For: Complex UIs, visual workflows, unpredictable states
Use MCP Tools For: Databases, APIs, file systems, structured data

Advanced Features

Multi-Application Workflows

Automate tasks spanning multiple applications:

Example: "Export data from Excel, import to database, generate PDF report"
1. Open Excel, select data range, export CSV
2. Open database tool, import CSV
3. Run report query, export to PDF

Error Recovery

The skill can detect and recover from errors:

Detect unexpected dialogs and close them
Retry failed actions with different parameters
Navigate back to known good state
Alert user if unrecoverable error occurs

Screenshot Analysis

AI vision can:

Read text from screenshots
Identify UI element types
Determine application state
Locate specific elements
Verify expected outcomes

Best Practices

Start with Simple Tasks: Build confidence with basic workflows
Use Descriptive Task Descriptions: Clear, specific descriptions help the AI understand the task
Monitor First Runs: Watch automation closely the first time
Enable Logging: Screenshot logs help debug issues
Test in Safe Environments: Use test accounts and non-production systems
Document Workflows: Save successful task descriptions for reuse
Use Appropriate Delays: Faster isn't always better - allow UIs to respond
Set Realistic Step Limits: Complex tasks need more steps
Keep Tasks Focused: Break complex workflows into smaller tasks
Review Audit Trails: Check screenshot logs periodically

Reference Materials

For detailed implementation information, see:

reference/safety_guidelines.md - Comprehensive safety documentation
reference/action_types.md - All supported action types
reference/troubleshooting.md - Common issues and solutions
examples/automation_workflows.md - Example workflows

Related Skills

web-tools: For browser automation and web scraping
database-tools: For database operations
file-operations: For file system automation

Note: This is a powerful skill that controls your actual computer. Always prioritize safety, use FAILSAFE mode, and test in safe environments. Start simple and build confidence before attempting complex automations.

Source

https://github.com/wrm3/ai_project_template/blob/main/.claude/skills/computer-use-agent/SKILL.md

Overview

AI-powered desktop automation that analyzes screenshots with vision models to decide where to click, type, or interact. It can control any desktop application, enabling multi-step workflows across native apps, file explorers, and system dialogs, with built-in safety and audit features.

How This Skill Works

Workflow: describe task, capture screenshot, analyze with AI vision, select next action, and execute. It repeats until completion or a configured max step limit. Vision-guided automation uses AI to identify UI elements, read text, understand app state, and verify success.

When to Use It

Automate repetitive desktop tasks
Fill out forms in native applications
Navigate complex desktop UIs across multiple apps
Test desktop software workflows without APIs
Automate multi-application GUI tasks

Quick Start

Step 1: Describe the task you want automated.
Step 2: Let the skill capture a screenshot and analyze the UI with AI vision.
Step 3: The AI selects actions (click, type, scroll) and executes them until the task completes or max steps are reached.

Best Practices

Start with small, well-scoped tasks to tune action planning.
Enable safety features: failsafe, action delays, and step limits.
Use explicit screen analysis and verify element detection.
Enable screenshot logging for auditing and debugging.
Design workflows with robust error recovery and progress tracking.

Example Use Cases

Example 1: Automate Form Filling — Fill out the customer feedback form with a 5-star rating and the comment 'Great service!'
Example 2: CRM Data Entry — Automatically enter new lead details into a native CRM app by navigating fields and saving records.
Example 3: File Management — Batch rename and move files in File Explorer based on metadata and folder rules.
Example 4: Setup Wizard Automation — Navigate multi-step installation wizards across windows and dialogs, configuring options as needed.
Example 5: GUI Regression Testing — Click through menus, enter data, and verify UI states to catch regressions.
