Compare commits

4 Commits

Author SHA1 Message Date
copilot-swe-agent[bot]
cbc13366c6 Add comprehensive documentation and examples for AI detection system
Co-authored-by: blackpiglet <59276555+blackpiglet@users.noreply.github.com>
2026-02-02 03:25:37 +00:00
copilot-swe-agent[bot]
664b25cca1 Fix YAML syntax and validate AI detection workflow
Co-authored-by: blackpiglet <59276555+blackpiglet@users.noreply.github.com>
2026-02-02 03:23:57 +00:00
copilot-swe-agent[bot]
3504943019 Add AI-generated issue detection system with workflow and documentation
Co-authored-by: blackpiglet <59276555+blackpiglet@users.noreply.github.com>
2026-02-02 03:22:43 +00:00
copilot-swe-agent[bot]
acd4d5b183 Initial plan
2026-02-02 03:19:24 +00:00
6 changed files with 636 additions and 0 deletions

197
.github/AI-DETECTION-EXAMPLES.md vendored Normal file

@@ -0,0 +1,197 @@
# AI Issue Detection - Examples
This document provides examples to help understand what triggers AI detection.
## Example 1: High AI Score (Score: 6/8) ❌
**This would be flagged:**
````markdown
## Description
When deploying Velero on an EKS cluster with `hostNetwork: true`, the application fails to start.
## Critical Problem
```
time="2026-01-26T16:40:55Z" level=fatal msg="failed to start metrics server"
```
Status: BLOCKER
## Affected Environment
| Parameter | Value |
|----------|----------|
| Cluster | Amazon EKS |
| Velero Version | 1.8.2 |
| Kubernetes | 1.33 |
## Root Cause Analysis
The controller-runtime metrics uses port 8080 as a hardcoded default...
## Resolution Attempts
### Attempt 1: Use extraArgs
Result: Failed
### Attempt 2: Configure metricsAddress
Result: Failed
## Expected Permanent Solution
Velero should:
1. Auto-detect an available port
2. Accept configuring the controller-runtime port
## Questions for Maintainers
1. Why does controller-runtime use hardcoded 8080?
2. Is there a roadmap to support hostNetwork?
## Labels and Metadata
Severity: CRITICAL
````
**Why flagged (Patterns detected: 6/8):**
- `futureDates` - References "2026-01-26" and "Kubernetes 1.33"
- `excessiveHeaders` - 8+ section headers
- `formalPhrases` - "Root Cause Analysis", "Expected Permanent Solution", "Questions for Maintainers", "Labels and Metadata"
- `aiSectionHeaders` - "## Description", "## Critical Problem", "## Affected Environment", "## Resolution Attempts"
- `perfectFormatting` - Perfect table structure
- `genericSolutions` - Mentions "auto-detect"
---
## Example 2: Medium AI Score (Score: 2/8) ✅
**This would NOT be flagged (below threshold):**
````markdown
**What steps did you take and what happened:**
I'm trying to restore a backup but getting this error:
```
error: backup "my-backup" not found
```
**What did you expect to happen:**
The backup should restore successfully
**Environment:**
- Velero version: 1.13.0
- Kubernetes version: 1.28
- Cloud provider: AWS
**Additional context:**
I can see the backup in S3 but Velero doesn't list it. Running `velero backup get` shows no backups.
````
**Why NOT flagged (Patterns detected: 2/8):**
- `futureDates` - Uses realistic versions
- `excessiveHeaders` - Only 3 headers
- `formalPhrases` - No formal AI phrases
- `excessiveTables` - Has a table but only 1
- `perfectFormatting` - Normal formatting
- `aiSectionHeaders` - Standard issue template headers
- `excessiveFormatting` - Has code blocks
- `genericSolutions` - No generic solutions
---
## Example 3: Legitimate Detailed Issue (Score: 3/8) ⚠️
**This would be flagged but is actually legitimate:**
```markdown
## Problem Description
VolumeGroupSnapshot restore fails with Ceph RBD driver.
## Environment
- Velero: 1.14.0
- Kubernetes: 1.28.3
- ODF: 4.14.2 with Ceph RBD CSI driver
## Root Cause
Ceph RBD stores group snapshot metadata in journal as `csi.groupid` omap key. During restore, when creating pre-provisioned VSC, the RBD driver reads this and populates `status.volumeGroupSnapshotHandle`.
The CSI snapshot controller looks for a VGSC with matching handle. Since Velero deletes VGSC after backup, it's not found.
## Reproduction Steps
1. Create backup with VGS
2. Delete namespace
3. Restore backup
4. Observe VS stuck with "cannot find group snapshot"
## Workaround
Create stub VGSC with matching `volumeGroupSnapshotHandle` and patch status.
## Proposed Fix
1. Backup: Capture `volumeGroupSnapshotHandle` in CSISnapshotInfo
2. Restore: Create stub VGSC if handle exists
## Code References
- Ceph RBD: https://github.com/ceph/ceph-csi/blob/devel/internal/rbd/snapshot.go#L167
- Velero deletion: https://github.com/vmware-tanzu/velero/blob/main/pkg/backup/actions/csi/pvc_action.go#L1124
```
**Why flagged (Patterns detected: 3/8):**
- `futureDates` - Uses current versions
- `excessiveHeaders` - Has 6 section headers
- `formalPhrases` - "Root Cause", "Proposed Fix"
- `excessiveTables` - No tables
- `perfectFormatting` - Normal formatting
- `aiSectionHeaders` - Technical, not generic
- `excessiveFormatting` - Reasonable formatting
- `genericSolutions` - Structured solution with code refs
**Maintainer Action**: This is a legitimate, well-researched issue. Verify the details with the contributor and remove the `potential-ai-generated` label.
---
## Example 4: Simple Valid Issue (Score: 0/8) ✅
**This would NOT be flagged:**
```markdown
Velero backup fails with error: `rpc error: code = Unavailable desc = connection error`
Running Velero 1.13 on GKE. Backups were working yesterday but now all fail with this error.
Logs show the node-agent pod is crashing. Any ideas?
```
**Why NOT flagged (Patterns detected: 0/8):**
- All patterns: None detected
---
## Key Takeaways
### Will Trigger Detection ❌
- Future dates/versions (2026+, K8s 1.33+)
- 4+ formal AI phrases
- 8+ section headers
- Perfect table formatting across multiple tables
- Generic AI section titles
- Auto-detect/generic solution patterns
### Will NOT Trigger ✅
- Realistic version numbers
- Actual error messages from real systems
- Normal issue formatting
- Moderate level of detail
- Standard GitHub issue template
### May Trigger (But Legitimate) ⚠️
- Very detailed technical analysis
- Multiple code references
- Well-structured proposals
- Extensive testing documentation
For these cases, maintainers will verify with the contributor and remove the flag once confirmed.
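To see how a draft issue would score before submitting it, the eight checks can be run locally. The sketch below copies the expressions from `.github/workflows/ai-issue-detector.yml`; the wrapper name `detectPatterns` and the helpers `count` and `hits` are hypothetical, introduced only for this illustration. It scores the body of Example 4.

```javascript
// Checks copied from .github/workflows/ai-issue-detector.yml.
// detectPatterns, count, and hits are hypothetical names for local use only.
function detectPatterns(issueBody) {
  const count = (re) => (issueBody.match(re) || []).length;
  const hits = (list) => list.filter((s) => issueBody.includes(s)).length;
  return {
    excessiveTables: count(/\|.*\|/g) > 5,
    excessiveHeaders: count(/^#{1,6}\s+/gm) > 8,
    formalPhrases: hits([
      'Root Cause Analysis', 'Operational Impact', 'Expected Permanent Solution',
      'Questions for Maintainers', 'Labels and Metadata', 'Reference Files',
      'Steps to Reproduce',
    ]) > 4,
    excessiveFormatting: issueBody.includes('---\n \n---') || count(/---/g) > 4,
    futureDates: /202[6-9]|203\d/.test(issueBody),
    perfectFormatting: issueBody.includes('| Parameter | Value |') &&
      issueBody.includes('| Aspect | Status | Impact |'),
    aiSectionHeaders: hits([
      '## Description', '## Critical Problem', '## Affected Environment',
      '## Full Helm Configuration', '## Resolution Attempts', '## Related Information',
    ]) > 4,
    genericSolutions: issueBody.includes('auto-detect') &&
      issueBody.includes('configuration:') && count(/```yaml/g) > 2,
  };
}

// Body of Example 4 from this document.
const example4 = [
  'Velero backup fails with error: `rpc error: code = Unavailable desc = connection error`',
  'Running Velero 1.13 on GKE. Backups were working yesterday but now all fail with this error.',
  'Logs show the node-agent pod is crashing. Any ideas?',
].join('\n');

// Collect the names of the patterns that fired.
const detected = Object.entries(detectPatterns(example4))
  .filter(([, hit]) => hit)
  .map(([name]) => name);
console.log(detected.length + '/8', detected);
```

Pasting another issue body into `example4` gives the score the workflow would compute for it, since every check is a pure function of the body text.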

80
.github/AI-DETECTION-README.md vendored Normal file

@@ -0,0 +1,80 @@
# AI-Generated Content Detection
This directory contains the AI-generated content detection system for Velero issues.
## Overview
The Velero project has implemented automated detection of potentially AI-generated issues to help maintain quality and ensure that issues describe real, verified problems.
## How It Works
### Detection Workflow
The workflow (`.github/workflows/ai-issue-detector.yml`) runs automatically when:
- A new issue is opened
- An existing issue is edited
### Detection Patterns
The detector analyzes issues for several AI-generation patterns:
1. **Excessive Tables** - More than 5 lines that look like Markdown table rows (the check counts rows, not whole tables)
2. **Excessive Headers** - More than 8 section headers anywhere in the issue body
3. **Formal Phrases** - More than 4 formal phrases typical of AI output (e.g., "Root Cause Analysis", "Operational Impact", "Expected Permanent Solution")
4. **Excessive Formatting** - More than 4 occurrences of `---`, such as repeated horizontal rules
5. **Future Dates** - Any year from 2026 through 2039 appearing in the issue body
6. **Perfect Formatting** - Both a "Parameter / Value" table and an "Aspect / Status / Impact" table are present
7. **AI Section Headers** - More than 4 generic AI-style headers like "Critical Problem", "Resolution Attempts"
8. **Generic Solutions** - The text contains "auto-detect" and "configuration:" together with more than 2 fenced `yaml` blocks
### Scoring System
Each detected pattern adds to the AI score. If the score is 3 or higher (out of 8), the issue is flagged as potentially AI-generated.
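The scoring rule can be sketched as a small function. This is a minimal illustration, not the workflow code itself: `scoreIssue` and `THRESHOLD` are hypothetical names, and the pattern results passed in are made-up sample values.

```javascript
// Hypothetical helper mirroring the scoring rule described above.
const THRESHOLD = 3; // flag at 3 or more detected patterns

function scoreIssue(patterns) {
  // Keep the names of the patterns whose check evaluated to true.
  const detected = Object.entries(patterns)
    .filter(([, hit]) => hit)
    .map(([name]) => name);
  return {
    score: detected.length,
    flagged: detected.length >= THRESHOLD,
    detected,
  };
}

// Sample input: three of the eight checks fired.
const result = scoreIssue({
  excessiveTables: false,
  excessiveHeaders: true,
  formalPhrases: false,
  excessiveFormatting: true,
  futureDates: true,
  perfectFormatting: false,
  aiSectionHeaders: false,
  genericSolutions: false,
});
console.log(result.score, result.flagged, result.detected.join(', '));
```

Every pattern carries the same weight, so the score is simply the number of checks that fired.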
### Actions Taken
When an issue is flagged:
1. A `potential-ai-generated` label is added
2. A `needs-triage` label is added
3. An automated comment is posted explaining:
- Why the issue was flagged
- What patterns were detected
- Guidelines for contributors to follow
- Request for verification
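The automated comment also reports a confidence percentage, which the workflow derives as the score divided by 8 and rounded to a whole number. A minimal sketch of that arithmetic follows; the function name `confidencePercent` is hypothetical.

```javascript
// Hypothetical helper reproducing the workflow's expression
// Math.round(aiScore / 8 * 100).
function confidencePercent(aiScore, totalPatterns = 8) {
  return Math.round((aiScore / totalPatterns) * 100);
}

// Confidence reported for every score that can trigger the comment.
const reported = [3, 4, 5, 6, 7, 8].map((s) => confidencePercent(s));
console.log(reported.join(', '));
```

Note that an issue flagged at the minimum score of 3 is reported with a confidence below 50%, which is worth keeping in mind when reading the comment.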
## For Contributors
If your issue is flagged:
1. **Don't panic** - This is not an accusation, just a request for verification
2. **Review the guidelines** in our [Code Standards](../site/content/docs/main/code-standards.md#ai-generated-content)
3. **Verify your content**:
- Ensure all version numbers are accurate
- Confirm error messages are from your actual environment
- Remove any placeholder or example content
- Simplify overly structured formatting
4. **Update the issue** with corrections if needed
5. **Comment to confirm** that the issue describes a real problem
## For Maintainers
When reviewing flagged issues:
1. Check if the technical details are realistic and verifiable
2. Look for signs of hallucinated content (fake version numbers, non-existent features)
3. Engage with the issue author to verify the problem
4. Remove the `potential-ai-generated` label once verified
5. Close issues that cannot be verified or describe non-existent problems
## Configuration
The detection patterns can be adjusted in the workflow file if needed. The threshold is currently set at 3 out of 8 patterns to balance false positives with detection accuracy.
## False Positives
The detector may occasionally flag legitimate issues, especially those that are:
- Very detailed and well-structured
- Using formal technical documentation style
- Reporting complex problems with extensive details
This is intentional - we prefer to verify detailed issues rather than miss AI-generated ones.

186
.github/MAINTAINER-AI-DETECTION-GUIDE.md vendored Normal file

@@ -0,0 +1,186 @@
# Maintainer Guide: AI-Generated Issue Detection
This guide helps Velero maintainers understand and work with the AI-generated issue detection system.
## Overview
The AI detection system automatically analyzes new and edited issues to identify potential AI-generated content. This helps maintain issue quality and ensures contributors verify their submissions.
## How It Works
### Automatic Detection
When an issue is opened or edited, the workflow:
1. **Analyzes** the issue body for 8 different AI patterns
2. **Calculates** an AI confidence score (0-8)
3. **If score ≥ 3**: Adds labels and posts a comment
4. **If score < 3**: Takes no action (issue proceeds normally)
### Detection Patterns
| Pattern | Description | Weight |
|---------|-------------|--------|
| `excessiveTables` | More than 5 table-row lines (lines containing at least two pipe characters) | 1 |
| `excessiveHeaders` | More than 8 section headers | 1 |
| `formalPhrases` | More than 4 AI-typical phrases (e.g., "Root Cause Analysis") | 1 |
| `excessiveFormatting` | More than 4 occurrences of `---` (horizontal rules) | 1 |
| `futureDates` | Any year from 2026 through 2039 in the body | 1 |
| `perfectFormatting` | Both the "Parameter / Value" and "Aspect / Status / Impact" table headers present | 1 |
| `aiSectionHeaders` | More than 4 generic AI headers (e.g., "Critical Problem") | 1 |
| `genericSolutions` | "auto-detect" and "configuration:" with more than 2 YAML blocks | 1 |
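Two of these checks are broader than their names suggest, which helps explain some false positives. The sketch below runs the two regular expressions from the workflow against made-up sample strings; the function names `countTableRows` and `hasFutureYear` are hypothetical.

```javascript
// Regular expressions copied from the workflow; function names are hypothetical.
const countTableRows = (body) => (body.match(/\|.*\|/g) || []).length;
const hasFutureYear = (body) => /202[6-9]|203\d/.test(body);

// A single table: header, separator, and four data rows (sample data).
const oneTable = [
  '| Parameter | Value |',
  '|-----------|-------|',
  '| Cluster | EKS |',
  '| Velero | 1.13.0 |',
  '| Kubernetes | 1.28 |',
  '| Provider | AWS |',
].join('\n');

// Every line of the table counts as one match, so one table can exceed the limit of 5.
console.log(countTableRows(oneTable));

// The year check matches any digit run containing 2026 to 2039, not only dates.
console.log(hasFutureYear('listening on port 12030'));

// A version number such as 1.33 is not detected by this check.
console.log(hasFutureYear('Kubernetes 1.33'));
```

When a flagged issue lists `excessiveTables` or `futureDates`, it is worth checking whether a single ordinary table or an unrelated number caused the match.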
## Working with Flagged Issues
### Step 1: Review the Issue
When you see an issue labeled `potential-ai-generated`:
1. **Read the issue carefully**
2. **Check the detected patterns** (listed in the auto-comment)
3. **Look for red flags**:
- Future version numbers (e.g., "Kubernetes 1.33")
- Future dates (e.g., "2026-01-27")
- Non-existent features or configurations
- Perfect table formatting with no actual content
- Generic solutions that don't match Velero's architecture
### Step 2: Engage with the Contributor
**If the issue seems legitimate but over-formatted:**
```markdown
Thanks for the detailed report! Could you confirm:
1. Are you running Velero version X.Y.Z (you mentioned version A.B.C)?
2. Is the error message exactly as shown?
3. Have you actually tried the workarounds mentioned?
Once verified, we'll remove the AI-generated flag and investigate.
```
**If the issue appears to be unverified AI content:**
```markdown
This issue appears to contain AI-generated content that hasn't been verified.
Please review our [AI contribution guidelines](https://github.com/vmware-tanzu/velero/blob/main/site/content/docs/main/code-standards.md#ai-generated-content) and:
1. Confirm this describes a real problem in your environment
2. Verify all version numbers and error messages
3. Remove any placeholder or example content
4. Test that the issue is reproducible
If you can't verify the issue, please close it. We're happy to help with real problems!
```
### Step 3: Take Action
**For verified issues:**
1. Remove the `potential-ai-generated` label
2. Keep or remove `needs-triage` as appropriate
3. Proceed with normal issue triage
**For unverified/invalid issues:**
1. Request verification (see templates above)
2. If no response after 7 days, consider closing as `stale`
3. If clearly invalid, close with explanation
## Common Patterns
### False Positives (Legitimate Issues)
These may trigger the detector but are usually valid:
- **Very detailed bug reports** with extensive logs and testing
- **Technical design proposals** with multiple sections
- **Well-organized feature requests** with tables and examples
**Action**: Engage with contributor, ask clarifying questions, remove flag if verified.
### True Positives (AI-Generated)
Red flags that indicate unverified AI content:
- **Future version numbers**: "Kubernetes 1.33" (doesn't exist yet)
- **Future dates**: "2026-01-27" (if current date is before)
- **Non-existent features**: References to Velero features that don't exist
- **Generic solutions**: "Auto-detect available port" (not how Velero works)
- **Perfect formatting, wrong content**: Beautiful tables with incorrect info
**Action**: Request verification, ask for actual environment details, consider closing if unverified.
### Edge Cases
**Contributor using AI as a writing assistant:**
- Issue content is verified and accurate
- Just used AI to help structure/format the report
- **Action**: This is acceptable! Remove flag if content is verified.
**Legitimate issue that happens to match patterns:**
- Real problem with detailed analysis
- Includes proper version numbers and logs
- **Action**: Verify with contributor, remove flag once confirmed.
## Statistics and Monitoring
You can search for flagged issues:
```
is:issue label:potential-ai-generated
```
Monitor trends:
- High detection rate → May need to adjust thresholds
- Low detection rate → Patterns working well or need refinement
## Adjusting the System
### Modifying Detection Patterns
Edit `.github/workflows/ai-issue-detector.yml`:
```javascript
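const aiPatterns = {
  // Adjust pattern sensitivity
  excessiveTables: (issueBody.match(/\|.*\|/g) || []).length > 8, // was 5
  // ...other patterns unchanged
};
// Increase threshold to reduce false positives
if (aiScore >= 4) { // was 3
  // ...add labels and post the comment as before
}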
```
### Adding New Patterns
Add to the `aiPatterns` object:
```javascript
// Example: Detect excessive use of emojis
excessiveEmojis: (issueBody.match(/[\u{1F300}-\u{1F9FF}]/gu) || []).length > 10,
```
### Disabling the Workflow
Rename or delete `.github/workflows/ai-issue-detector.yml`.
## Best Practices
1. **Be courteous**: Contributors may not realize their AI tool generated incorrect info
2. **Verify, don't assume**: Some detailed issues are legitimate
3. **Educate**: Point to the AI guidelines in code-standards.md
4. **Track patterns**: Note common AI-generated patterns for future improvements
5. **Iterate**: Adjust detection thresholds based on false positive rates
## FAQ
**Q: Should we reject all AI-assisted contributions?**
A: No! AI assistance is fine if the contributor verifies accuracy. We only flag unverified AI content.
**Q: What if a contributor is offended by the flag?**
A: Explain it's automated and not personal. We just need verification of technical details.
**Q: Can we automatically close flagged issues?**
A: No. Always engage with the contributor first. Some are legitimate.
**Q: What's an acceptable false positive rate?**
A: Aim for <10%. If higher, increase the threshold from 3 to 4 or 5.
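As a rough illustration of that guideline, the false positive rate is the share of flagged issues that maintainers later verified as legitimate. The function names and the counts below are hypothetical; real numbers would come from reviewing issues carrying the `potential-ai-generated` label.

```javascript
// Hypothetical helpers; the 10% target comes from the answer above.
function falsePositiveRate(flaggedTotal, verifiedLegitimate) {
  if (flaggedTotal === 0) return 0; // nothing flagged, nothing to measure
  return verifiedLegitimate / flaggedTotal;
}

function suggestThreshold(currentThreshold, rate, target = 0.10) {
  // Raise the threshold by one step when the rate exceeds the target.
  return rate > target ? currentThreshold + 1 : currentThreshold;
}

// Made-up counts: 20 flagged issues, 3 of them confirmed legitimate.
const rate = falsePositiveRate(20, 3);
console.log((rate * 100).toFixed(1) + '%', 'suggested threshold:', suggestThreshold(3, rate));
```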
## Support
Questions about the AI detection system? Tag @vmware-tanzu/velero-maintainers in issue #9501.

1
.github/labels.yaml vendored

@@ -41,3 +41,4 @@ kind:
- tech-debt
- usage-error
- voting
- potential-ai-generated

132
.github/workflows/ai-issue-detector.yml vendored Normal file

@@ -0,0 +1,132 @@
name: "Detect AI-Generated Issues"
on:
  issues:
    types: [opened, edited]
jobs:
  detect-ai-content:
    runs-on: ubuntu-latest
    permissions:
      issues: write
      contents: read
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Analyze issue for AI-generated content
        id: analyze
        uses: actions/github-script@v7
        with:
          script: |
            const issue = context.payload.issue;
            const issueBody = issue.body || '';
            const issueTitle = issue.title || '';
            // AI detection patterns
            const aiPatterns = {
              // More than 5 lines that look like Markdown table rows
              excessiveTables: (issueBody.match(/\|.*\|/g) || []).length > 5,
              // More than 8 Markdown headers anywhere in the body
              excessiveHeaders: (issueBody.match(/^#{1,6}\s+/gm) || []).length > 8,
              // Overly formal language patterns common in AI
              formalPhrases: [
                'Root Cause Analysis',
                'Operational Impact',
                'Expected Permanent Solution',
                'Questions for Maintainers',
                'Labels and Metadata',
                'Reference Files',
                'Steps to Reproduce'
              ].filter(phrase => issueBody.includes(phrase)).length > 4,
              // Repeated horizontal rules (more than 4 occurrences of ---)
              excessiveFormatting: issueBody.includes('---\n \n---') ||
                (issueBody.match(/---/g) || []).length > 4,
              // Any year from 2026 through 2039 appearing in the body
              futureDates: /202[6-9]|203\d/.test(issueBody),
              // Overly detailed technical specs with perfect formatting
              perfectFormatting: issueBody.includes('| Parameter | Value |') &&
                issueBody.includes('| Aspect | Status | Impact |'),
              // Generic AI-style section headers
              aiSectionHeaders: [
                '## Description',
                '## Critical Problem',
                '## Affected Environment',
                '## Full Helm Configuration',
                '## Resolution Attempts',
                '## Related Information'
              ].filter(header => issueBody.includes(header)).length > 4,
              // Unusual specificity combined with generic solutions
              genericSolutions: issueBody.includes('auto-detect') &&
                issueBody.includes('configuration:') &&
                (issueBody.match(/```yaml/g) || []).length > 2
            };
            // Calculate AI score
            let aiScore = 0;
            let detectedPatterns = [];
            for (const [pattern, detected] of Object.entries(aiPatterns)) {
              if (detected) {
                aiScore++;
                detectedPatterns.push(pattern);
              }
            }
            console.log('AI Score: ' + aiScore + '/8');
            console.log('Detected patterns: ' + detectedPatterns.join(', '));
            // If AI score is high, add label and comment
            if (aiScore >= 3) {
              // Add label
              try {
                await github.rest.issues.addLabels({
                  owner: context.repo.owner,
                  repo: context.repo.repo,
                  issue_number: issue.number,
                  labels: ['needs-triage', 'potential-ai-generated']
                });
                // Add comment
                const confidence = Math.round(aiScore/8 * 100);
                const repoPath = context.repo.owner + '/' + context.repo.repo;
                const comment = '👋 Thank you for opening this issue!\n\n' +
                  'This issue has been flagged for review as it may contain AI-generated content (confidence: ' + confidence + '%).\n\n' +
                  '**Detected patterns:** ' + detectedPatterns.join(', ') + '\n\n' +
                  'If this issue was created with AI assistance, please review our [AI contribution guidelines](https://github.com/' + repoPath + '/blob/main/site/content/docs/main/code-standards.md#ai-generated-content).\n\n' +
                  '**Important:**\n' +
                  '- Please verify all technical details are accurate\n' +
                  '- Ensure version numbers, dates, and configurations reflect your actual environment\n' +
                  '- Remove any placeholder or example content\n' +
                  '- Confirm the issue is reproducible in your environment\n\n' +
                  'A maintainer will review this issue shortly. If this was flagged in error, please let us know!';
                await github.rest.issues.createComment({
                  owner: context.repo.owner,
                  repo: context.repo.repo,
                  issue_number: issue.number,
                  body: comment
                });
                core.setOutput('ai-detected', 'true');
                core.setOutput('ai-score', aiScore);
              } catch (error) {
                console.log('Error adding label or comment:', error);
              }
            } else {
              core.setOutput('ai-detected', 'false');
              core.setOutput('ai-score', aiScore);
            }
            return {
              aiDetected: aiScore >= 3,
              score: aiScore,
              patterns: detectedPatterns
            };

View File

@@ -42,6 +42,46 @@ A command to do this is `make new-changelog CHANGELOG_BODY="Changes you have mad
If a PR does not warrant a changelog, the CI check for a changelog can be skipped by applying a `changelog-not-required` label on the PR. If you are making a PR on a release branch, you should still make a new file in the `changelogs/unreleased` folder on the release branch for your change.
## AI-Generated Content
We welcome contributions from all developers, including those who use AI tools to assist in their work. However, to maintain code quality and ensure contributions are accurate and appropriate, please follow these guidelines:
### Using AI Assistance
**Acceptable use:**
- Using AI tools (like GitHub Copilot, ChatGPT, Claude, etc.) to generate scaffolding or boilerplate code
- Getting AI assistance for writing unit tests
- Using AI to help understand complex code patterns
- AI-assisted debugging and problem-solving
- Using AI to help with documentation writing
**Requirements when using AI:**
1. **Always review and verify** AI-generated content before submitting
2. **Test thoroughly** - ensure the code works as expected in your environment
3. **Verify technical accuracy** - check that all version numbers, configurations, and technical details are correct
4. **Remove placeholders** - ensure there is no example or placeholder content
5. **Understand the code** - be able to explain and defend your changes during code review
6. **Disclose AI usage** - if a significant portion of your PR was AI-generated, mention it in the PR description
### What to Avoid
**Unacceptable practices:**
- Submitting entirely AI-generated PRs or issues without review or verification
- Including hallucinated information (false version numbers, non-existent APIs, etc.)
- Copying AI-generated content with placeholder or example data
- Submitting AI-generated issues describing problems you haven't actually experienced
- Using AI to generate issues about features or bugs without verifying they exist
### For Issues
When creating issues with AI assistance:
- Ensure the issue describes a **real problem** you have experienced
- Verify all version numbers, error messages, and configurations are from your actual environment
- Remove any AI-generated boilerplate or overly formal structure
- Focus on clarity and accuracy over comprehensive formatting
Issues that appear to be entirely AI-generated without proper verification may be labeled as `potential-ai-generated` and flagged for additional review.
## Copyright header
Whenever a source code file is being modified, the copyright notice should be updated to our standard copyright notice. That is, it should read “Copyright the Velero contributors.”