A multi-source content pipeline that scrapes career/workplace topics from HN, Reddit, and newsletters, then uses AI to generate blog post drafts with human review before publishing.
- Daily (automated): Scrapes HN, Reddit, and newsletters for career/workplace topics
- Daily (automated): Extracts themes and scores topics using AI
- Twice weekly (automated): Generates a blog post draft with critique loop
- Human review: You review the PR, pick a headline, edit, and merge
- Draft posts appear as GitHub PRs in src/content/beyondthecode/
- PR description includes headline options, quality scores, and pull quotes
- You edit and merge when ready to publish
┌─────────────────────────────────────────────────────────────────┐
│ DAILY (00:00 UTC) │
│ btc-scrape.yml workflow │
├─────────────────────────────────────────────────────────────────┤
│ hn_scraper_btc.py → data/hn_nontech_{date}.json │
│ reddit_scraper.py → data/reddit_{date}.json │
│ newsletter_monitor.py → data/newsletters_{date}.json │
│ topic_extractor.py → data/topic_bank.json │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ TWICE WEEKLY (Mon/Thu 08:00 UTC) │
│ btc-generate.yml workflow │
├─────────────────────────────────────────────────────────────────┤
│ content_generator.py: │
│ 1. Select best unused topic │
│ 2. Generate outline (Claude) │
│ 3. Critique outline (Gemini) │
│ 4. Generate draft (Claude) │
│ 5. Critique draft (Gemini) │
│ 6. Apply revisions (Claude) │
│ 7. Generate headline options │
│ 8. Create PR as draft │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ HUMAN REVIEW │
├─────────────────────────────────────────────────────────────────┤
│ 1. Review PR │
│ 2. Pick headline from options │
│ 3. Edit draft as needed │
│ 4. Mark ready & merge → Published │
└─────────────────────────────────────────────────────────────────┘
Go to: Repository → Settings → Secrets and variables → Actions → New repository secret
Add these secrets:
| Secret Name | Required | Description | How to Get |
|---|---|---|---|
| ANTHROPIC_API_KEY | Yes* | Claude API key | console.anthropic.com |
| GOOGLE_API_KEY | Yes | Gemini API key (free tier) | aistudio.google.com |
| OPENAI_API_KEY | Yes* | OpenAI API key (fallback for Anthropic) | platform.openai.com |
| GROQ_API_KEY | No | Groq API key (free fallback) | console.groq.com |
| REDDIT_CLIENT_ID | Yes | Reddit OAuth app client ID | reddit.com/prefs/apps - Create "script" app |
| REDDIT_CLIENT_SECRET | Yes | Reddit OAuth app secret | Same as above - shown after creating app |
*Either ANTHROPIC_API_KEY or OPENAI_API_KEY is required. OpenAI serves as fallback when Anthropic is unavailable.
Reddit App Setup: Go to reddit.com/prefs/apps → "create another app" → Select "script" type → Set redirect URI to http://localhost:8080 → Note the client ID (under app name) and secret.
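
The "either Anthropic or OpenAI" rule from the secrets table can be checked before a run. The following is a minimal sketch, not the logic used by run_pipeline.py; the secret names come from the table, while `missing_secrets` and the two grouping constants are hypothetical names introduced for illustration.

```python
import os

# Secret names are taken from the table above; the grouping is an illustrative assumption.
ALWAYS_REQUIRED = ["GOOGLE_API_KEY", "REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET"]
ONE_OF_REQUIRED = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"]


def missing_secrets(env):
    """Return a list of missing requirements; an empty list means the env looks complete."""
    problems = [name for name in ALWAYS_REQUIRED if not env.get(name)]
    # At least one of the paid LLM keys must be present.
    if not any(env.get(name) for name in ONE_OF_REQUIRED):
        problems.append(" or ".join(ONE_OF_REQUIRED))
    return problems


if __name__ == "__main__":
    issues = missing_secrets(os.environ)
    print("OK" if not issues else "Missing: " + ", ".join(issues))
```

GROQ_API_KEY is left out of the check because the table marks it as optional.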
cd /Users/gpagade/personal-code/rockoder.github.io
pip install -r scripts/requirements.txt

Copy the example file and fill in your keys:
cp .env.example .env
# Edit .env with your favorite editor

Or export variables directly:
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="AIza..."
export OPENAI_API_KEY="sk-..." # fallback for Anthropic
export GROQ_API_KEY="gsk_..." # optional
export REDDIT_CLIENT_ID="..." # Reddit OAuth
export REDDIT_CLIENT_SECRET="..." # Reddit OAuth

The .env file is already gitignored and will be automatically loaded by run_pipeline.py.
git add .
git commit -m "Add Beyond the Code content pipeline"
git push

Once set up, the pipeline runs automatically:
| Time | What Happens |
|---|---|
| Daily 00:00 UTC | Scrapers run, topics extracted, data committed |
| Mon/Thu 08:00 UTC | Draft generated, PR created |
- Check for new PRs (Mon/Thu afternoons)
  - Look for PRs titled [Draft] ...
- Review the PR
  - Read the draft in src/content/beyondthecode/
  - Check quality scores in PR description
  - Look at pull quote candidates
- Pick a headline
  - PR description has 5 headline options
  - Update the title: in the frontmatter
- Edit the draft
  - Fix any issues
  - Add personal touches
  - Verify voice consistency
- Merge when ready
  - Mark PR as ready for review
  - Merge to master
  - Site deploys automatically
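
The "Pick a headline" step is normally a manual edit. If you prefer to script it, here is a rough sketch that swaps the title: line in a post's frontmatter. It assumes the post begins with a `---` delimited frontmatter block containing a single-line `title:` entry; `set_frontmatter_title` is a hypothetical helper, not part of the pipeline.

```python
import re


def set_frontmatter_title(markdown_text, new_title):
    """Replace the title: line inside the leading --- frontmatter block."""
    match = re.match(r"^---\n(.*?)\n---\n", markdown_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no frontmatter block found")
    # Escape double quotes so the title stays a valid quoted string.
    escaped = new_title.replace('"', '\\"')
    new_line = f'title: "{escaped}"'
    # A function replacement avoids backslash interpretation in the new title.
    frontmatter, count = re.subn(
        r"(?m)^title:.*$", lambda _m: new_line, match.group(1), count=1
    )
    if count == 0:
        raise ValueError("frontmatter has no title: line")
    return "---\n" + frontmatter + "\n---\n" + markdown_text[match.end():]
```

Only the frontmatter is touched, so a line starting with "title:" in the post body is left alone.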
The easiest way to run the pipeline locally is using the unified runner script:
# 1. Install dependencies
pip install -r scripts/requirements.txt
# 2. Set up environment variables
cp .env.example .env
# Edit .env and add your API keys
# 3. Run full pipeline in dry-run mode (no PR, no git changes)
python scripts/run_pipeline.py --all --dry-run

Dry-run mode saves drafts to data/drafts/ instead of creating PRs, and doesn't modify the topic bank.
The unified runner script (scripts/run_pipeline.py) provides a single interface:
# Check environment variables
python scripts/run_pipeline.py --check-env
# Run full pipeline (dry-run - safe for testing)
python scripts/run_pipeline.py --all --dry-run
# Run only scrapers
python scripts/run_pipeline.py --scrape
# Run only topic extraction (requires scraped data)
python scripts/run_pipeline.py --extract
# Run only content generation (requires topics in bank)
python scripts/run_pipeline.py --generate --dry-run
# Run full pipeline and create actual PR
python scripts/run_pipeline.py --all

Options:
- --dry-run: Save draft locally instead of creating PR
- --skip-topic-update: Don't mark topic as used (for repeated testing)
- --check-env: Only check environment variables, don't run anything
- --fail-fast: Stop on first error (default: continue on error)
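
To make the flag semantics concrete, here is a sketch of how flags like these could be declared with argparse and mapped to pipeline stages. It only mirrors the documented flag names; `build_parser` and `planned_stages` are hypothetical, and the real run_pipeline.py may be organized differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags; the real script may differ."""
    parser = argparse.ArgumentParser(prog="run_pipeline.py")
    parser.add_argument("--all", action="store_true", help="Run scrape, extract, and generate")
    parser.add_argument("--scrape", action="store_true", help="Run only scrapers")
    parser.add_argument("--extract", action="store_true", help="Run only topic extraction")
    parser.add_argument("--generate", action="store_true", help="Run only content generation")
    parser.add_argument("--dry-run", action="store_true", help="Save draft locally instead of creating PR")
    parser.add_argument("--skip-topic-update", action="store_true", help="Don't mark topic as used")
    parser.add_argument("--check-env", action="store_true", help="Only check environment variables")
    parser.add_argument("--fail-fast", action="store_true", help="Stop on first error")
    return parser


def planned_stages(args):
    """Translate parsed flags into the ordered list of stages to run."""
    if args.check_env:
        return []  # --check-env runs nothing else
    stages = [("scrape", args.scrape), ("extract", args.extract), ("generate", args.generate)]
    return [name for name, wanted in stages if args.all or wanted]
```

For example, `planned_stages(build_parser().parse_args(["--generate", "--dry-run"]))` yields only the generate stage.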
Every content generation run saves intermediate results for debugging and prompt improvement:
data/debug/2026-02-22_143022/
├── 01_topic.json # Selected topic
├── 02_outline.md # Generated outline
├── 03_outline_critique.json # Outline critique scores
├── 04_draft.md # Initial draft
├── 05_draft_critique.json # Draft critique scores
├── 06_draft_revised.md # Final draft (after revisions)
├── 07_headlines.json # Headline options
└── 08_series_info.json # Series detection result
This output is saved regardless of --dry-run mode and is gitignored.
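
When debugging, you usually want the most recent run. Below is a small sketch for locating it, assuming run directories keep the YYYY-MM-DD_HHMMSS naming shown above so that sorting by name also sorts by time. `latest_debug_run` and `list_artifacts` are hypothetical helpers.

```python
from pathlib import Path


def latest_debug_run(debug_root):
    """Return the newest run directory under debug_root, or None if there is none.

    Relies on the timestamped directory names sorting chronologically.
    """
    root = Path(debug_root)
    if not root.is_dir():
        return None
    runs = sorted(p for p in root.iterdir() if p.is_dir())
    return runs[-1] if runs else None


def list_artifacts(run_dir):
    """Return artifact file names in step order (01_..., 02_..., ...)."""
    return sorted(p.name for p in Path(run_dir).iterdir() if p.is_file())
```

Calling `latest_debug_run("data/debug")` from the repository root would return the directory to inspect.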
# Just scrape HN
python scripts/hn_scraper_btc.py
# Output: data/hn_nontech_2026-02-19.json
# Just scrape Reddit
python scripts/reddit_scraper.py
# Output: data/reddit_2026-02-19.json
# Just check newsletters
python scripts/newsletter_monitor.py
# Output: data/newsletters_2026-02-19.json
# Just extract topics (needs scraped data first)
python scripts/topic_extractor.py
# Output: Updates data/topic_bank.json
# Just generate content (needs topics in bank)
python scripts/content_generator.py --dry-run
# Output: Saves draft to data/drafts/
# Generate content and create PR
python scripts/content_generator.py
# Output: Creates PR with draft

python scripts/content_generator.py --help
Options:
--dry-run Save draft locally instead of creating PR
--skip-topic-update Don't mark topic as used (for testing)

# Trigger daily scrape
gh workflow run btc-scrape.yml
# Trigger content generation
gh workflow run btc-generate.yml
# Check workflow status
gh run list --workflow=btc-scrape.yml
gh run list --workflow=btc-generate.yml

Edit config/models.yaml:
models:
  draft_writing:
    provider: "anthropic"
    model: "claude-sonnet-4-20250514" # Change to claude-3-haiku for cheaper
    fallback:
      provider: "anthropic"
      model: "claude-3-haiku-20240307"

Available providers: anthropic, openai, gemini, groq
Available models:
- anthropic: claude-sonnet-4-20250514, claude-3-haiku-20240307
- openai: gpt-4o, gpt-4o-mini, gpt-4-turbo
- gemini: gemini-2.0-flash, gemini-1.5-pro
- groq: llama-3.1-70b-versatile
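
One way to picture how a primary model and its fallback could be resolved is sketched below. This is not the logic in llm_client.py: the config is written as a plain dict instead of being loaded from YAML, nesting `fallback` under the task is an assumption about the file layout, and `resolve_model` is a hypothetical function.

```python
# Mirrors the shape of the config/models.yaml excerpt as a plain dict.
# Nesting "fallback" under the task entry is an assumption.
CONFIG = {
    "models": {
        "draft_writing": {
            "provider": "anthropic",
            "model": "claude-sonnet-4-20250514",
            "fallback": {"provider": "anthropic", "model": "claude-3-haiku-20240307"},
        }
    }
}


def resolve_model(config, task, usable_providers):
    """Return (provider, model) for a task, preferring the primary entry.

    usable_providers is the set of providers that currently have an API key.
    """
    entry = config["models"][task]
    candidates = [entry]
    if "fallback" in entry:
        candidates.append(entry["fallback"])
    for candidate in candidates:
        if candidate["provider"] in usable_providers:
            return candidate["provider"], candidate["model"]
    raise LookupError(f"no usable provider configured for task {task!r}")
```

With only an Anthropic key available, `resolve_model(CONFIG, "draft_writing", {"anthropic"})` picks the primary Sonnet entry.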
Add Reddit subreddits - Edit scripts/reddit_scraper.py:
SUBREDDITS = [
    "experienceddevs",
    "cscareerquestions",
    "managers",
    "yourNewSubreddit",  # Add here
]

Add newsletter feeds - Edit scripts/newsletter_monitor.py:
RSS_FEEDS = {
    "new_feed": {
        "name": "New Newsletter",
        "url": "https://example.com/feed.xml",
        "focus": ["topic1", "topic2"]
    },
    # ... existing feeds
}

Add HN keywords - Edit scripts/hn_scraper_btc.py:
NONTECH_KEYWORDS = [
    "career", "promotion", ...,
    "your_new_keyword",  # Add here
]

Edit the prompt templates in scripts/prompts/:
| File | Controls |
|---|---|
| outline.txt | Post structure, section headers, named patterns |
| draft.txt | Writing style, formatting, voice requirements |
| critique.txt | Quality criteria, scoring rubric |
Edit the cron expressions in .github/workflows/:
# btc-scrape.yml - Currently daily at midnight UTC
schedule:
- cron: '0 0 * * *' # Change as needed
# btc-generate.yml - Currently Mon/Thu at 8am UTC
schedule:
- cron: '0 8 * * 1,4' # Change days/time as needed

Cron format: minute hour day-of-month month day-of-week
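
To sanity-check a schedule before committing it, the five cron fields can be split and labeled. This is an illustrative sketch only: it handles plain numbers and `*` in the day-of-week field, not ranges or steps, and both function names are hypothetical.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")
WEEKDAY_NAMES = ("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")


def split_cron(expression):
    """Split a five-field cron expression into a dict keyed by field name."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


def weekday_names(day_of_week):
    """Name the days in a comma-separated day-of-week field; '*' means every day.

    0 is Sunday; 7 is also mapped to Sunday, as some cron implementations allow.
    """
    if day_of_week == "*":
        return list(WEEKDAY_NAMES)
    return [WEEKDAY_NAMES[int(value) % 7] for value in day_of_week.split(",")]
```

Applied to the generation schedule, `weekday_names(split_cron("0 8 * * 1,4")["day_of_week"])` names Monday and Thursday, which matches the Mon/Thu description.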
The topic bank is empty or all topics are used.
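
To make that condition concrete: a topic counts as unused when its `used` flag is missing or false. A minimal sketch of the check, assuming the bank is a dict with a `topics` list as in data/topic_bank.json (`topic_bank_stats` and `has_work` are hypothetical helper names):

```python
def topic_bank_stats(bank):
    """Count total and unused topics in a loaded topic bank dict."""
    topics = bank.get("topics", [])
    # A missing "used" key is treated the same as used == False.
    unused = [t for t in topics if not t.get("used")]
    return {"total": len(topics), "unused": len(unused)}


def has_work(bank):
    """True when at least one topic is still available for generation."""
    return topic_bank_stats(bank)["unused"] > 0
```
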
# Check topic bank status
python -c "import json; d=json.load(open('data/topic_bank.json')); print(f'Total: {len(d[\"topics\"])}, Unused: {len([t for t in d[\"topics\"] if not t.get(\"used\")])}')"
# Run scrapers to get fresh content
python scripts/hn_scraper_btc.py
python scripts/reddit_scraper.py
python scripts/newsletter_monitor.py
# Extract new topics
python scripts/topic_extractor.py

API key issues or rate limits.
# Test API keys
python -c "from scripts.llm_client import LLMClient; c=LLMClient(); print(c.generate('topic_extraction', 'Say hello'))"

Check:
- API keys are set correctly
- You have credits/quota remaining
- The model name in config/models.yaml is valid
Git or gh CLI issues.
# Check gh authentication
gh auth status
# Re-authenticate if needed
gh auth login --web
# Check you're on master branch
git checkout master
git pull

# Check workflow logs
gh run list --workflow=btc-scrape.yml
gh run view <run-id> --log
# Or check in GitHub UI:
# Repository → Actions → Select workflow → Click failed run

If you want to start fresh:
# Backup existing
cp data/topic_bank.json data/topic_bank.backup.json
# Reset
echo '{"topics": [], "last_updated": null}' > data/topic_bank.json
# Re-run extraction
python scripts/topic_extractor.py

If you want to regenerate a post for a topic:
import json

with open('data/topic_bank.json', 'r') as f:
    bank = json.load(f)

# Find and reset the topic
for topic in bank['topics']:
    if 'your search term' in topic['theme'].lower():
        topic['used'] = False
        print(f"Reset: {topic['theme']}")

with open('data/topic_bank.json', 'w') as f:
    json.dump(bank, f, indent=2)

rockoder.github.io/
├── .env.example  # Template for local environment variables
├── .github/workflows/
│   ├── btc-scrape.yml  # Daily scraping workflow
│   └── btc-generate.yml  # Twice-weekly generation workflow
├── config/
│   └── models.yaml  # LLM provider configuration
├── data/
│   ├── topic_bank.json  # Persistent topic storage
│   ├── hn_nontech_*.json  # Daily HN scrape results
│   ├── reddit_*.json  # Daily Reddit scrape results
│   ├── newsletters_*.json  # Daily newsletter results
│   ├── drafts/  # Local drafts from --dry-run mode (gitignored)
│   └── debug/  # Intermediate results for debugging (gitignored)
├── scripts/
│   ├── run_pipeline.py  # Unified local runner (recommended)
│   ├── llm_client.py  # Unified LLM interface
│   ├── hn_scraper_btc.py  # HN non-tech scraper
│   ├── reddit_scraper.py  # Reddit career subreddits
│   ├── newsletter_monitor.py  # RSS feed monitor
│   ├── topic_extractor.py  # AI theme extraction
│   ├── content_generator.py  # Main orchestrator
│   ├── requirements.txt  # Python dependencies
│   └── prompts/
│       ├── outline.txt  # Outline generation prompt
│       ├── draft.txt  # Draft writing prompt
│       └── critique.txt  # Quality critique prompt
└── src/content/beyondthecode/
    └── *.md  # Generated blog posts
With the default configuration (Gemini free tier + Claude/OpenAI paid):
| Usage | Anthropic | OpenAI (fallback) |
|---|---|---|
| Topic extraction (daily) | Free (Gemini Flash) | Free (Gemini Flash) |
| Outline generation (2x/week) | ~$0.10/post (Haiku) | ~$0.05/post (GPT-4o-mini) |
| Outline critique (2x/week) | Free (Gemini Flash) | Free (Gemini Flash) |
| Draft writing (2x/week) | ~$0.50/post (Sonnet) | ~$0.40/post (GPT-4o) |
| Draft critique (2x/week) | Free (Gemini Flash) | Free (Gemini Flash) |
| Final revision (2x/week) | ~$0.30/post (Sonnet) | ~$0.25/post (GPT-4o) |
| Monthly total (8 posts) | ~$7-10 | ~$5-8 |
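
The monthly totals follow from summing the paid steps per post and multiplying by eight posts. A quick worked check using the per-post figures from the table (the dictionary layout and `monthly_cost` are illustrative names, and the figures themselves are approximations):

```python
# Approximate per-post costs in USD, copied from the table above.
PER_POST_COST = {
    "anthropic": {"outline": 0.10, "draft": 0.50, "revision": 0.30},
    "openai": {"outline": 0.05, "draft": 0.40, "revision": 0.25},
}
POSTS_PER_MONTH = 8  # the table's monthly total assumes 8 posts


def monthly_cost(provider, posts=POSTS_PER_MONTH):
    """Sum the paid steps for one post and scale by posts per month."""
    return round(sum(PER_POST_COST[provider].values()) * posts, 2)
```

This gives about $7.20 for Anthropic and $5.60 for OpenAI, which sits at the low end of each stated range.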
To reduce costs:
- Use claude-3-haiku or gpt-4o-mini for draft writing
- Switch primary provider to OpenAI if you have credits there