This commit is contained in:
2025-12-30 15:44:54 +01:00
parent 4fb120723d
commit 10a5da4cc3
5 changed files with 121 additions and 28 deletions

@@ -22,24 +22,23 @@ This scrapes unchecked URLs from `scraping.md`, saves JSON to `scraped_data/`, a
### 2. Process scraped data into docs/
```bash
# Process 1 file (default)
venv/bin/python python/process.py
# Process 5 files
venv/bin/python python/process.py -n 5
# Process all remaining files
venv/bin/python python/process.py -n 9999
# Process in parallel (e.g., 4 workers processing 10 files each)
for i in {1..4}; do venv/bin/python python/process.py -n 10 & done; wait
```
The script uses file locking so it can run safely in parallel. Each invocation:
1. Claims pending JSON files from `processed.md`
2. Calls Claude to parse them into the `docs/` folder structure
3. Marks them as completed
---