
AI Automation for Content Publishers and News Sites

Automate your content pipeline — from scraping source websites to formatting, moderating, and publishing articles to WordPress.


13 min read | March 30, 2026
Content Automation · News Publishing · WordPress

The manual content pipeline is killing your newsroom

If you run a local news site or content aggregation business, you already know the drill. Someone on your team — maybe a junior editor, maybe an intern — spends hours each day visiting government press release pages, PR Newswire, local police blotters, school district announcements, and competitor outlets. They copy text, reformat it for your CMS, add a headline, pick a category, and submit it for review. Multiply that by twenty sources and three publications, and you've got a full-time job that's nothing but copy-paste.

This workflow made sense in 2015. It doesn't anymore. The same pipeline — monitor sources, grab content, process it, review it, publish it — can be automated with a combination of web scraping, AI processing, and CMS integration. Not as a replacement for your editorial team, but as a system that does the mechanical work so your editors can focus on original reporting and editorial judgment.

We've built these systems for publishers through our AI systems and automation services. The pattern is consistent: a newsroom spending 30+ hours a week on content aggregation drops that to 3-5 hours of editorial review after the system is in place. The content quality stays the same or improves, because the AI handles formatting consistently and editors spend their time on substance instead of structure.

This article walks through the full architecture — from monitoring source websites to publishing finished articles in WordPress — with enough technical detail to evaluate whether it's worth building for your operation.

How the pipeline fits together

The content automation pipeline has five stages. Each one runs independently, which means you can build and test them separately before connecting everything.

Stage 1: Source monitoring. The system watches a defined list of web pages, RSS feeds, or email inboxes for new content. When something new appears, it triggers the next stage.

Stage 2: Scraping. The system fetches the full article text, PDF document, or press release from the source. This includes handling paywalls, teaser blocks, and pagination.

Stage 3: AI processing. The scraped content gets sent to a language model (Claude or GPT-4) with instructions to reformat it, generate a headline, write a summary, and assign categories and tags.

Stage 4: Moderation. The processed article lands in a review dashboard where a human editor can approve, edit, or reject it before it goes live.

Stage 5: CMS publish. Approved articles get pushed to WordPress (or any CMS with an API) as draft or published posts, complete with categories, tags, and featured images.

This five-stage pipeline maps directly to how newsrooms already work. The difference is that stages 1 through 3 happen automatically, and your staff only touches stages 4 and 5. If you're evaluating how automation fits into your business operations more broadly, our breakdown of what an AI revenue system actually does covers the same pattern applied to sales and lead management.

The stack we typically recommend: Python (with BeautifulSoup and Playwright) or Node.js (with Puppeteer) for scraping, n8n or Make.com for orchestration, Claude's API for content processing, and the WordPress REST API for publishing. Each piece is replaceable — the architecture matters more than the specific tools.

Monitoring sources for new content

The first stage is knowing when new content exists. You need to define your source list and decide how to watch each source.

RSS feeds are the easiest. Many government agencies, wire services, and news outlets still publish RSS feeds. Your system subscribes to these feeds and checks them on a schedule — every 15 minutes, every hour, or whatever frequency matches your editorial pace. When a new item appears in the feed, the system extracts the URL and passes it to the scraping stage. Python's feedparser library handles RSS parsing in a few lines of code. Node.js has rss-parser. Both are reliable.
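As a sketch of that feed-checking step — this uses only the standard library so it stays self-contained, though in practice feedparser handles the malformed feeds you'll meet in the wild far more gracefully. The feed content here is invented for illustration:

```python
import xml.etree.ElementTree as ET

# A stand-in for what an HTTP fetch of a source's RSS feed would return.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>County Press Releases</title>
  <item><title>Road closure on Main St</title>
        <link>https://example.gov/releases/101</link></item>
  <item><title>Budget hearing scheduled</title>
        <link>https://example.gov/releases/102</link></item>
</channel></rss>"""

def extract_items(feed_xml: str) -> list[dict]:
    """Pull (title, link) pairs out of an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]

items = extract_items(SAMPLE_FEED)
```

Each extracted link gets compared against the already-seen list (next section) and, if new, queued for scraping.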

For sources without RSS feeds, you need web scraping with change detection. The system loads a web page, extracts the list of articles or press releases, and compares it against what it saw last time. New items get queued for scraping. This comparison can be as simple as storing a list of known URLs in a database and flagging any URL that hasn't been seen before.
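The comparison really can be that simple — a sketch, with invented URLs:

```python
def find_new_urls(scraped_urls: list[str], known_urls: set[str]) -> list[str]:
    """Return URLs we haven't processed before, preserving scrape order."""
    return [u for u in scraped_urls if u not in known_urls]

known = {"https://example.gov/releases/100",
         "https://example.gov/releases/101"}
latest = [
    "https://example.gov/releases/102",   # new — queue it for scraping
    "https://example.gov/releases/101",   # already processed — skip
]
queue = find_new_urls(latest, known)
known.update(queue)   # remember them so the next check skips them
```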

Some sources send content via email — PR agencies, government mailing lists, syndication partners. For these, your system monitors a dedicated inbox (Gmail API or IMAP polling) and extracts article content from the email body or attached documents.

The source monitoring layer needs a simple database table: source URL, check frequency, last checked timestamp, and a list of already-processed article URLs. This prevents duplicate processing and gives you a clear audit trail. If you're building on GoHighLevel, the platform's webhook triggers can feed into this monitoring layer for sources that push content to you rather than requiring you to pull it.
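A minimal version of that monitoring schema, sketched with SQLite — the table and column names are our own convention, not a fixed design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a real file path in production
conn.executescript("""
CREATE TABLE sources (
    id                     INTEGER PRIMARY KEY,
    url                    TEXT NOT NULL UNIQUE,
    check_interval_minutes INTEGER NOT NULL DEFAULT 30,
    last_checked           TEXT              -- ISO-8601 timestamp
);
CREATE TABLE seen_articles (
    source_id   INTEGER NOT NULL REFERENCES sources(id),
    article_url TEXT NOT NULL,
    first_seen  TEXT NOT NULL,
    PRIMARY KEY (source_id, article_url)    -- also prevents duplicates
);
""")

conn.execute(
    "INSERT INTO sources (url, check_interval_minutes) VALUES (?, ?)",
    ("https://example.gov/press-releases", 15),
)
row = conn.execute("SELECT url, check_interval_minutes FROM sources").fetchone()
```

The composite primary key on `seen_articles` doubles as the duplicate guard: an `INSERT OR IGNORE` per discovered URL is all the dedup logic you need.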

Set realistic check frequencies. For breaking news sources, check every 10-15 minutes during business hours. For weekly press release pages, once a day is plenty. Over-polling wastes resources and can get your IP blocked by source sites. Under-polling means you miss time-sensitive content. Most local news operations settle on 15-30 minute intervals for their primary sources.

Scraping full articles from source sites

Once you know a new article exists, you need to get the full text. This is where things get technically interesting.

Simple scraping works for most government and institutional sites. You send an HTTP request to the article URL, parse the HTML response, and extract the article body. Python's requests library plus BeautifulSoup handles this for 70-80% of sources. You write a CSS selector or XPath expression that targets the main content container on each source site, and the parser pulls out the text.

The catch is that every source site has a different HTML structure. A press release on a county government website lives in a different DOM element than an article on a regional wire service. You'll need a source configuration that maps each source to its content selectors. This is a one-time setup per source — define the CSS selector for the article body, the headline element, the publication date, and the author field. Store these configurations in a JSON file or database table.
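One way to store those per-source configurations — the source IDs, field names, and selectors below are all hypothetical, not a required schema:

```python
# Per-source selector map; add one entry when onboarding each new source.
SOURCE_CONFIGS = {
    "county-gov": {
        "body_selector": "div.press-release-body",
        "headline_selector": "h1.release-title",
        "date_selector": "span.release-date",
        "needs_js": False,   # plain HTTP fetch is enough
    },
    "regional-wire": {
        "body_selector": "article .story-content",
        "headline_selector": "header h1",
        "date_selector": "time[datetime]",
        "needs_js": True,    # route to the headless-browser scraper instead
    },
}

def config_for(source_id: str) -> dict:
    """Look up the scraping configuration for a known source."""
    return SOURCE_CONFIGS[source_id]
```

The `needs_js` flag is how the scraper decides between a plain HTTP fetch and the headless-browser path described next.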

JavaScript-rendered pages require a headless browser. Some sites load their content dynamically after the initial page load, which means a simple HTTP request returns an empty content container. For these sources, you use Playwright (Python) or Puppeteer (Node.js) to launch a headless browser, wait for the page to fully render, and then extract the content from the rendered DOM. This is slower and more resource-intensive than simple HTTP requests, so only use it for sources that actually need it.

PDFs and press releases in document form require a different approach. Python's pypdf (the maintained successor to PyPDF2) or pdfplumber can extract text from most PDFs, though formatting gets messy with multi-column layouts and embedded tables. For scanned PDFs (images of text rather than actual text), you need OCR — Tesseract is the standard open-source option, though cloud OCR services such as AWS Textract or Google Document AI produce better results on complex layouts.

Press release PDFs from government agencies tend to follow consistent templates. Once you've figured out the extraction pattern for one PDF from a given agency, the same pattern usually works for all of them. Build template-specific extractors and map them to their sources.

Rate limiting and politeness matter. Space out your requests to any single domain — one request every 2-5 seconds is a reasonable default. Rotate user agents if you're scraping at scale. Respect robots.txt files. If a source site asks you to stop scraping, stop. The relationships between your publication and your sources matter more than any automation.
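A minimal per-domain throttle along these lines — the class and its defaults are illustrative, and the clock and sleep hooks are injectable only so the sketch can be exercised without actually waiting:

```python
import time
from collections import defaultdict
from urllib.parse import urlparse

class DomainThrottle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_delay_seconds=3.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self.last_request = defaultdict(lambda: float("-inf"))
        self._clock = clock   # injectable for testing
        self._sleep = sleep

    def wait(self, url: str) -> None:
        """Block until it's polite to hit this URL's domain, then record it."""
        domain = urlparse(url).netloc
        elapsed = self._clock() - self.last_request[domain]
        if elapsed < self.min_delay:
            self._sleep(self.min_delay - elapsed)
        self.last_request[domain] = self._clock()

# Exercise with a fake clock so nothing actually sleeps.
sleeps = []
t = {"now": 0.0}
throttle = DomainThrottle(min_delay_seconds=3.0,
                          clock=lambda: t["now"], sleep=sleeps.append)
throttle.wait("https://example.gov/page-1")   # first request: no delay
t["now"] = 1.0
throttle.wait("https://example.gov/page-2")   # only 1s elapsed: pause 2s
```

Call `throttle.wait(url)` immediately before every fetch; requests to different domains never block each other.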

AI processing and content transformation

Raw scraped content is rarely ready to publish. It needs reformatting, a new headline, a summary, category assignments, and sometimes rewriting to match your publication's style. This is where a language model earns its keep.

The processing step sends the scraped article text to Claude or GPT-4 with a structured prompt. The prompt defines exactly what you want back: a reformatted article body, a headline under 70 characters, a two-sentence summary for social sharing, a primary category (from a predefined list), relevant tags, and a suggested featured image search query.

Here's what a production prompt structure looks like for a local news publisher:

You give the AI model the raw article text and a system prompt that says: "You are an editor for [Publication Name], a local news site covering [Region]. Reformat the following press release into a news article suitable for our readers. Keep all factual claims and quotes intact. Do not add information that isn't in the source. Write in AP style. Generate a headline under 70 characters. Assign one primary category from this list: [Local Government, Public Safety, Education, Business, Community Events, Health, Transportation]. Suggest up to five tags."

The model returns structured output — ideally as JSON — that your system can parse and route to the next stage. Using Claude's structured output feature or GPT-4's function calling ensures the response follows your expected format every time.
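A sketch of the parse-and-validate step, assuming the JSON shape shown is the one your prompt enforces — the field names are our own convention, not anything the model APIs mandate, and the response text is a hand-written example rather than real model output:

```python
import json

ALLOWED_CATEGORIES = {"Local Government", "Public Safety", "Education",
                      "Business", "Community Events", "Health", "Transportation"}

def parse_article(response_text: str) -> dict:
    """Parse and validate the model's structured output before queueing it."""
    article = json.loads(response_text)
    if len(article["headline"]) > 70:
        raise ValueError("headline over 70 characters")
    if article["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {article['category']}")
    if len(article["tags"]) > 5:
        raise ValueError("too many tags")
    return article

# A stand-in for the model's JSON response.
raw_response = """{
  "headline": "County Council Weighs New Parking Ordinance",
  "summary": "The council reviewed a proposed parking ordinance Tuesday.",
  "body": "<p>The county council on Tuesday considered a new ordinance...</p>",
  "category": "Local Government",
  "tags": ["county council", "parking", "ordinance"]
}"""

article = parse_article(raw_response)
```

Anything that fails validation goes back for reprocessing or straight to the moderation queue flagged for attention, rather than silently through.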

Processing nuances matter. For press releases, the AI strips the corporate boilerplate (the "About Company X" footer, the media contact block, the legal disclaimers) and converts the self-promotional tone into straightforward reporting language. For police blotters and court records, the AI applies standard privacy practices — full names for public officials, initials or omission for minors, clear attribution of charges as allegations.

If you're processing content at volume — 50+ articles per day — batching your AI calls saves money (the major model providers offer discounted batch APIs) at the cost of slower turnaround per article. Group articles by source or category, process them as a batch, and review the results together. Our guide on building AI-powered follow-up workflows covers similar batching strategies applied to sales messages, and the same principles apply to content processing.

For publishers running AI processing across multiple content types, machine learning pipelines can handle classification and categorization without per-article API calls. You train a lightweight classifier on your existing article archive, and it assigns categories and tags locally. Save the API calls for the actual content transformation and headline generation, where a large language model's capabilities justify the cost.

Why human review is non-negotiable

Here's where some publishers get tempted to cut corners. If the AI can process the article, why not publish it directly? Skip the moderation step and you can post content within minutes of it appearing on a source site.

Don't do this. Human editorial review before publishing is non-negotiable, for three reasons.

Accuracy. AI models can misinterpret source material. They occasionally rephrase a sentence in a way that changes its meaning. A press release saying a city council "considered" a new ordinance might become an article saying the council "approved" it. That kind of error damages your credibility with readers and your relationship with local government sources. An editor catches this in 30 seconds. An unsupervised AI won't catch it at all.

Legal exposure. Publishing scraped and reformatted content without editorial review creates legal risk. Copyright claims, defamation liability, and privacy violations all become your problem the moment content goes live on your domain. An editor verifies that the article falls within fair use or your syndication agreement, that quotes are accurately attributed, and that the content doesn't include information that shouldn't be public (sealed court records, juvenile names, victim identities in certain crime categories).

Editorial judgment. Not every press release deserves publication. Not every police report should become a news article. Your editors make judgment calls about newsworthiness, timing, and sensitivity that AI can't replicate. A school district press release about a new reading program might be worth covering. The same district's press release about their new logo probably isn't. These decisions define your publication's editorial identity.

The moderation layer should be a simple web dashboard. Processed articles appear in a queue with their generated headlines, summaries, categories, and the original source link. Editors can approve (publish as-is), edit (modify before publishing), or reject (discard). Each action takes one click plus optional edits. Build this as a lightweight web app using any framework your team is comfortable with — React, Vue, or even a simple server-rendered page.
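The queue logic behind those three buttons fits in a few lines — all names here are illustrative, not a prescribed data model:

```python
from dataclasses import dataclass, field

VALID_ACTIONS = {"approve", "edit", "reject"}

@dataclass
class QueuedArticle:
    headline: str
    source_url: str
    status: str = "pending"
    edits: dict = field(default_factory=dict)

    def review(self, action: str, **changes) -> None:
        """Apply an editor's decision to this queued article."""
        if action not in VALID_ACTIONS:
            raise ValueError(f"unknown action: {action}")
        if action == "edit":
            self.edits.update(changes)   # e.g. a rewritten headline
            self.status = "approved"     # edited articles still publish
        else:
            self.status = "approved" if action == "approve" else "rejected"

item = QueuedArticle("County Weighs Parking Ordinance",
                     "https://example.gov/releases/102")
item.review("edit", headline="Parking Ordinance Up for Debate")
```

Only articles whose status reaches "approved" ever move on to the publish stage; rejected ones are kept (with a reason) for the feedback loop described below.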

Track approval rates and rejection reasons. If the AI consistently generates headlines that editors rewrite, adjust your prompt. If certain sources produce content that gets rejected 80% of the time, remove them from your source list. This feedback loop is how the system improves over time. If you want to understand how we approach building internal tools like this review dashboard, our web development services page covers our process.

Publishing to WordPress via REST API

WordPress powers roughly 40% of the web, and most local news sites run on it. The WordPress REST API lets your automation system create posts programmatically — complete with titles, content, categories, tags, featured images, and custom fields.

The publish step takes an approved article from your moderation dashboard and sends a POST request to yoursite.com/wp-json/wp/v2/posts. The request body includes the title, content (as HTML), status (draft or publish), category IDs, tag IDs, and any custom fields your theme requires.

Authentication uses application passwords (built into WordPress since version 5.6) or JWT tokens. Application passwords are simpler to set up: generate one on the user's profile screen in the WordPress admin (Users → Profile → Application Passwords), and pass it as a Basic Auth header with your API requests.
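A sketch of assembling such a request with the standard library — the username and application password are placeholders, the article fields follow the conventions used earlier in this article, and the request is built but deliberately not sent:

```python
import base64
import json
import urllib.request

WP_USER = "automation-bot"            # hypothetical service account
WP_APP_PASSWORD = "xxxx xxxx xxxx"    # generated in the WordPress admin

def build_post_request(site_url: str, article: dict) -> urllib.request.Request:
    """Assemble (but don't send) a WordPress REST API post-creation request."""
    token = base64.b64encode(
        f"{WP_USER}:{WP_APP_PASSWORD}".encode()).decode()
    body = json.dumps({
        "title": article["headline"],
        "content": article["body"],          # HTML
        "status": "draft",                   # switch to "publish" once trusted
        "categories": article["category_ids"],
        "tags": article["tag_ids"],
    }).encode()
    return urllib.request.Request(
        f"{site_url}/wp-json/wp/v2/posts",
        data=body,
        headers={"Authorization": f"Basic {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_post_request("https://example.com", {
    "headline": "Budget Hearing Scheduled",
    "body": "<p>The county will hold a public budget hearing...</p>",
    "category_ids": [4], "tag_ids": [12, 31],
})
```

In production you'd send it with `urllib.request.urlopen(req)` (or swap in the requests library) and check for a 201 response containing the new post's ID.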

Categories and tags need to match your WordPress taxonomy. Before creating a post, your system should look up the category ID for "Local Government" or "Public Safety" from the WordPress API. Cache these IDs locally so you're not making extra API calls for every article. If a tag doesn't exist yet, the API can create it on the fly.
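The caching described above can be as small as this — the `fetch` callable stands in for the real lookup against `/wp/v2/categories`, faked here so the sketch runs offline:

```python
_category_cache: dict[str, int] = {}

def category_id(name: str, fetch) -> int:
    """Resolve a category name to its WordPress term ID, caching results."""
    if name not in _category_cache:
        _category_cache[name] = fetch(name)
    return _category_cache[name]

# Fake API lookup that records how often it's actually called.
api_calls = []
def fake_fetch(name: str) -> int:
    api_calls.append(name)
    return {"Local Government": 4, "Public Safety": 7}[name]

first = category_id("Local Government", fake_fetch)
second = category_id("Local Government", fake_fetch)   # served from cache
```

One API call per category name for the lifetime of the process, no matter how many articles you publish.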

Featured images require a two-step process. First, upload the image to the WordPress media library using the /wp/v2/media endpoint. The API returns a media ID. Then set that media ID as the featured_media field on your post. If your AI processing step generates an image search query, you can integrate with Unsplash's API or Pexels to find a relevant royalty-free image, download it, and upload it to WordPress automatically.

For publishers running multiple WordPress sites, the system posts the same article (or variants of it) to each site's API. Each site gets its own API credentials and category mapping. This is where the architecture pays off — the scraping and AI processing happen once, and only the publish step multiplies across sites.

Custom post types and Advanced Custom Fields (ACF) work through the REST API too — recent ACF versions can expose fields natively via the show_in_rest field setting, while older setups rely on the ACF to REST API plugin. If your theme uses custom fields for bylines, source attribution, or article types, these get included in the API request body.

Test your WordPress integration with draft posts first. Set the post status to "draft" and verify that titles, content formatting, categories, and images all look correct before switching to auto-publish. You don't want to discover a formatting bug after 50 malformed articles have gone live. This kind of staged deployment is central to how we approach automation projects — our process page breaks down how we validate systems before they touch production data.

Running one system across multiple publications

Media companies that operate multiple local news sites get the most value from content automation. Instead of each site's editor independently monitoring the same government press release pages, one system handles all the monitoring, scraping, and AI processing centrally. Each site's editors only see content relevant to their coverage area.

The multi-site architecture adds a routing layer between AI processing and moderation. After the AI categorizes an article by topic and geographic region, the routing layer assigns it to the appropriate site's moderation queue. A county-level press release about road construction goes to the site covering that county. A state-level policy announcement goes to all sites in the state. A school district announcement goes only to the site covering that district's area.

Geographic routing uses a mapping table: source → geographic coverage area → publication. When you add a new source, you tag it with its geographic scope. When you add a new publication, you define its coverage area. The routing logic is just a lookup.
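That lookup might look like this — every source ID, area name, and publication slug below is invented for illustration:

```python
# source → geographic coverage area
SOURCE_COVERAGE = {
    "madison-county-gov": "madison-county",
    "state-dot": "statewide",
}

# coverage area → publications whose audience it reaches
PUBLICATIONS_BY_AREA = {
    "madison-county": ["madison-daily"],
    "statewide": ["madison-daily", "riverton-times"],
}

def route(source_id: str) -> list[str]:
    """Return the moderation queues a source's articles should land in."""
    area = SOURCE_COVERAGE[source_id]
    return PUBLICATIONS_BY_AREA[area]
```

Adding a source means one entry in the first table; adding a publication means listing it under the areas it covers. No routing code changes either way.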

This approach also handles content variants. Some articles need different headlines or ledes for different publications. The AI can generate site-specific variants in the processing step — same facts, different framing for different audiences. A county government budget story might lead with property tax impacts for a residential-focused site and with infrastructure spending for a business-focused one.

Shared moderation across sites is optional. Some media companies prefer a single editorial team reviewing content for all publications. Others give each site's editor their own queue. The system supports both models — it's just a filter on the moderation dashboard.

Building the system step by step

Here's how to build this from scratch, assuming you have a developer who's comfortable with Python or Node.js.

Week one: Source audit and scraping. List every source your newsroom currently monitors manually. For each source, document the URL, update frequency, content format (HTML article, PDF, RSS feed), and the CSS selectors needed to extract the article body. Build the scraping layer for your top 10 sources. Test each scraper against five real articles to verify the extraction works cleanly.

Week two: AI processing pipeline. Set up the AI processing step. Write your system prompt, defining your publication's style, categories, and output format. Process 20 scraped articles and review the output manually. Refine your prompt based on what the AI gets wrong — headlines too long, categories misassigned, source attribution missing. Iterate until the processing output is good enough that an editor only needs minor tweaks.

Week three: Moderation dashboard and WordPress integration. Build the review interface. This doesn't need to be fancy — a table of pending articles with approve/edit/reject buttons, plus a preview pane showing the article as it will appear on the site. Connect the approve action to the WordPress REST API so approved articles get posted automatically.

Week four: Orchestration and monitoring. Connect all three stages using n8n or Make.com. Set up the scheduling — which sources get checked when. Add error alerting so your team knows when a scraper breaks (source sites redesign, URLs change, rate limits get hit). Run the full pipeline end-to-end and fix whatever breaks.

If you don't have a developer on staff, this is exactly the kind of project we build through our AI systems and automation practice. We handle the technical build and hand off a working system your editorial team can operate independently.

For publishers already using GoHighLevel or similar platforms for their marketing automation, the content pipeline can share infrastructure. Lead capture workflows and content publishing workflows both benefit from the same orchestration tools. Our work with GoHighLevel automation often includes content distribution alongside sales and marketing workflows.

After the initial build, maintenance is light. You'll add new sources as you discover them (30-60 minutes per source for scraper setup). You'll adjust AI prompts when editorial standards change. And you'll fix scrapers when source sites redesign — which happens a few times a year per source, not daily. The system runs in the background, and your editors interact with a clean review queue rather than a pile of browser tabs.

For publishers looking at broader automation beyond content — AI agent development for reader engagement, automated ad sales workflows, subscriber management — the content pipeline is usually the first project because it shows clear ROI and builds confidence in automation across the newsroom.

Ready to stop paying people to copy and paste? Get in touch and we'll scope a content automation system for your operation.

Frequently asked questions

How much does an AI content automation system cost to run?

The main variable costs are AI API usage and hosting. Claude or GPT-4 API calls for processing articles typically cost $0.01-0.05 per article depending on length and the model used. Hosting for the scraping and orchestration layer runs $20-50/month on a basic cloud server. WordPress hosting is whatever you're already paying. For a newsroom processing 50-100 articles per day, expect $50-150/month in total variable costs, not counting the initial development investment.

Can the AI write original articles, not just reformat press releases?

Technically yes, but that's a different use case with different risks. This system is designed for processing and reformatting existing content — press releases, public records, wire service stories. Writing original journalism from scratch requires original reporting, source interviews, and editorial judgment that AI can't replicate. The system frees up your reporters' time to do that original work instead of spending it on reformatting.

What happens when a source website changes its layout?

The scraper for that source will break and return an error or empty content. Your monitoring system should alert your team when a scraper fails. Fixing it means updating the CSS selectors in your source configuration to match the new site structure. This usually takes 15-30 minutes per source and happens a few times per year for any given site.

How do I handle sources that require a login or subscription?

If you have legitimate access credentials (a press portal login, a wire service subscription), your scraper can authenticate programmatically using the same credentials. Store them securely — environment variables or a secrets manager, never hardcoded. If you don't have authorized access, don't scrape behind the login wall. That crosses legal and ethical lines.

Is it legal to scrape and republish government press releases?

Government press releases in the United States are generally public domain works, meaning they aren't protected by copyright. However, this varies by jurisdiction and agency. State and local government works may or may not be in the public domain depending on the state's laws. Always attribute the source, and consult a media attorney if your operation is large enough to justify the cost.

Can this system work with CMS platforms other than WordPress?

Yes. Any CMS with a REST API or webhook support can receive content from this system. Drupal, Ghost, Webflow, and custom-built systems all work. The publish step just needs to know the API endpoint, authentication method, and data format for the target CMS. WordPress is the most common and best-documented option, which is why it gets the focus here.

How long does it take to set up the full pipeline?

For a developer experienced with Python or Node.js and API integrations, expect three to four weeks for a production-ready system covering 10-20 sources and one publication. Adding more sources is incremental after the initial build. Multi-site setups add another one to two weeks for routing and per-site configuration.

What's the difference between this and using an RSS reader?

An RSS reader shows you new content. This system fetches full articles, processes them with AI to match your publication's format and style, runs them through editorial review, and publishes them to your CMS with proper categories and metadata. It automates the entire workflow from discovery to publication, not just the discovery step.


Need Help Implementing This?

Our team at Luminous Digital Visions specializes in SEO, web development, and digital marketing. Let us help you achieve your business goals.

Get Free Consultation