Migrating a WordPress Blog to Django (Wagtail)

LEAD DATA SCIENCE

Kail Kuhn Schlicht

Problem Statement

Our company website has been running on PHP + WordPress for years, but it’s time to leave behind plugin roulette, limited extensibility, and the burden of maintaining a non-Python backend.

We wanted a single Python codebase where we could:

Keep a first‑class CMS experience for non‑technical editors.
Share one auth/session layer across the Django apps we build in the future.

Django was the obvious framework home, but which CMS layer? We compared Mezzanine, Django‑CMS, plain Django admin plus custom forms, and Wagtail. Wagtail won thanks to:

StreamField – Gutenberg‑style flexibility with structured data.
Image & media pipeline – renditions, focal‑point cropping, WebP out of the box.
Polished editor UI – non‑devs can hit Save & Publish confidently.
Lightweight architecture – everything is just Django models, views, and templates; no plugin maze.

The rest of this post shows how we migrated, step‑by‑step, while retaining the parts of WordPress we liked and unlocking Django’s app‑building capabilities.

1. Why Django Wagtail Instead of Plain Django

Concern	Plain Django	Django Wagtail
Editor UI	Build your own admin or rely on stock Django admin (not CMS‑friendly).	Polished CMS dashboard, StreamField blocks, image chooser, preview.
Rich Content	Custom models + custom forms.	StreamField = Gutenberg‑style flexibility with structured data.
Media & Images	Write your own thumbnail / rendition logic.	Built‑in image renditions, focal‑point cropping, collections.
SEO / Redirects	Add 3rd‑party libs.	Built‑in `wagtail.contrib.redirects`; third‑party `wagtail-seo`.
Menus / Snippets	Roll your own.	`wagtailmenus`, snippets chooser panels.
Upgrade cadence	Django LTS only.	Django + Wagtail LTS; Wagtail’s editor features evolve faster.

Bottom line: Wagtail adds the CMS layer so you don’t reinvent page editing, yet you still write pure Django under the hood.

2. What You Need to Do in Django

This section covers only the essentials that are easy to get wrong. The most important task is declaring the models that will store the data coming from WordPress.

# blog/models.py

from django.db import models
from modelcluster.contrib.taggit import ClusterTaggableManager
from modelcluster.fields import ParentalKey, ParentalManyToManyField
from taggit.models import TaggedItemBase
from wagtail.admin.panels import FieldPanel, InlinePanel
from wagtail.fields import RichTextField
from wagtail.models import Page

# ----------------- Authors -----------------
class BlogAuthor(models.Model):
    wp_id = models.PositiveIntegerField(unique=True)  # maps 1‑to‑1 with wp_users.ID
    name = models.CharField(max_length=255)
    email = models.EmailField()

    panels = [FieldPanel("name"), FieldPanel("email")]

    def __str__(self):
        return self.name


# ----------------- Categories -----------------
class BlogCategory(models.Model):
    name = models.CharField(max_length=255)
    slug = models.SlugField(unique=True)

    panels = [FieldPanel("name"), FieldPanel("slug")]

    def __str__(self):
        return self.name


# ----------------- Tags -----------------
class BlogPageTag(TaggedItemBase):
    content_object = ParentalKey(
        "blog.BlogPage", related_name="tagged_items", on_delete=models.CASCADE
    )


# ----------------- Blog index -----------------
class BlogIndexPage(Page):
    intro = RichTextField(blank=True)

    content_panels = Page.content_panels + [FieldPanel("intro")]

    parent_page_types = ["home.HomePage"]
    subpage_types = ["blog.BlogPage"]

    class Meta:
        verbose_name = "Blog Index Page"


# ----------------- Individual post -----------------
class BlogPage(Page):
    date = models.DateField("Post date")
    author = models.ForeignKey(
        "blog.BlogAuthor", null=True, blank=True, on_delete=models.SET_NULL
    )
    cover_image = models.ForeignKey(
        "wagtailimages.Image", null=True, blank=True, on_delete=models.SET_NULL, related_name="+"
    )
    body = RichTextField(blank=True)
    categories = ParentalManyToManyField("blog.BlogCategory", blank=True)
    tags = ClusterTaggableManager(through=BlogPageTag, blank=True)

    content_panels = Page.content_panels + [
        FieldPanel("date"),
        FieldPanel("author"),
        FieldPanel("cover_image"),
        FieldPanel("body"),
        FieldPanel("categories"),
        FieldPanel("tags"),
        InlinePanel("post_comments", label="Comments"),
    ]

    class Meta:
        verbose_name = "Blog Post"


# ----------------- Comments -----------------
class BlogComment(models.Model):
    post = ParentalKey(
        "blog.BlogPage", related_name="post_comments", on_delete=models.CASCADE
    )
    author = models.CharField(max_length=255)
    email = models.EmailField(blank=True)
    content = models.TextField()
    approved = models.BooleanField(default=False)
    date = models.DateTimeField(auto_now_add=True)
    parent = models.ForeignKey("self", null=True, blank=True, on_delete=models.SET_NULL)

    panels = [
        FieldPanel("author"),
        FieldPanel("email"),
        FieldPanel("content"),
        FieldPanel("approved"),
    ]

    def __str__(self):
        return f"Comment by {self.author} on {self.post.title}"

Why these choices?

wp_id on BlogAuthor gives the XML import a stable key: authors are matched on their WordPress ID, so re‑running the import does not create duplicates.
ParentalManyToManyField keeps category edits inline—no admin hopping.
RichTextField for body keeps import simple; later we can convert to StreamField blocks if we need richer layouts.
Comments live as a child relation so editors can moderate without leaving the post.
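The `parent` self‑reference on BlogComment is what makes threaded display possible. Below is a minimal sketch of turning flat comment rows into a nested thread. `CommentRow` and `build_thread` are hypothetical names used only for illustration; the dataclass is a plain stand‑in for the Django model, not part of the project.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CommentRow:
    """Hypothetical stand-in for a BlogComment row (not the Django model)."""

    id: int
    author: str
    parent_id: Optional[int] = None
    replies: list = field(default_factory=list)


def build_thread(rows):
    """Return top-level comments with replies attached, preserving input order."""
    by_id = {row.id: row for row in rows}
    roots = []
    for row in rows:
        parent = by_id.get(row.parent_id) if row.parent_id is not None else None
        if parent is None:
            # No parent, or the parent no longer exists (SET_NULL): top-level.
            roots.append(row)
        else:
            parent.replies.append(row)
    return roots


rows = [
    CommentRow(1, "Ada"),
    CommentRow(2, "Linus", parent_id=1),
    CommentRow(3, "Grace"),
    CommentRow(4, "Ada", parent_id=2),
]
thread = build_thread(rows)
```

In a template, the same idea would apply to a queryset of approved comments: group once in Python, then render the nested `replies` recursively.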

With models in place, we can now parse XML and hydrate these objects—next section shows the command that does exactly that.

3. XML Import Pain — Why Existing Libraries Fell Short

Before writing our own parser, we trial‑ran every library we could find:

Library	Status in 2025	Deal‑Breaker
`wagtail-wordpress-import`	Alpha, opinionated	Tied to their demo models; mis‑mapped our categories
`wagtail-transfer`	Production‑ready—but Wagtail → Wagtail	Not designed for WordPress
`django-import-export`	Generic fixtures	Knows nothing about WordPress schema

After three evenings of trial‑and‑error, the verdict was clear: roll our own once, run it forever.

Export your Data from WordPress

In your WordPress dashboard, go to “Tools” → “Export”, select “All content”, and click “Download Export File”.
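Before importing, it can help to check what the export file actually contains. The sketch below counts items by `wp:post_type` using only the standard library. It assumes the export uses the `http://wordpress.org/export/1.2/` namespace (the one the import command in this post relies on); `summarize_export` and the inline `SAMPLE` document are illustrative, not part of the project.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: WXR 1.2 namespace, matching the import command in this post.
WP_NS = {"wp": "http://wordpress.org/export/1.2/"}


def summarize_export(xml_text):
    """Count <item> elements in a WordPress export by their wp:post_type."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for item in root.findall("./channel/item"):
        post_type = item.findtext("wp:post_type", default="unknown", namespaces=WP_NS)
        counts[post_type] += 1
    return counts


# Tiny hand-written sample standing in for a real export file.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item><title>Hello</title><wp:post_type>post</wp:post_type></item>
    <item><title>Second</title><wp:post_type>post</wp:post_type></item>
    <item><title>About</title><wp:post_type>page</wp:post_type></item>
    <item><title>cover.jpg</title><wp:post_type>attachment</wp:post_type></item>
  </channel>
</rss>"""

counts = summarize_export(SAMPLE)
```

For a real file, read it from disk first and pass the text in; comparing these counts with the number of pages created afterwards is a cheap sanity check on the import.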

Our Custom Management Command

After wrestling with half‑maintained libraries, we wrote a single management command that:

Parses WordPress XML via ElementTree (faster, no type leaks).
Cleans messy HTML with BeautifulSoup + Bleach so Wagtail’s rich‑text import never chokes.
Imports authors, categories, tags, cover images, and nested comments in two passes (first create, then parent‑link).

# blog/management/commands/import_wordpress.py

import html
import os
import xml.etree.ElementTree as ET
from datetime import datetime
from email.utils import parsedate_to_datetime

import bleach
import requests
from bs4 import BeautifulSoup
from django.core.files.base import ContentFile
from django.core.management.base import BaseCommand
from django.utils import timezone
from django.utils.dateparse import parse_datetime
from django.utils.text import slugify
from wagtail.images.models import Image

from blog.models import BlogAuthor, BlogCategory, BlogComment, BlogIndexPage, BlogPage

# Namespaces for WordPress XML
ns = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "wp": "http://wordpress.org/export/1.2/",
}

# ------------------------------------------------------------------
# Clean up WordPress HTML so Wagtail's rich‑text converter doesn't
# choke on orphan <li> tags or other malformed markup.
ALLOWED_TAGS = [
    "p",
    "br",
    "strong",
    "em",
    "a",
    "ul",
    "ol",
    "li",
    "blockquote",
    "h1",
    "h2",
    "h3",
    "h4",
    "h5",
    "h6",
    "img",
    "pre",
    "code",
    "hr",
]
ALLOWED_ATTRS = {"a": ["href", "title"], "img": ["src", "alt"]}


def clean_wp_html(raw: str) -> str:
    soup = BeautifulSoup(raw or "", "html.parser")

    # Wrap orphan <li> in <ul>
    for li in soup.find_all("li"):
        if li.parent.name not in ("ul", "ol"):
            wrapper = soup.new_tag("ul")
            li.wrap(wrapper)

    # Strip stray attributes from <br> tags; they are normalised to the
    # self-closing <br /> form after bleach has run (see below).
    for br in soup.find_all("br"):
        br.attrs = {}

    cleaned = bleach.clean(
        str(soup),
        tags=ALLOWED_TAGS,
        attributes=ALLOWED_ATTRS,
        strip=True,
    )
    # Replace any <br></br> pairs or bare <br> tags that slipped through with a
    # self-closing tag so Wagtail's converter doesn't see an implicit </br>.
    cleaned = cleaned.replace("<br></br>", "<br />").replace("<br>", "<br />")
    return cleaned


# ------------------------------------------------------------------


class Command(BaseCommand):
    help = "Import WordPress blog posts and comments from XML export"

    def add_arguments(self, parser):
        parser.add_argument("xml_file", type=str, help="Path to WordPress XML file")

    def handle(self, *args, **kwargs):
        xml_file = kwargs["xml_file"]
        tree = ET.parse(xml_file)
        root = tree.getroot()

        # Preload WordPress authors
        authors_map = {}
        for author in root.findall("./channel/wp:author", ns):
            login = author.findtext("wp:author_login", namespaces=ns)
            authors_map[login] = {
                "id": author.findtext("wp:author_id", namespaces=ns),
                "name": author.findtext("wp:author_display_name", namespaces=ns),
                "email": author.findtext("wp:author_email", namespaces=ns),
            }

        # Preload attachments for cover images
        attachments_map = {}
        for item in root.findall("./channel/item"):
            if item.findtext("wp:post_type", namespaces=ns) == "attachment":
                attachments_map[item.findtext("wp:post_id", namespaces=ns)] = item.findtext("wp:attachment_url", namespaces=ns)

        # Map post ID to cover URL via _thumbnail_id postmeta
        thumbnail_map = {}
        for item in root.findall("./channel/item"):
            if item.findtext("wp:post_type", namespaces=ns) == "post":
                post_id = item.findtext("wp:post_id", namespaces=ns)
                for pm in item.findall("./wp:postmeta", namespaces=ns):
                    if pm.findtext("wp:meta_key", namespaces=ns) == "_thumbnail_id":
                        thumb_id = pm.findtext("wp:meta_value", namespaces=ns)
                        thumbnail_map[post_id] = attachments_map.get(thumb_id)

        # Helper: generate a unique slug (≤255 chars) among the BlogIndexPage’s children
        def _generate_unique_slug(parent_page, title):
            # Truncate to 250 so a numeric suffix still fits in the 255-char slug field.
            base_slug = slugify(title or "")[:250] or "post"
            slug = base_slug
            suffix = 1
            while parent_page.get_children().filter(slug=slug).exists():
                slug = f"{base_slug}-{suffix}"
                suffix += 1
            return slug

        # Get BlogIndexPage (must be created manually first)
        # .first() returns None when no row exists; it never raises DoesNotExist.
        blog_index = BlogIndexPage.objects.first()
        if not blog_index:
            self.stderr.write(
                "❌ No BlogIndexPage found. Please create one in the Wagtail admin first."
            )
            return

        for item in root.findall("./channel/item"):
            post_type = item.find("./wp:post_type", ns)
            if post_type is not None and post_type.text == "post":
                # Capture this post’s ID
                wp_post_id = item.findtext("wp:post_id", namespaces=ns)
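                title = item.findtext("title") or "Untitled"
                # findtext() returns "" instead of raising when <content:encoded> is missing.
                raw_body = html.unescape(
                    item.findtext("content:encoded", default="", namespaces=ns)
                )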
                content = clean_wp_html(raw_body)
                pub_date = item.findtext("pubDate")
                # Parse publication date (WordPress uses RFC 2822). Fallback to now().
                try:
                    parsed_date = parsedate_to_datetime(pub_date) if pub_date else None
                except (TypeError, ValueError):
                    parsed_date = None
                if parsed_date is None:
                    parsed_date = timezone.now()

                # --- WordPress author mapping
                login = item.findtext("dc:creator", namespaces=ns)
                author_data = authors_map.get(login)
                if author_data:
                    author_obj, _ = BlogAuthor.objects.get_or_create(
                        wp_id=author_data["id"],
                        defaults={"name": author_data["name"], "email": author_data["email"]},
                    )
                else:
                    author_obj = None

                categories = [
                    c.text
                    for c in item.findall("category")
                    if c.get("domain") == "category"
                ]
                tags = [
                    t.text
                    for t in item.findall("category")
                    if t.get("domain") == "post_tag"
                ]

                blog_page = BlogPage(
                    title=title,
                    slug=_generate_unique_slug(blog_index, title),
                    author=author_obj,
                    body=content,
                    date=parsed_date.date(),
                )

                # add_child() inserts the page into the tree and saves it.
                blog_index.add_child(instance=blog_page)

                # Import cover image if available
                cover_url = thumbnail_map.get(wp_post_id)
                if cover_url:
                    try:
                        resp = requests.get(cover_url, timeout=30)  # don't hang on a dead host
                        resp.raise_for_status()
                        image_name = os.path.basename(cover_url)
                        image_file = ContentFile(resp.content, name=image_name)
                        wagtail_image = Image.objects.create(title=f"Cover for {title}", file=image_file)
                        blog_page.cover_image = wagtail_image
                        blog_page.save(update_fields=["cover_image"])
                    except Exception as e:
                        self.stderr.write(f"Failed to import cover image for {title}: {e}")

                # Tags (ClusterTaggableManager handles creation)
                if tags:
                    blog_page.tags.add(*tags)

                for cat in categories:
                    category_obj, _ = BlogCategory.objects.get_or_create(
                        slug=slugify(cat), defaults={"name": cat}
                    )
                    blog_page.categories.add(category_obj)

                # Final save and publish
                blog_page.save()
                blog_page.save_revision().publish()

                # ——— prepare to map WordPress comment IDs → BlogComment objects
                comments_map = {}

                # First pass: create comments without parents
                for comment in item.findall("./wp:comment", ns):
                    # Ignore pingbacks / trackbacks (WordPress marks them via <wp:comment_type>)
                    ctype = comment.findtext(
                        "wp:comment_type", default="", namespaces=ns
                    )
                    if ctype and ctype.strip() not in ("", "comment"):
                        continue

                    wp_comment_id = comment.findtext("wp:comment_id", namespaces=ns)
                    parent_wpid = (
                        comment.findtext("wp:comment_parent", namespaces=ns) or None
                    )

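                    # WordPress stores wp:comment_date as naive local time.
                    raw_date = comment.findtext(
                        "wp:comment_date", default="", namespaces=ns
                    )
                    try:
                        comment_date = parse_datetime(raw_date)
                    except ValueError:
                        comment_date = None
                    comment_date = timezone.make_aware(comment_date or datetime.now())

                    comment_obj = BlogComment.objects.create(
                        post=blog_page,
                        author=comment.findtext(
                            "wp:comment_author", default="", namespaces=ns
                        ),
                        email=comment.findtext(
                            "wp:comment_author_email", default="", namespaces=ns
                        ),
                        content=html.unescape(
                            comment.findtext(
                                "wp:comment_content", default="", namespaces=ns
                            )
                        ),
                        approved=comment.findtext("wp:comment_approved", namespaces=ns)
                        == "1",
                        parent=None,  # set in the second pass
                    )
                    # BlogComment.date uses auto_now_add=True, which overwrites any value
                    # passed to create(); restore the original date with a queryset update.
                    BlogComment.objects.filter(pk=comment_obj.pk).update(date=comment_date)
                    comment_obj.date = comment_date
                    comments_map[wp_comment_id] = (comment_obj, parent_wpid)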

                # Second pass: hook up parent relationships now that all comments exist
                for wp_comment_id, (comment_obj, parent_wpid) in comments_map.items():
                    if parent_wpid and parent_wpid in comments_map:
                        parent_obj, _ = comments_map[parent_wpid]
                        comment_obj.parent = parent_obj
                        comment_obj.save(update_fields=["parent"])

                self.stdout.write(
                    f"  ↳ Imported {len(comments_map)} comments, {len(tags)} tags, {len(categories)} categories for '{title}'"
                )

                self.stdout.write(self.style.SUCCESS(f"✅ Imported post: {title}"))

        self.stdout.write(
            self.style.SUCCESS("All blog posts and comments imported successfully.")
        )

You can run that script using:

python manage.py import_wordpress /path/to/wordpress.xml
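One detail of the command worth isolating is date handling, because the export mixes two formats: each post's `pubDate` is RFC 2822 and carries a UTC offset, while comment dates are plain `YYYY-MM-DD HH:MM:SS` strings. The sketch below parses both with the standard library. It assumes the export also contains a `wp:comment_date_gmt` value in the same plain format (the command above reads `wp:comment_date` instead); the function names are illustrative.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_pub_date(value):
    """Parse an RSS <pubDate> (RFC 2822); the result keeps its UTC offset."""
    return parsedate_to_datetime(value)


def parse_comment_date_gmt(value):
    """Parse 'YYYY-MM-DD HH:MM:SS' and mark it explicitly as UTC.

    Assumption: the value comes from a GMT field, so tagging it as UTC is safe.
    """
    naive = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=timezone.utc)


post_date = parse_pub_date("Mon, 06 Jan 2025 09:30:00 +0000")
comment_date = parse_comment_date_gmt("2025-01-06 10:15:00")
```

Using a GMT value avoids depending on the Django project's `TIME_ZONE` matching the timezone the WordPress site was configured with.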

4. How can you make Wagtail have a WordPress-like experience?

Must-have:

WordPress Feature	Wagtail Equivalent	Why You Probably Need It
Permalink structure	`wagtail.contrib.redirects`, `RoutablePageMixin`	Preserve SEO juice & old backlinks.
Menus (Appearance → Menus)	`wagtailmenus`	Drag‑and‑drop nav builder for editors.
Yoast SEO	`wagtailseo`, `wagtail-metadata`	Title/description previews, OpenGraph tags.
Widgets / Sidebars	Snippets + inclusion tags	Recent posts, categories list, etc.
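For the permalink row, the export file already holds everything needed to build a redirect list. The sketch below maps each post's old WordPress path to a new path. It rests on two assumptions that are not from the import command above: that the blog lives under a hypothetical `/blog/` prefix, and that each new page slug equals WordPress's `wp:post_name` (the command slugifies the title instead, so the two can differ). `build_redirect_map` is an illustrative name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

WP_NS = {"wp": "http://wordpress.org/export/1.2/"}


def build_redirect_map(xml_text, new_prefix="/blog/"):
    """Return {old WordPress path: new path} for every exported post."""
    root = ET.fromstring(xml_text)
    redirects = {}
    for item in root.findall("./channel/item"):
        if item.findtext("wp:post_type", namespaces=WP_NS) != "post":
            continue
        old_path = urlsplit(item.findtext("link", default="")).path
        slug = item.findtext("wp:post_name", default="", namespaces=WP_NS)
        if not old_path or not slug:
            continue  # no permalink or no slug: nothing to redirect
        new_path = f"{new_prefix}{slug}/"
        if old_path != new_path:
            redirects[old_path] = new_path
    return redirects


# Hand-written sample standing in for a real export file.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item>
      <link>https://example.com/2021/03/hello-world/</link>
      <wp:post_name>hello-world</wp:post_name>
      <wp:post_type>post</wp:post_type>
    </item>
    <item>
      <link>https://example.com/about/</link>
      <wp:post_name>about</wp:post_name>
      <wp:post_type>page</wp:post_type>
    </item>
  </channel>
</rss>"""

redirect_map = build_redirect_map(SAMPLE)
```

Each pair can then be entered as a redirect in the Wagtail admin, or created in bulk once the real slugs of the imported pages are known.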

Nice‑to‑haves:

WordPress Feature	Wagtail Equivalent	Notes
Gutenberg Blocks	Custom StreamField blocks	Re‑create fancy layouts with icons & help‑text.
Comments	Disqus embed or `wagtail-commenting`	Offload spam filtering & moderation.
Multilingual (WPML/Polylang)	`wagtail-localize`	Locale‑aware URLs, translation workflow.
Forms (Contact Form 7)	`wagtailformblocks` or native `FormPage`	Email hooks, Akismet spam protection.

5. Conclusion & Key Takeaways

Migrating a mature WordPress site to Django + Wagtail isn’t a weekend hobby project, but it is far from the multi‑month effort many teams fear.

Unified Python stack → one deployment pipeline, one set of libraries, easier hiring (if needed).
Performance & security gains → Core Web Vitals up, plugin exploits down.
First‑class editor UX → StreamField, image choosers, and drag‑and‑drop menus keep non‑devs happy.
Custom import pipeline → XML‑to‑Wagtail in minutes, not days, with comments, authors, and media intact.
Incremental parity roadmap → tackle must‑have packages first, add “nice‑to‑haves” when you’re ready.
