Blogger Comments Exported

I am in the middle of moving my blog over from Blogger to self-hosting it and generating it with Pelican. One of the struggles was what to do with comments. Something like Disqus could work, but the philosophy of externally hosting comments doesn’t seem to jibe very well with the philosophy of a static website, like this one. In the end, I discovered Bernhard Scheirle’s Pelican Comment System! New comments are submitted via a mailto: link (which generates an email to me), and then each comment is stored on the backend as a separate file. The only problem left was how to import my existing comments from Blogger.

Blogger is good in that it will give you an export of everything; the bad news is that it’s one giant XML file. XML is great if you’re a computer, but a bit of a pain if you’re a human. On top of that, I could not find the format documented anywhere. After much trial and error, I was able to pull out what I needed. I’ll present the code I used to do it (Python 3.6) and then explain what it does.

  1 #! python3.6
  2 """
  3 Export Comments from Blogger XML
  4 
  5 Takes in a Blogger export XML file and spits out each comment in a separate
  6 file, such that they can be used with the [Pelican Comment System]
  7 (https://bernhard.scheirle.de/posts/2014/March/29/static-comments-via-email/).
  8 
  9 May be simple to extend to export posts as well.
 10 
 11 For a more detailed description, read my blog post at
 12     http://blog.minchin.ca/2016/12/blogger-comments-exported.html
 13 
 14 Author: Wm. Minchin -- minchinweb@gmail.com
 15 License: MIT
 16 Changes:
 17 
 18  - 2016.12.29 -- initial release
 19 """
 20 
 21 from pathlib import Path
 22 
 23 import untangle
 24 
 25 ###############################################################################
 26 # Constants                                                                   #
 27 ###############################################################################
 28 
 29 BLOGGER_EXPORT = r'c:\tmp\blog.xml'
 30 COMMENTS_DIR = 'comments'
 31 COMMENT_EXT = '.md'
 32 AUTHORS_FILENAME = 'authors.txt'
 33 
 34 ###############################################################################
 35 # Main Code Body                                                              #
 36 ###############################################################################
 37 
 38 authors_and_pics = []
 39 
 40 
 41 def main():
 42     obj = untangle.parse(BLOGGER_EXPORT)
 43 
 44     templates = 0
 45     posts = 0
 46     comments = 0
 47     settings = 0
 48     others = 0
 49 
 50     for entry in obj.feed.entry:
 51         try:
 52             full_type = entry.category['term']
 53         except TypeError:
 54             # if a post is under multiple categories
 55             for my_category in entry.category:
 56                 full_type = my_category['term']
 57                 # str.find() uses a return of `-1` to denote failure
 58                 if full_type.find('#') != -1:
 59                     break
 60             else:
 61                 others += 1
 62                 continue  # no '#' term found; skip this entry
 63 
 64         simple_type = full_type[full_type.find('#')+1:]
 65 
 66         if 'settings' == simple_type:
 67             settings += 1
 68         elif 'post' == simple_type:
 69             posts += 1
 70             # process posts here
 71         elif 'comment' == simple_type:
 72             comments += 1
 73             process_comment(entry, obj)
 74         elif 'template' == simple_type:
 75             templates += 1
 76         else:
 77             others += 1
 78 
 79     export_authors()
 80 
 81     print(f'''
 82             {templates} template
 83             {posts} posts (including drafts)
 84             {comments} comments
 85             {settings} settings
 86             {others} other entries''')
 87 
 88 
 89 def process_comment(entry, obj):
 90     # e.g. "tag:blogger.com,1999:blog-26967745.post-4115122471434984978"
 91     comment_id = entry.id.cdata
 92     # in ISO 8601 format, usable as is
 93     comment_published = entry.published.cdata
 94     comment_body = entry.content.cdata
 95     comment_post_id = entry.thr_in_reply_to['ref']
 96     comment_author = entry.author.name.cdata
 97     comment_author_pic = entry.author.gd_image['src']
 98     comment_author_email = entry.author.email.cdata
 99 
100     # add author and pic to global list
101     global authors_and_pics
102     authors_and_pics.append((comment_author, comment_author_pic))
103 
104     # use this for a filename for the comment
105     # e.g. "4115122471434984978"
106     comment_short_id = comment_id[comment_id.find('post-')+5:]
107 
108     comment_text = "date: {}\nauthor: {}\nemail: {}\n\n{}\n"\
109                         .format(comment_published,
110                                 comment_author,
111                                 comment_author_email,
112                                 comment_body)
113 
114     # article
115     for entry in obj.feed.entry:
116         entry_id = entry.id.cdata
117         if entry_id == comment_post_id:
118             article_entry = entry
119             break
120     else:
121         print("No matching article for comment", comment_id, comment_post_id)
122         # don't process comment further
123         return
124 
125     # article date published
126     article_published = article_entry.published.cdata
127 
128     # article slug
129     for link in article_entry.link:
130         if link['rel'] == 'alternate':
131             article_link = link['href']
132             break
133     else:
134         article_title = article_entry.title.cdata
135         print('Could not find slug for', article_title)
136         article_link = article_title.lower().replace(' ', '-')
137 
138     article_slug = article_link[article_link.rfind('/')+1:]
139     article_slug = article_slug.split('.html')[0]
140 
141     comment_filename = Path(COMMENTS_DIR).resolve()
142     # folder; if it doesn't exist, create it
143     comment_filename = comment_filename / article_slug
144     comment_filename.mkdir(parents=True, exist_ok=True)
145     # write the comment file
146     comment_filename = comment_filename / (comment_short_id + COMMENT_EXT)
147     comment_filename.write_text(comment_text, encoding='utf-8')
148 
149 
150 def export_authors():
151     to_export = set(authors_and_pics)
152     to_export = list(to_export)
153     to_export.sort()
154 
155     str_export = ''
156     for i in to_export:
157         str_export += (i[0] + '\t\t' + i[1] + '\n')
158 
159     authors_filename = Path(COMMENTS_DIR).resolve() / AUTHORS_FILENAME
160     authors_filename.write_text(str_export, encoding='utf-8')
161 
162 
163 if __name__ == "__main__":
164     main()

The code is written to run in Python 3.6, which was released a few days ago. If you haven’t upgraded yet, I think the only 3.6-specific feature I used was the f-string at line 81.
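For anyone on an older Python 3, the summary report at the end of main() could be built with str.format() instead of an f-string. This is only a minimal sketch: the helper name format_summary is hypothetical and the counts are made up.

```python
def format_summary(templates, posts, comments, settings, others):
    """Build the summary report without f-strings (works before Python 3.6)."""
    return (
        "{} template\n"
        "{} posts (including drafts)\n"
        "{} comments\n"
        "{} settings\n"
        "{} other entries"
    ).format(templates, posts, comments, settings, others)


# Made-up counts, for illustration only
print(format_summary(1, 42, 17, 30, 0))
```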

The other change you will need to make to use this code yourself is to update the configuration at the top of the file (lines 25-32). You will also need the XML export of your Blogger blog.

untangle is the XML library I used. It seemed to work well for the task at hand. It can easily be installed from pip:

pip install untangle

The first task was figuring out what all was in the Blogger XML export. The bulk of it was “entry” elements: the first one was my HTML template, the next batch was a bunch of Blogger settings, the third batch was my posts, and the last batch was the comments. Which group an entry fell into could be determined by looking at entry.category['term'] (see line 52). This would give a string (a Blogger URL) that ended in “#template”, “#settings”, “#post”, or “#comment”, as the case may be.
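The classification step can be tried on its own, without an export file. The sketch below assumes the term strings look like the ones described above; the helper name entry_kind and the sample URLs are illustrative, not copied from a real export.

```python
from collections import Counter


def entry_kind(term):
    """Return the text after the first '#' in a category term, or None."""
    _, sep, kind = term.partition('#')
    return kind if sep else None


# Illustrative term strings; the URL prefix is an assumption
terms = [
    'http://schemas.google.com/blogger/2008/kind#template',
    'http://schemas.google.com/blogger/2008/kind#post',
    'http://schemas.google.com/blogger/2008/kind#comment',
    'http://schemas.google.com/blogger/2008/kind#comment',
    'a-plain-label-without-a-hash',
]

# Tally the entries per kind, lumping anything unrecognized under 'other'
counts = Counter(entry_kind(term) or 'other' for term in terms)
print(counts)
```

Returning None when there is no “#” keeps a plain label from being mistaken for an entry type.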

If I had not already exported my posts, this would have been the place to do it (see line 70).

Comments were processed by pulling out the information I wanted (see lines 90-98), determining what post the comment was attached to (see lines 115-123), and then writing all the data to separate Markdown files (see line 147). Each comment is exported into a folder named after the slug of the entry it was attached to. Renaming these folders proved one of the more annoying parts, as I had cleaned up the slugs of many of my posts during their initial export. Then it was a matter of regenerating my blog (with the comments turned on) and making sure everything was working as expected. The slug renaming was the only (minor) show-stopper I ran into.
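To see the folder layout without running a whole export, the slug and file-writing steps can be sketched in isolation. The function names, the URL, and the comment data below are all made up for illustration, and the slug helper takes everything before the first “.html” so that a link without that suffix is kept whole.

```python
import tempfile
from pathlib import Path


def slug_from_link(link):
    """Return the last path segment of a post URL, minus any '.html'."""
    tail = link[link.rfind('/') + 1:]
    return tail.split('.html')[0]


def write_comment(base_dir, slug, comment_id, date, author, email, body):
    """Write one comment to <base_dir>/<slug>/<comment_id>.md; return its path."""
    folder = Path(base_dir) / slug
    folder.mkdir(parents=True, exist_ok=True)
    text = "date: {}\nauthor: {}\nemail: {}\n\n{}\n".format(
        date, author, email, body)
    target = folder / (comment_id + '.md')
    target.write_text(text, encoding='utf-8')
    return target


# Illustrative values only; written to a throwaway temporary folder
with tempfile.TemporaryDirectory() as tmp:
    slug = slug_from_link('http://blog.example.com/2016/12/my-first-post.html')
    path = write_comment(tmp, slug, '1234567890', '2016-12-29T10:00:00-07:00',
                         'Example Author', 'noreply@blogger.com', 'Nice post!')
    print(path.relative_to(tmp))
    print(path.read_text(encoding='utf-8'))
```

Each comment ends up as one small file with a few header lines, a blank line, and then the comment body.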

In other cleanup, I had also changed my name (as the default blog author) during the initial export of the blog, so I had to change that anywhere it appeared. As a final touch, I brought over some of the profile pictures of the other commenters, where available (this is what the authors.txt file, generated by lines 150-160, is designed to help with). These are configured as follows in my main pelicanconf.py file (the configuration file for Pelican):

PELICAN_COMMENT_SYSTEM_AUTHORS = {
    ('PROTIK KHAN', 'noreply@blogger.com'): "images/authors/rabiul_karim.webp",
    ('Matthew Hartzell', 'noreply@blogger.com'): "images/authors/matthew_hartzell.webp",
    ('Jens-Peter Labus', 'noreply@blogger.com'): "images/authors/jens-peter_labus.png",
    ('Bridget', 'noreply@blogger.com'): "images/authors/bridget.jpg",
    ('melissaclee', 'noreply@blogger.com'): "images/authors/melissa_lee.jpg",
    ('Melissa', 'noreply@blogger.com'): "images/authors/melissa_lee.jpg"
}
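Typing those entries by hand gets tedious with more than a few commenters, so here is a rough sketch of turning the contents of authors.txt into a starting point for that dictionary. Everything in it is an assumption for illustration: the helper names, the sample data, the `noreply@blogger.com` default (taken from the configuration above), the lowercase-with-underscores filename pattern, and the `.jpg` fallback when the picture URL has no extension. The suggested paths would still need checking against the image files actually saved.

```python
from pathlib import PurePosixPath


def parse_authors(text):
    """Split authors.txt content into (author, picture_url) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        author, _, pic_url = line.partition('\t\t')
        pairs.append((author, pic_url))
    return pairs


def suggest_config(pairs, email='noreply@blogger.com', folder='images/authors'):
    """Map (author, email) to a suggested local image path."""
    config = {}
    for author, pic_url in pairs:
        # Reuse the extension from the picture URL; '.jpg' is a guess otherwise
        suffix = PurePosixPath(pic_url).suffix or '.jpg'
        filename = author.lower().replace(' ', '_') + suffix
        config[(author, email)] = '{}/{}'.format(folder, filename)
    return config


# Made-up sample in the same "name<TAB><TAB>url" layout as authors.txt
sample = 'Jane Doe\t\thttp://example.com/pics/jane.png\n'
print(suggest_config(parse_authors(sample)))
```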

Hopefully some of this code will prove useful to someone else dealing with their Blogger export.

The code has also been posted as a gist on GitHub, so you’re welcome to submit improvements as well. If you want to download the code, this may be the simplest place to get it from.

Code is under the MIT license.

Comments

Bernhard Scheirle on Sunday, January 8, 2017

To make this nice script more discoverable for new Pelican Comment System users, why don’t you add it to the plug-in repository?

Maybe something like this would be sensible:

.
├── doc
│   ├── ...
│   └── import.md
├── identicon
├── import
│   └── blogger_comment_export.py
└── theme

And by the way nice theme!

Wm. Minchin on Tuesday, January 10, 2017

Pull request created!
