I am in the middle of moving my blog over from Blogger to self-hosting it and generating it with Pelican. One of the struggles was what to do with comments. Something like Disqus could work, but the philosophy of externally hosting comments doesn’t seem to jibe very well with the philosophy of a static website, like this one. In the end, I discovered Bernhard Scheirle’s Pelican Comment System! New comments are submitted via a mailto: link (which generates an email to me), and then each comment is stored on the backend as a separate file. The only problem left was how to import my existing comments from Blogger.
Blogger is good in that it will give you an export of everything, but the bad news is it’s one giant XML file. XML is great if you’re a computer, but a bit of a pain if you’re a human. On top of that, I could not find the format documented anywhere. After much trial and error, I was able to pull out what I needed. I’ll present the code I used to do it (Python 3.6) and then explain what it does.
  1 #! python3.6
  2 """
  3 Export Comments from Blogger XML
  4
  5 Takes in a Blogger export XML file and spits out each comment in a separate
  6 file, so that it can be used with the [Pelican Comment System]
  7 (https://bernhard.scheirle.de/posts/2014/March/29/static-comments-via-email/).
  8
  9 May be simple to extend to export posts as well.
 10
 11 For a more detailed description, read my blog post at
 12 http://blog.minchin.ca/2016/12/blogger-comments-exported.html
 13
 14 Author: Wm. Minchin -- minchinweb@gmail.com
 15 License: MIT
 16 Changes:
 17
 18 - 2016.12.29 -- initial release
 19 """
 20
 21 from pathlib import Path
 22
 23 import untangle
 24
 25 ###############################################################################
 26 # Constants #
 27 ###############################################################################
 28
 29 BLOGGER_EXPORT = r'c:\tmp\blog.xml'
 30 COMMENTS_DIR = 'comments'
 31 COMMENT_EXT = '.md'
 32 AUTHORS_FILENAME = 'authors.txt'
 33
 34 ###############################################################################
 35 # Main Code Body #
 36 ###############################################################################
 37
 38 authors_and_pics = []
 39
 40
 41 def main():
 42     obj = untangle.parse(BLOGGER_EXPORT)
 43
 44     templates = 0
 45     posts = 0
 46     comments = 0
 47     settings = 0
 48     others = 0
 49
 50     for entry in obj.feed.entry:
 51         try:
 52             full_type = entry.category['term']
 53         except TypeError:
 54             # if a post is under multiple categories
 55             for my_category in entry.category:
 56                 full_type = my_category['term']
 57                 # str.find() uses a return of `-1` to denote failure
 58                 if full_type.find('#') != -1:
 59                     break
 60             else:
 61                 # no '#...' term found; counted once under `others` below
 62                 print('No kind category found for', entry.id.cdata)
 63
 64         simple_type = full_type[full_type.find('#')+1:]
 65
 66         if 'settings' == simple_type:
 67             settings += 1
 68         elif 'post' == simple_type:
 69             posts += 1
 70             # process posts here
 71         elif 'comment' == simple_type:
 72             comments += 1
 73             process_comment(entry, obj)
 74         elif 'template' == simple_type:
 75             templates += 1
 76         else:
 77             others += 1
 78
 79     export_authors()
 80
 81     print(f'''
 82         {templates} template
 83         {posts} posts (including drafts)
 84         {comments} comments
 85         {settings} settings
 86         {others} other entries''')
 87
 88
 89 def process_comment(entry, obj):
 90     # e.g. "tag:blogger.com,1999:blog-26967745.post-4115122471434984978"
 91     comment_id = entry.id.cdata
 92     # in ISO 8601 format, usable as is
 93     comment_published = entry.published.cdata
 94     comment_body = entry.content.cdata
 95     comment_post_id = entry.thr_in_reply_to['ref']
 96     comment_author = entry.author.name.cdata
 97     comment_author_pic = entry.author.gd_image['src']
 98     comment_author_email = entry.author.email.cdata
 99
100     # add author and pic to global list
101     global authors_and_pics
102     authors_and_pics.append((comment_author, comment_author_pic))
103
104     # use this for a filename for the comment
105     # e.g. "4115122471434984978"
106     comment_short_id = comment_id[comment_id.find('post-')+5:]
107
108     comment_text = "date: {}\nauthor: {}\nemail: {}\n\n{}\n"\
109         .format(comment_published,
110                 comment_author,
111                 comment_author_email,
112                 comment_body)
113
114     # article
115     for entry in obj.feed.entry:
116         entry_id = entry.id.cdata
117         if entry_id == comment_post_id:
118             article_entry = entry
119             break
120     else:
121         print("No matching article for comment", comment_id, comment_post_id)
122         # don't process comment further
123         return
124
125     # article date published
126     article_published = article_entry.published.cdata
127
128     # article slug
129     for link in article_entry.link:
130         if link['rel'] == 'alternate':
131             article_link = link['href']
132             break
133     else:
134         article_title = article_entry.title.cdata
135         print('Could not find slug for', article_title)
136         article_link = article_title.lower().replace(' ', '-')
137
138     article_slug = article_link[article_link.rfind('/')+1:
139                                 article_link.find('.html')]
140
141     comment_filename = Path(COMMENTS_DIR).resolve()
142     # folder; if it doesn't exist, create it
143     comment_filename = comment_filename / article_slug
144     comment_filename.mkdir(parents=True, exist_ok=True)
145     # write the comment file
146     comment_filename = comment_filename / (comment_short_id + COMMENT_EXT)
147     comment_filename.write_text(comment_text)
148
149
150 def export_authors():
151     to_export = set(authors_and_pics)
152     to_export = list(to_export)
153     to_export.sort()
154
155     str_export = ''
156     for i in to_export:
157         str_export += (i[0] + '\t\t' + i[1] + '\n')
158
159     authors_filename = Path(COMMENTS_DIR).resolve() / AUTHORS_FILENAME
160     authors_filename.write_text(str_export)
161
162
163 if __name__ == "__main__":
164     main()
The code is written to run in Python 3.6, which was released a few days ago. If you haven’t upgraded yet, I think the only 3.6-specific feature I used was the f-string at line 81.
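For anyone still on an older Python 3, the summary printed at lines 81-86 could probably be built with str.format() instead. Here is a minimal sketch; the summary() helper and the counts passed to it are made up for illustration and are not part of the script above.

```python
def summary(templates, posts, comments, settings, others):
    """Build the end-of-run summary without an f-string."""
    return (
        "{} template\n"
        "{} posts (including drafts)\n"
        "{} comments\n"
        "{} settings\n"
        "{} other entries"
    ).format(templates, posts, comments, settings, others)


# Made-up counts, just to show the shape of the output.
print(summary(1, 12, 30, 25, 0))
```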
The other change you will need to make to use this code yourself is to update the configuration at the top of the file (lines 25-32). You will also need the XML export of your Blogger blog.
untangle is the XML library I used. It seemed to work well for the task at hand. It can easily be installed from pip:
pip install untangle
The first task was figuring out what was in the Blogger XML export. The bulk of it was “entry” elements: the first one was my HTML template, the next batch was a bunch of Blogger settings, the third batch was my posts, and the last batch was the comments. Which group an entry fell into could be determined by looking at entry.category['term'] (see line 52). This gives a string (a Blogger URL) that ends in “#template”, “#settings”, “#post”, or “#comment”, as the case may be.
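To make the classification concrete, here is a small sketch that uses the standard library's xml.etree.ElementTree in place of untangle (a deliberate swap, so it runs without installing anything). The tiny feed and the exact term URLs are illustrative assumptions rather than something copied from a real export; only the “#post” / “#comment” suffix convention comes from what I found above. The entry_kind() helper is likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

ATOM = "{http://www.w3.org/2005/Atom}"

# Illustrative stand-in for a Blogger export; the term URLs are assumed.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <category term="http://schemas.google.com/blogger/2008/kind#post"/>
    <category term="python"/>
  </entry>
  <entry>
    <category term="http://schemas.google.com/blogger/2008/kind#comment"/>
  </entry>
  <entry>
    <category term="no-kind-here"/>
  </entry>
</feed>"""


def entry_kind(entry):
    """Return the text after '#' in the first category term that has one."""
    for category in entry.findall(ATOM + "category"):
        term = category.get("term", "")
        if "#" in term:
            return term.rpartition("#")[2]
    # No category carried a '#kind' suffix.
    return "other"


feed = ET.fromstring(SAMPLE)
counts = Counter(entry_kind(e) for e in feed.findall(ATOM + "entry"))
print(dict(counts))
```

Checking every category for a “#” is what lets a post with extra labels (the first entry here) still be recognised as a post.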
If I had not already exported my entries, this would have been the way to do it (see line 70).
Comments were processed by pulling out the information I wanted (see lines 90-98), determining which post the comment was attached to (see lines 115-123), and then writing all the data to separate Markdown files (see line 147). Each comment is exported into a folder named after the slug of the entry it was attached to. Renaming these folders proved one of the more annoying parts, as I had cleaned up the slugs of many of my posts during their initial export. After that, I regenerated my blog (with comments turned on) and made sure everything was working as expected. The slug renaming was the only (minor) show-stopper I ran into.
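The folder and file naming can be sketched on its own. The script slices strings with find() and rfind(); the version below swaps in urllib.parse and pathlib to do roughly the same job, and the three helper names are hypothetical. The sample id and URL both appear in the script above, but pairing them is purely for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def short_id(comment_id):
    """Keep only what follows 'post-' in a Blogger entry id."""
    return comment_id.rpartition("post-")[2]


def slug_from_link(link):
    """Take the last path segment of a post URL and drop its extension."""
    return PurePosixPath(urlparse(link).path).stem


def comment_path(comments_dir, link, comment_id, ext=".md"):
    """Build <comments_dir>/<article slug>/<short comment id><ext>."""
    filename = short_id(comment_id) + ext
    return PurePosixPath(comments_dir) / slug_from_link(link) / filename


print(comment_path(
    "comments",
    "http://blog.minchin.ca/2016/12/blogger-comments-exported.html",
    "tag:blogger.com,1999:blog-26967745.post-4115122471434984978",
))
```

If a post's slug was cleaned up during the move, the folder produced this way will not match the new slug, which is exactly the renaming chore described above.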
In other cleanup, I had also changed my name (as the default blog author) during the initial export of the blog, so I had to change that anywhere it appeared. As a final touch, I brought over some of the profile pictures of the other commenters, where available (this is what the authors.txt file, generated by lines 150-160, is designed to help with). These are configured as follows in my main pelicanconf.py file (the configuration file for Pelican):
PELICAN_COMMENT_SYSTEM_AUTHORS = {
    ('PROTIK KHAN', 'noreply@blogger.com'): "images/authors/rabiul_karim.webp",
    ('Matthew Hartzell', 'noreply@blogger.com'): "images/authors/matthew_hartzell.webp",
    ('Jens-Peter Labus', 'noreply@blogger.com'): "images/authors/jens-peter_labus.png",
    ('Bridget', 'noreply@blogger.com'): "images/authors/bridget.jpg",
    ('melissaclee', 'noreply@blogger.com'): "images/authors/melissa_lee.jpg",
    ('Melissa', 'noreply@blogger.com'): "images/authors/melissa_lee.jpg"
}
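As a rough sketch of how authors.txt might give you a head start on that dictionary: each line written by export_authors() is a name, two tabs, and a picture URL. The helper name, the sample line, and its URL below are invented, and the 'noreply@blogger.com' default is assumed from the entries above. The values it produces are the remote URLs, so the local image paths still have to be filled in by hand after downloading the pictures.

```python
def skeleton_from_authors(text, email="noreply@blogger.com"):
    """Map (name, email) keys to the picture URL listed in authors.txt."""
    skeleton = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # export_authors() separates name and URL with two tabs.
        name, _, pic_url = line.partition("\t\t")
        skeleton[(name, email)] = pic_url
    return skeleton


# Invented sample in the format written by export_authors().
sample = "Bridget\t\thttp://example.com/bridget.jpg\n"
for key, url in skeleton_from_authors(sample).items():
    print(key, "->", url)
```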
Hopefully some of this code will prove useful to someone else dealing with their Blogger export.
The code has also been posted as a gist on GitHub, so you’re welcome to submit improvements as well. If you want to download the code, this may be the simplest place to get it from.
Code is under the MIT license.
Comments
To make this nice script more discoverable for new Pelican Comment System users, why don’t you add it to the plug-in repository?
Maybe like this would be sensible:
And by the way nice theme!
Pull request created!