I Like RPGs
Growing up I played a lot of console RPGs. If you’re a die-hard RPG fan then you might be familiar with how it starts: you play a little Super Mario RPG and Chrono Trigger as a kid and get a taste for it, you end up running through all of the Tales Of and Final Fantasy games, and before long you’re scrutinizing the differences between the battle systems of Shadow Hearts and Grandia like a wine connoisseur babbling about the legs on the side of their Bordeaux glass. But then, like that lush that’s shotgunning boxed wine on the subway, you eventually take the addiction too far and end up digging through the discount bins in your local GameCrazy and playing whatever you can get: stuff like Beyond the Beyond, Thousand Arms, and Evolution.
My opinion of that genre has cooled considerably over time, but there is obviously still room to make an RPG that is really amazing and that has valuable things to say in an interesting way. Partly out of a sense of nostalgia, and partly because I’d like to try contributing something to a form that has given me so much joy over the years, I am going to try to make my own little spin on the RPG genre! It’s not like I have absolutely no experience making RPGs.
The general idea of the game is still in its formative stages in my head, but one thing I know for certain is that the approach I want to take requires a ton of different characters, and therefore an equally large number of character portraits. What makes this difficult is that when I say “a ton” I don’t mean something like the 108 Stars of Destiny; I mean numbers in the thousands. So even if I were capable of drawing (I am not yet), it would probably be impractical for me to draw 2,000 character portraits. So what to do?
Finding Character Portraits
I was hoping that I would be able to find a nice online resource of freely available RPG character portraits like I remembered from the good old days of playing around with RPG Maker. Unfortunately, it seems like a lot of those places simply don’t exist any longer, so I’ve come to the conclusion that I need to “make” my own en masse. In this case “make” means “grab faces from art old enough to be in the public domain.”
That’s the thought that led me to WikiArt.org, a great site which at first glance seems to have about a thousand times more pieces of artwork than I even knew existed. I looked around and couldn’t find any mention of an API. The closest thing I found suggested that they used to provide a useful JSON file full of all of their data (as per this old code that relied upon it) but no longer seem to. So I guess I’m going to have to crawl around this site and pull all of the info off myself, which is fortunately pretty fun!
Scraping WikiArt with C#
I’ve used Python and BeautifulSoup in the past to scrape websites (for example in the poetry section of my old WakeUp project and my dumb little Randcamp toy) and they have been so easy and pleasant to use that it almost felt like cheating to use them again. I wanted to try something else to see how it compared. Since I use C# at work anyway but have never really had the chance to scrape websites with it in that context, I might as well give that a shot!
The best tool I found for the job was the HTML Agility Pack which, despite having a somewhat ugly and boring name, is actually rather fun and easy to work with! After NuGetting the library and setting up a project, grabbing a page’s contents is incredibly easy:
HtmlWeb webHandler = new HtmlWeb();
HtmlDocument doc = webHandler.Load(new Uri("https://www.wikiart.org/en/claude-monet/the-promenade-woman-with-a-parasol"));
And parsing through them to find the desired elements is also very easy:
// GetAttributeValue avoids a NullReferenceException on <img> tags that have no itemprop attribute.
this.ImageURL = new Uri(WebUtility.HtmlDecode(doc.DocumentNode.Descendants("img").First(i => i.GetAttributeValue("itemprop", "") == "image").Attributes["src"].Value));
And lastly, pulling the image is naturally quite simple:
using (WebClient wc = new WebClient())
using (Stream s = wc.OpenRead(this.ImageURL))
using (Bitmap image = new Bitmap(s)) // dispose the bitmap once it has been written
{
    image.Save("C:/WikiArt/image.jpg");
}
With the code being this straightforward, scraping the site was a somewhat simple process:
1. Grab the list of artists (I had to do this by letter). You can see this in the WikiArtApi.GetArtistsByLetter method of the WikiArtApi library.
2. For each artist, grab the list of their works. Fortunately, these URLs are logical, so I can deduce them simply by knowing the artist’s name, which I found in Step 1. I do this in Artist.GetAllArtworks.
3. For each of these artworks, navigate to the artwork’s page and find the location of the image. This is done in Artwork.PullDetailsFromPage. I was hoping that, having just the artworks’ names, I would be able to skip this step and navigate straight to the image location (similarly to how I avoided hitting the artists’ pages and skipped straight to their lists of works); unfortunately, as I will detail below, it simply wasn’t possible.
4. Download the image from the URL acquired in Step 3.
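To make the shape of that loop a little more concrete, here is a rough sketch of how the four steps nest. It is not the actual WikiArtApi code: the CrawlSketch class and its delegate parameters are hypothetical stand-ins for the real methods named above, whose signatures may well differ, and the pause parameter is just my assumption about how the waiting between calls might be wired in.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical sketch: each delegate stands in for one step of the crawl.
public static class CrawlSketch
{
    public static List<Uri> CollectImageUrls(
        IEnumerable<char> letters,
        Func<char, IEnumerable<string>> getArtistsByLetter, // Step 1
        Func<string, IEnumerable<Uri>> getArtworkPages,     // Step 2
        Func<Uri, Uri> findImageUrl,                        // Step 3
        TimeSpan pause)                                     // wait between page requests
    {
        var imageUrls = new List<Uri>();
        foreach (char letter in letters)
        {
            foreach (string artist in getArtistsByLetter(letter))
            {
                foreach (Uri page in getArtworkPages(artist))
                {
                    Uri image = findImageUrl(page);
                    if (image != null)
                    {
                        imageUrls.Add(image);
                    }
                    Thread.Sleep(pause); // be gentle with the server
                }
            }
        }
        return imageUrls; // Step 4 would download each of these.
    }
}
```

Passing the steps in as delegates keeps the sketch independent of any particular HTML parser, and it also makes the loop easy to exercise with fake data before pointing it at a real site.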
Of course, scraping a site is never as easy as you think it will be, and there are always little idiosyncrasies that cause problems. Although WikiArt is a pretty well-organized site, there were still some things that were confusing and/or peculiar.
Some Things That Were Confusing and/or Peculiar
Inconsistent Artwork Page URLs vs. Artwork Filenames
Sometimes the page for a piece of artwork and the artwork’s filename are consistent, like Grant Wood’s American Gothic.
- Summary Page URL: https://www.wikiart.org/en/grant-wood/american-gothic-1930
- Image URL: https://uploads2.wikiart.org/images/grant-wood/american-gothic-1930.jpg
However, sometimes the page and image names are not consistent at all, as is the case with this Byzantine Mosaic:
- Summary Page URL: https://www.wikiart.org/en/byzantine-mosaics/erzbischofliche-kapelle-425
- Image URL: https://uploads2.wikiart.org/00211/images/byzantine-mosaics/ravenna-cappella-arcivescovile-166.jpg
I don’t know what the explanation for this is, but it did make my life more difficult. As mentioned in Step 3 above, this meant that I couldn’t simply take a work’s title (as found on the artist’s works list) and skip straight to the image, but would instead have to actually go to the work’s page and grab the image URL from there. It was only a minor bump in the road, but it did mean that I would need to send practically twice as many requests to WikiArt’s servers.
Pages Sometimes Wouldn’t Load Correctly
Every 10,000 artwork pages or so I would get a very confusing error saying that there was no image on the retrieved page. When I ran the code again, the error disappeared, only to pop up again much later. I couldn’t find any pattern to these errors, but I did grab the HTML for a page from when it worked and when it didn’t work just to compare them; it appears as if the <main ng-controller class="ArtworkViewCtrl"> element simply didn’t populate on the page in the cases where I ran into the problem. I am not sure why that is, but since it only happened about 0.008% of the time I am going to put figuring it out on the back burner.
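Since re-running the code made the error go away, one cheap workaround would be to simply retry the page a couple of times before giving up. The helper below is a minimal sketch of that idea and is not part of the WikiArtApi library: the Retry class, its method name, and its parameters are all made up for illustration, and it assumes that the page-fetching code returns null when it can’t find the image.

```csharp
using System;
using System.Threading;

// Hypothetical helper: retries a fetch that occasionally comes back empty.
public static class Retry
{
    // Calls fetch up to maxAttempts times and returns the first non-null result,
    // sleeping for delay between attempts. Returns null if every attempt fails.
    public static T UntilNotNull<T>(Func<T> fetch, int maxAttempts, TimeSpan delay)
        where T : class
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            T result = fetch();
            if (result != null)
            {
                return result;
            }
            if (attempt < maxAttempts)
            {
                Thread.Sleep(delay); // don't hammer the server while retrying
            }
        }
        return null;
    }
}
```

This doesn’t explain why the element sometimes fails to populate, of course; it just papers over a failure that seems to be transient.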
A Few Artists Have Weird Artworks List Pages
Richard Anuszkiewicz, or his agent, doesn’t want WikiArt to host his images. Yet, unlike a lot of other artists who have gone this route, his artworks list page still contains links: links that go to his personal site instead of WikiArt. These links, naturally, broke my scraper, so I had to edit it to ignore links to non-WikiArt pages. Thanks, Dick.
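For the curious, a filter along these lines would do the trick. This is a sketch rather than the exact check in my scraper: the LinkFilter class is a name invented for this example, and it assumes that anything worth following is either a site-relative link or an http(s) link on wikiart.org or one of its subdomains.

```csharp
using System;

// Hypothetical filter for links found on an artist's works list page.
public static class LinkFilter
{
    public static bool IsWikiArtLink(string href)
    {
        if (string.IsNullOrWhiteSpace(href))
        {
            return false;
        }

        // Site-relative links such as "/en/..." stay on WikiArt by definition.
        // Protocol-relative links ("//host/...") fall through to the host check
        // and are rejected there because they are not absolute URIs.
        if (href.StartsWith("/", StringComparison.Ordinal) &&
            !href.StartsWith("//", StringComparison.Ordinal))
        {
            return true;
        }

        Uri uri;
        if (!Uri.TryCreate(href, UriKind.Absolute, out uri))
        {
            return false;
        }
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
        {
            return false;
        }

        string host = uri.Host;
        return host.Equals("wikiart.org", StringComparison.OrdinalIgnoreCase)
            || host.EndsWith(".wikiart.org", StringComparison.OrdinalIgnoreCase);
    }
}
```

Checking the host with a leading dot in the EndsWith call matters here, since a plain suffix match would also accept an unrelated domain that merely ends in “wikiart.org”.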
Artworks Sometimes Have Year In URL
Sometimes the URL for an artwork will have the year in it, like old-man-sleeping-1629. However, sometimes it won’t, as in old-man-with-turban. This caused me a tiny bit of confusion when I was first working out the patterns for the URLs, since I was looking for some level of consistency in the filenames (before I realized it didn’t exist). An interesting tidbit about this: the URLs that contain the year by default (like the aforementioned Old Man Sleeping) don’t work without the year in them; works that don’t contain the year by default (like Old Man With Turban) actually do work with the year included (but only if the year is correct), automatically redirecting to the correct, yearless URL. That is cool and I’m surprised they bothered with it, since I imagine it involves unnecessary mucking around with Angular’s routing.
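If you ever need to tell the two kinds of URL apart in code, a small regular expression is enough. The sketch below is purely illustrative (the ArtworkSlug class is hypothetical and not part of my library), and it assumes that a “year” is a trailing hyphen followed by three or four digits. That assumption can misfire, since nothing guarantees that trailing digits in a WikiArt slug are actually a year.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for spotting a trailing year in an artwork slug.
public static class ArtworkSlug
{
    // Assumption: a year is a final "-" followed by three or four ASCII digits.
    private static readonly Regex TrailingYear = new Regex(@"-([0-9]{3,4})$");

    // Returns true and sets year when the slug ends in something year-like.
    public static bool TryGetYear(string slug, out int year)
    {
        Match match = TrailingYear.Match(slug);
        year = match.Success ? int.Parse(match.Groups[1].Value) : 0;
        return match.Success;
    }

    // Strips the trailing year, if any, and leaves other slugs untouched.
    public static string WithoutYear(string slug)
    {
        return TrailingYear.Replace(slug, string.Empty);
    }
}
```

With this, ArtworkSlug.TryGetYear("old-man-sleeping-1629", out year) succeeds with 1629, while "old-man-with-turban" reports no year at all.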
S Tags?
This didn’t affect my scraping at all, but I still found it interesting. I don’t think I’ve ever actually seen <s> tags before, and I find it peculiar that WikiArt uses them to signify labels for artwork/artist metadata. Then again, people use <i> for icons all the time even though that’s not correct either, so I guess the rules don’t always matter.
The Result
After running the code for quite a long while (with a lot of waiting between calls so I didn’t hit WikiArt’s server too hard) I finally managed to scrape every last bit of WikiArt and pull all of the images: all 10GB and 150,000 of them. While this satisfies the datahoarder within me, it doesn’t really solve my RPG character portrait problem all on its own. Now I need to… pull the faces from them all. I guess we’ll get to that in the next installment.
A link to the code that I threw together to pull these data and images from WikiArt.org, as well as a library .dll for anyone interested in just using it, can be found on my WikiArtApi project page. WikiArt has obviously been indispensable to me for this process, so I implore everyone to donate to them if possible!