Scraping

(This article is currently incomplete. I will work on it more later.)

Using sed and curl to extract information from webpages

—-

Introduction

I'll go over a bit more technicalities in my article on grep, but I wanted to discuss a practical use of UNIX command-line tools. This is also in relation to data-hoarding, but we'll get into that.

The problem

Detective Conan is a very long-running anime. It is basiscally season-less and has been running more-or-less continuously since 1996. As of the time of writing, there are 1036 episodes out, and every week that number ticks up by one. It's also not a masterpiece, so about 42% of that is filler. The easy solution is to just not watch it, but I think we can do better. Here is how I found a list of canon/plot-relevant episodes, and extracted that into something to be used with various other tools at my disposal– and how you can too!

=

First we google (or bing, or duck, or brave) for "detective conan list of important episodes". This is the top result on Google and Brave, and is thankfully also the one I used in the past to come up with the following procedure.

In the browser, it's a pretty good page. Decorated but tasteful, modern but not bloated. Run basically by one guy, who strives to not use JavaScript as much as possible (an honorable goal).

But we're working in the command line. And on the command line, everything is reduced to plaintext. If you're lucky enough to not know what HTML looks like, here's a sample from the aformentioned page (obtained with curl 'https://www.xerblade.com/p/detective-conan-important-episode-list.html?m=0'):

  <span class='dc-list-pos'><span class='dc-list-no'>=Main Plot=</span></span></p></li><li><p><span class='dc-list-pos'><span class='dc-list-no'>971-974</span> [INTL 1028-1031] (manga 1013-1017|V95F10-V96F3):</span> <span class='dc-list-char'>Detective <span lang='ja-Latn'>Chiba</span></span> and <span class='dc-list-char' lang='ja-Latn'>Miike</span> development. <span class='dc-list-char' lang='ja-Latn'>Kuroda</span> development. <span class='dc-list-char' lang='ja-Latn'>Amuro</span> development. Plus a development for <span class='dc-list-char' lang='ja-Latn'>Yumi</span> and <span class='dc-list-char' lang='ja-Latn'>Shūkichi</span> in the background. <span class='dc-list-pos'><span class='dc-list-no'>=Main Plot=</span></span></p></li><li><p><span class='dc-list-pos'><span class='dc-list-no'>983-984</span> [INTL 1040-1041] (manga 1018-1021|V96F4-7):</span> <span class='dc-list-char' lang='ja-Latn'>Heiji</span> case, <span class='dc-list-char' lang='ja-Latn'>Kaitō Kid</span>, <span class='dc-list-char' lang='ja-Latn'>Morofushi</span> development, <span class='dc-list-char' lang='ja-Latn'>Amuro</span> development, and <span class='dc-list-char' lang='ja-Latn'>Hiromitsu</span> development. <span class='dc-list-pos'><span class='dc-list-no'>=Main Plot=</span></span></p></li><li><p><span class='dc-list-pos'><span class='dc-list-no'>993-995</span> [INTL 1050-1052] (manga 1022-1026|V96F8-V97F1):</span> <span class='dc-list-char' lang='ja-Latn'>Sera Masumi</span> development. Also, <

Sorry. You shouldn't have had to endure that. In fact, you probably didn't. I bet your eyes glazed over. In fact, this is just a segment from near the end of a huge pagagraph which actually contains everything we've been looking for. Obviously, just reading it like this isn't really doable. Let's chop it up into some better pieces.

sed is a stream editor. You give it a stream of text, and it applies a rule or conversion to the text according to the expression you put after it (the first argument.) It is often used for search-and-replace.

For example, we can convert every instance of >< to, say, > < or better yet to

>
<

Here's the command now:

  curl 'https://www.xerblade.com/p/detective-conan-important-episode-list.html?m=0' | sed 's/></>\n</g'

The s/thing1/thing2/g tells sed to _S_ubstitute thing1 with thing2 _G_lobally, every time in the line. Without the g it would only do the first one (which is useful but not here). That \n is what's called a regular expression– this one specifically means "newline". You often can't type an actual Enter character on the command line because it'll try to execute, so using \n works better.

Anyway.

Now the bits of that paragraph each look something like this:

  <span class='dc-list-pos'>
    <span class='dc-list-no'>971-974</span> [INTL 1028-1031] (manga 1013-1017|V95F10-V96F3):</span> <span class='dc-list-char'>Detective <span lang='ja-Latn'>Chiba</span>
  </span> and <span class='dc-list-char' lang='ja-Latn'>Miike</span> development. <span class='dc-list-char' lang='ja-Latn'>Kuroda</span> development. <span class='dc-list-char' lang='ja-Latn'>Amuro</span> development. Plus a development for <span class='dc-list-char' lang='ja-Latn'>Yumi</span> and <span class='dc-list-char' lang='ja-Latn'>Shūkichi</span> in the background. <span class='dc-list-pos'>
    <span class='dc-list-no'>=Main Plot=</span>
  </span>
  </p>
  </li>
  <li>
    <p>

Not completely immune to overlong lines, but it's something. Now pay attention to that line that says class ='dc-list-no'>971-974</span>. If we go back to the browser and control-F '971-974' we find… that there are no results because the page is portioned out per hundred episodes. Let's click on the 900s tab and try again.

Aaand it looks like this:

971-974 [INTL 1028-1031] (manga 1013-1017|V95F10-V96F3): Detective Chiba and Miike development. Kuroda development. Amuro development. Plus a development for Yumi and Shūkichi in the background. \**Main Plot*\*

As you can see, the author has very kindly left a keyword next to every episode number or range that does not appear in the page itself: dc-list-no. Let's try grep ing the whole mess for that string.

  curl 'https://www.xerblade.com/p/detective-conan-important-episode-list.html?m=0' | sed 's/></>\n</g' | grep 'dc-list-no'

Now we end up with blocks that look like

  <span class='dc-list-no'>971-974</span> [INTL 1028-1031] (manga 1013-1017|V95F10-V96F3):</span> <span class='dc-list-char'>Detective <span lang='ja-Latn'>Chiba</span>
    <span class='dc-list-no'>*Main Plot*</span>

Hmmm. Not perfect. But we're getting there. Most lines now either contain the episode numbers like we want, or not much numbers at all. There are some like

  <span class='dc-list-no'>(Both 1 Hour Specials)</span>

but that's probably alright as you'll see.

—-

What we can do now is filter for the numbers themselves. We know that there are >< around the numbers, and they can be in either XYY-XYZ format, or just a single number.

In extended regex, a sequence in [square brackets] denotes a set of possible matches of each character. So, for example, [ckq] will match a lowercase c, or a lowercase k, or a lowercase q. The spec also allows for comon ranges, like [A-Z] (any uppercase letter), [a-z] (any lowercase letter), or for our use case, [0-9], any single number digit.

So we can

  curl 'https://www.xerblade.com/p/detective-conan-important-episode-list.html?m=0' | sed 's/></>\n</g' | grep 'dc-list-no' | sed 's/

… This is never getting finished. I don't have the time or energy.

Here's a better version by the legendary Chokerman:

  curl -s 'https://www.xerblade.com/p/detective-conan-important-episode-list.html?m=0'|grep -Eo "dc-list-no\'>[0-9].*?<"|sed -nr 's/.*>(.*)<.*/\1/p'

(inciting incident here)