HTML Scraping to PDF

February 21, 2020

I was ordering shirts on behalf of a client and wanted to send them a color chart. Rather than send them to the corporate website where the shirts are purchased (which required a login), I thought I would use Photoshop to drag-and-drop the shirt preview images into a document and then export it to PDF. After 20 minutes of dragging and dropping, I knew there had to be a better way.

Analyzing the Site

When loading the site, tabs appear under each product. These tabs include a description as well as a Colors and Sizes tab. Each color “dot” has a corresponding shirt image associated with it.

<div class="colorOuter selected">
     <a href="https://example.com/blue/image.jpg" rel="zoom-id:mzoom" rev="https://example.com/blue/image-highres.jpg" class="descTooltip colorswch" data-colorcode="381" data-desc="Blue" style="outline: 0px; display: inline-block;">
      <div class="colorInnerWrap">                   
       <div style="background-color:#039543;" class="colorInner">
       </div> 
      </div>
     </a>
 </div>

Digging into the code I found that each link that loads the shirt image has all of the data I need associated with it as HTML data attributes: data-desc for the color name and rev for the URL to the shirt image.

Building a script for automation

The easiest way to process the data was to first grab the HTML, which I did easily enough with curl:

# Find URL and dump into HTML doc
curl https://example.com/url/to/the/page.html > page.html

I then used grep to pull out every line in the document containing the all-important rev= attribute; those lines carry both my shirt color and the image URL.

# Grep any lines with "rev=" into its own HTML doc
grep 'rev=' page.html > links.html

I found that the HTML curl pulled down had certain characters encoded as numeric HTML entities, like /, :, and .. So my rev URL looked like

https&#58;&#47;&#47;example&#46;com&#47;image&#46;jpg

instead of

https://example.com/image.jpg

There are a lot of ways to find and replace, but I found the easiest was to open the file in vim and do it there. I'm sure you can find a smoother alternative.

In vim, the %s command runs a substitution across every line in the file. The text between the first pair of slashes is the encoded entity we want to change back to a normal UTF-8 character (&#47; = /). The /g at the end of the line tells vim to replace all occurrences inside a line, not just the first.

vim links.html
:%s/&#47;/\//g 
:%s/&#46;/\./g 
:%s/&#58;/\:/g 
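
Speaking of smoother alternatives: the same three substitutions can be done non-interactively with sed. This is a sketch that writes the cleaned result to a new file (links-clean.html is my own name for it, not part of the original workflow):

```shell
# Decode the numeric HTML entities for :, /, and . without opening an editor
sed 's/&#58;/:/g; s/&#47;/\//g; s/&#46;/./g' links.html > links-clean.html
```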

Once I've searched and replaced all instances of the HTML entities, I have something like this:

<a href="https://example.com/image.jpg" rev="https://example.com/image-highres.jpg" rel="zoom-id:mzoom">
	<img src='https://example.com/image.jpg' alt="">
</a> 

In order to build the color chart, we first need to download all the images. Later in the process I will use the wonderful ImageMagick montage command to place all of my images in a grid inside the PDF. ImageMagick commands can use the filename to place a label around an image, so I want to capture the name and URL of each image for safekeeping.

As an intermediate step before I start to curl all the high-res images from the server, I can store the necessary data inside a JSON string.

# This counts each line via wc -l and then prints it out, storing it inside the variable LCT
LCT=$(wc -l links.html | awk '{print $1}')

# Create a counter set to 0
CTR=0

# We’re manually creating a JSON String (probably better way to do it...)
JSTR="["

#start looping through lines inside links.html
while IFS= read -r line
	do
	# Increase our count by 1
	CTR=$(( CTR + 1 ))

	# echo the line, grab rev= and everything inside the quotes, then strip rev= and the quotes.
	# This leaves us with just the raw image URL.
	imgurl=$(echo "$line" | grep -Eo 'rev="[^"]+"' | sed 's/rev=//g' | sed 's/"//g')

	# echo the line, grab data-desc and everything inside the quotes, then strip data-desc= and the quotes.
	# The same as above, but for the color name.
	imgname=$(echo "$line" | grep -Eo 'data-desc="[^"]+"' | sed 's/data-desc=//g' | sed 's/"//g')

	#Assemble JSON string by appending previous JSTR and then the vars wrapped in quotes
	JSTR="${JSTR}[\"${imgurl}\",\"${imgname}\"]"

	#if we have NOT reached the end of our document, add a comma to separate JSON arrays
	if [[ "$CTR" -ne "$LCT" ]]; then
	    JSTR="${JSTR},"
	fi
done < links.html #Feeding links into loop

JSTR="${JSTR}]" # end JSON string

In the above code, we create a bash variable JSTR and store a string of JSON inside it. The reason for the JSON is that we can parse it later using bash and the jq command.

After running the above, we have the JSTR variable stored, with our data inside. The next step is to output it into a file:

echo "$JSTR" > data.json
[
	["https://example.com/aslid798/image-highres.jpg","Evergreen"],
	["https://example.com/54sdhw/image-highres.jpg","Blue"],
	...
]

With our newly generated JSON file, which we can parse using jq, we can now download the files and name them based on the color names found in the HTML.

But first, I am sure you are asking: why save to JSON and then parse? Why not just execute curl inside the while loop? You can absolutely do that, but for my needs, I wanted a place to review the data and make sure that everything was coming back correctly. There could also be some colors I do not want on the chart; in that case, I can remove them from the JSON file manually.
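
That review step is easy with jq itself. For example, counting the entries and listing just the color names gives a quick sanity check on data.json:

```shell
# How many color entries did we capture?
jq 'length' data.json

# List only the color names (the second element of each pair)
jq -r '.[][1]' data.json
```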

Generating a PDF

Now we have our JSON file that contains a name and URL for each shirt.

#Now we have our JSON generated. Time to download the files:
jq -c '.[]' data.json | while read i; do
    URL=$(echo $i | jq '.[0]');
    NAME=$(echo $i | jq '.[1]');
    N=$(echo "$NAME" | sed -e 's/^"//' -e 's/"$//');
    U=$(echo "$URL" | sed -e 's/^"//' -e 's/"$//');
    echo "${N}.jpg -> ${U}";
    curl -o "${N}.jpg" "$U";
done

jq is a fantastic command-line tool that lets bash parse JSON with ease. I'm able to loop through each array inside the outer array and grab the NAME and URL variables. I have to process them to strip the surrounding JSON quotes, assigning the final values to N and U. I echo a message to the screen (Blue.jpg -> https://example.com/54sdhw/image-highres.jpg) and then send the variables to curl.
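
As an aside, jq's -r flag emits raw strings, so the quote-stripping sed calls could be skipped entirely. Here is a sketch of an equivalent loop using jq's @tsv filter:

```shell
# @tsv turns each ["url","name"] pair into raw tab-separated text,
# so U and N arrive unquoted and ready to hand straight to curl
jq -r '.[] | @tsv' data.json | while IFS=$'\t' read -r U N; do
    echo "${N}.jpg -> ${U}"
    curl -o "${N}.jpg" "$U"
done
```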

After curl runs, the directory will have all of the files named after their respective colors (e.g., Blue.jpg). The only thing left is to tell montage to use the filename (without the extension) as the label and to arrange the images in a grid.

montage -label '%t' *.jpg -tile 4x4 \
	-geometry '300x367+20+20>' \
	-border 5 \
	-pointsize 30 NAME-color-chart.pdf

Here is a rundown of each parameter:

  • label '%t' sets each image's label in the PDF to its filename without the extension.
  • tile sets the grid dimensions (columns by rows); 4x4 here.
  • geometry sets the size of each cell plus its padding; the trailing > shrinks larger images to fit but never enlarges smaller ones.
  • border draws a border (5px here) around each image.
  • pointsize sets the font size of the label.

Once montage has run, the color chart is done! There should now be a PDF inside the directory with the name you indicated.

Preview of one of my color charts

programming #bash #vim #html scraping
