citruspi

A Docset for PowerShell

Building a Docset for PowerShell Cmdlets

Last week, a friend and I had a conversation that went something like this:

him: http://kapeli.com/docset_links
me: Do you use Dash?
him: I used to
him: I am going to use now for c# i guess
him: I am trying to see if they have powershell cmdlets
me: I use it, particularly for languages I'm learning to look at the usage for different functions.
me: I feel like you could create a docset for powershell cmdlets if there isn't one.
him: it has all the things except for what I am looking for
him: yeah but I am lazy i want someone else to do that

Well, challenge accepted.

After a quick online search to make sure I wasn't duplicating an existing effort, I started building a docset for Dash that documented the PowerShell cmdlets.

Later that evening, or rather a couple of hours into the next morning, I had a working script that produced exactly that.

Cmdlet Reference Pages

I started by browsing the Scripting with Windows PowerShell documentation, trying to determine the best way of finding all the documentation on cmdlets.

I decided that the easiest way, for now, would be to find all the cmdlet reference pages. These are pages that have a list of different cmdlets, with

  1. a direct link to the documentation
  2. a name and
  3. a description

for each cmdlet.

I ended up compiling a YAML list of over 140 of these reference pages. In hindsight I wish I'd just taken some extra time to write the code required to scrape the root documentation page and find all the cmdlet reference pages.

For each of these reference pages I stored the page's name and link.

...
- name: CoreModulesHost
  url: https://technet.microsoft.com/en-us/library/hh849689(v=wps.640).aspx
- name: CoreModulesManagement
  url: https://technet.microsoft.com/en-us/library/hh849827(v=wps.640).aspx
- name: CoreModulesODataUtils
  url: https://technet.microsoft.com/en-us/library/dn818506(v=wps.640).aspx
- name: CoreModulesSecurity
  url: https://technet.microsoft.com/en-us/library/hh849807(v=wps.640).aspx
...

Indexing Reference Pages

Once I had the reference pages, I started working on the code to collect the individual cmdlets. Instead of taking the logical route and scraping the table on each page, I used Beautiful Soup to scan the table of contents on the left side for elements with data-toclevel=2, which marked a link to a page documenting a cmdlet.

import requests
from bs4 import BeautifulSoup

# `indexes` is the list of reference pages (name and url) from the YAML file.
entries = []

for index in indexes:
    r = requests.get(index['url'])

    if r.status_code == 200:
        soup = BeautifulSoup(r.content, 'html.parser')

        # Level-2 entries in the table of contents link to individual cmdlets.
        for div in soup.find_all('div', attrs={'data-toclevel': '2'}):
            try:
                entries.append({
                    'link': div.a.attrs['href'].strip(),
                    'title': div.a.attrs['title'],
                    'path': div.a.attrs['title'] + '.html'
                })
            except (AttributeError, KeyError):
                # Skip divs without a link, or links without href/title.
                pass

    else:
        print('Failed to index {index} ({code})'.format(index=index['name'],
                                                        code=r.status_code))
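
The same table-of-contents scan can also be sketched with nothing but the standard library's html.parser, which makes the idea easy to try without a network connection. This is an illustration, not the script's actual code: the TocLinkParser class and the sample markup are invented for the example, and the real pages may nest their elements differently.

```python
from html.parser import HTMLParser


class TocLinkParser(HTMLParser):
    """Collect the first link inside each <div data-toclevel="2">."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._depth = 0      # <div> nesting inside a level-2 div (0 = outside)
        self._taken = False  # whether the current level-2 div already gave a link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            if self._depth:
                self._depth += 1
            elif attrs.get('data-toclevel') == '2':
                self._depth = 1
                self._taken = False
        elif tag == 'a' and self._depth and not self._taken:
            if 'href' in attrs and 'title' in attrs:
                self.entries.append({
                    'link': attrs['href'].strip(),
                    'title': attrs['title'],
                    'path': attrs['title'] + '.html',
                })
                self._taken = True

    def handle_endtag(self, tag):
        if tag == 'div' and self._depth:
            self._depth -= 1


# Hypothetical markup, loosely modelled on the table of contents described above.
SAMPLE = '''
<div data-toclevel="1"><a href="/library/index" title="Core">Core</a></div>
<div data-toclevel="2"><a href=" /library/get-help " title="Get-Help">Get-Help</a></div>
<div data-toclevel="2"><a href="/library/get-item" title="Get-Item">Get-Item</a></div>
'''

parser = TocLinkParser()
parser.feed(SAMPLE)
toc_entries = parser.entries
```

Only the two level-2 divs contribute entries here; the level-1 link is ignored, just as in the Beautiful Soup version.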

Downloading the Documentation

This was arguably the easiest part. Once I had the list of all the cmdlet documentation pages, I looped over it and downloaded each page.

for entry in entries:
    r = requests.get(entry['link'])

    if r.status_code == 200:
        # r.content is raw bytes, so the file is opened in binary mode.
        with open(entry['path'], 'wb') as f:
            f.write(r.content)

    else:
        print('Failed to download "{title}" ({code})'.format(
            code=r.status_code,
            title=entry['title']))
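
The download loop builds each file name directly from the page title. That works as long as every title contains only characters that are valid in a file name, which is an assumption rather than something the documentation guarantees. A minimal sketch of a sanitising helper follows; safe_filename is a hypothetical name and is not part of the original script.

```python
import re


def safe_filename(title, extension='.html'):
    """Turn a page title into a file name that is safe on common file systems."""
    # Replace every run of characters other than letters, digits,
    # dot, underscore or hyphen with a single underscore.
    cleaned = re.sub(r'[^A-Za-z0-9._-]+', '_', title.strip())
    # Fall back to a placeholder if nothing usable is left.
    cleaned = cleaned.strip('._') or 'untitled'
    return cleaned + extension
```

Ordinary cmdlet names such as Get-Help pass through unchanged, so the helper would only matter for unusual titles. If it were used, the same function would have to produce the 'path' value at indexing time so that later steps find the files.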

Rewriting the Documentation

Once I had downloaded all the pages documenting PowerShell cmdlets, I quickly discovered I had two problems.

Busy Documentation

The pages included the title bar and search field at the top, the table of contents on the left, the feedback section at the bottom, and so on. These were not only unnecessary, they also made the pages busy and harder to read.

I opened up the web inspector and made a list of all the elements I wanted to remove. With that list in hand, I looped over each of the documentation pages and used Beautiful Soup to find and remove those elements.

# IDs (#) and classes (.) of the page elements to strip.
unnecessary = [
    '#megabladeContainer',
    '#ux-header',
    '#isd_print',
    '#isd_printABook',
    '#expandCollapseAll',
    '#leftNav',
    '.feedbackContainer',
    '.communityContentContainer',
    '#ux-footer'
]

for entry in entries:
    with open(entry['path'], 'r+') as source:
        soup = BeautifulSoup(source.read(), 'html.parser')

        for u in unnecessary:
            if u[0] == '#':
                element = soup.find(id=u[1:])
                if element is not None:
                    element.decompose()

            elif u[0] == '.':
                for element in soup.find_all('div', class_=u[1:]):
                    element.decompose()

        # Overwrite the file in place with the cleaned markup.
        source.seek(0)
        source.write(str(soup))
        source.truncate()

External Links

The documentation now looked clean, but I still had a problem. Links in the source code used absolute paths instead of relative ones, which meant that every link to another cmdlet opened the documentation in the web browser. This led to

  1. a poor experience
  2. broken links when offline

Once again, Beautiful Soup came to the rescue. I opened up each of the downloaded documents, looked for links to other downloaded documents, and replaced them.

# Map each original URL to the local file it was saved as.
local_paths = {entry['link']: entry['path'] for entry in entries}

for entry in entries:
    with open(entry['path'], 'r+') as source:
        soup = BeautifulSoup(source.read(), 'html.parser')

        # Point links to other downloaded pages at the local copies.
        for link in soup.find_all('a'):
            href = link.attrs.get('href')
            if href in local_paths:
                link.attrs['href'] = local_paths[href]

        source.seek(0)
        source.write(str(soup))
        source.truncate()

The Docset

Once I had all the entries indexed and downloaded, I set about actually creating the docset. Following the instructions on the developer's website, I created the directory structure.

$ mkdir -p PowerShell.docset/Contents/Resources/Documents/

I placed Info.plist in PowerShell.docset/Contents.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>CFBundleIdentifier</key>
    <string>powershell</string>
    <key>CFBundleName</key>
    <string>PowerShell</string>
    <key>DocSetPlatformFamily</key>
    <string>powershell</string>
    <key>isDashDocset</key>
    <true/>
</dict>
</plist>
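
The same property list could also be generated from the build script rather than kept as a hand-written file. A minimal sketch using Python 3's plistlib, assuming the four keys above are all the docset needs:

```python
import plistlib

# The same keys and values as the hand-written Info.plist.
info = {
    'CFBundleIdentifier': 'powershell',
    'CFBundleName': 'PowerShell',
    'DocSetPlatformFamily': 'powershell',
    'isDashDocset': True,
}

# plistlib.dumps returns the XML property list as bytes.
plist_bytes = plistlib.dumps(info)

# Reading the bytes back should reproduce the same dictionary.
round_trip = plistlib.loads(plist_bytes)
```

Writing plist_bytes to PowerShell.docset/Contents/Info.plist in binary mode would give an equivalent file, although plistlib sorts the keys and may lay out the whitespace differently.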

I copied the downloaded documentation into the required directory and then used sqlite3 to create and populate the necessary tables.

import sqlite3

path = 'PowerShell.docset/Contents/Resources/docSet.dsidx'

database = sqlite3.connect(path)
cursor = database.cursor()

# Start from a clean index on every build.
cursor.execute('DROP TABLE IF EXISTS searchIndex;')

cursor.execute('CREATE TABLE searchIndex(id INTEGER PRIMARY KEY, name TEXT, type TEXT, path TEXT);')
cursor.execute('CREATE UNIQUE INDEX anchor ON searchIndex (name, type, path);')

# Every cmdlet is indexed under the "Command" entry type.
inserts = [(entry['title'], 'Command', entry['path']) for entry in entries]

cursor.executemany('INSERT INTO searchIndex(name, type, path) VALUES (?,?,?)', inserts)

database.commit()
database.close()
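
One thing to watch with the unique index: if the same cmdlet turns up on two reference pages, a plain INSERT raises an IntegrityError on the second copy. Whether that happens with the real data is an assumption, but SQLite's INSERT OR IGNORE sidesteps it. The sketch below uses an in-memory database and made-up entries to show the effect, followed by a prefix lookup of the kind a search box might run.

```python
import sqlite3

# Illustrative entries; the real list comes from the indexing step.
sample_entries = [
    {'title': 'Get-Help', 'path': 'Get-Help.html'},
    {'title': 'Get-Item', 'path': 'Get-Item.html'},
    {'title': 'Set-Item', 'path': 'Set-Item.html'},
    {'title': 'Get-Help', 'path': 'Get-Help.html'},  # duplicate on purpose
]

database = sqlite3.connect(':memory:')
cursor = database.cursor()
cursor.execute('CREATE TABLE searchIndex(id INTEGER PRIMARY KEY, name TEXT, type TEXT, path TEXT);')
cursor.execute('CREATE UNIQUE INDEX anchor ON searchIndex (name, type, path);')

# INSERT OR IGNORE skips rows that would violate the unique index.
cursor.executemany(
    'INSERT OR IGNORE INTO searchIndex(name, type, path) VALUES (?, ?, ?)',
    [(e['title'], 'Command', e['path']) for e in sample_entries])
database.commit()

# Look up every indexed cmdlet whose name starts with "Get-".
cursor.execute(
    'SELECT name, path FROM searchIndex WHERE name LIKE ? ORDER BY name',
    ('Get-%',))
matches = cursor.fetchall()

total = cursor.execute('SELECT COUNT(*) FROM searchIndex').fetchone()[0]
database.close()
```

The duplicate row is dropped silently, so the table ends up with three rows and the lookup returns Get-Help and Get-Item.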

I was then able to load PowerShell.docset into Dash and browse and search the PowerShell cmdlet documentation.

Current State

Source Code

The code is available on GitHub and dedicated to the public domain.

Assuming you have Python and pip installed, building it is as easy as running

$ make dependencies
$ make

once you've cloned the repository.

Stats

When I last built it, on the 11th of March, the build took about 75 minutes and the docset included 4,970 different cmdlets. The end result was ~120 MB raw and ~10 MB once compressed.

Installation

Since manually building it is a pain, I'm hosting a docset feed located at

http://powershell.docset.citruspi.io/feed/

Click here to have Dash subscribe to the feed.

(The feed links to a compressed copy.)

I'd like to put a script together which will rebuild the docset every week, calculate its checksum to determine if something has changed, and automatically update the feed.
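
The change-detection part of that plan might look something like the sketch below. It hashes the files inside the docset rather than the compressed archive, on the assumption that archive checksums can differ between otherwise identical builds because tar and gzip record timestamps. The function names and the idea of keeping the previous checksum in a small text file are illustrative choices, not an existing part of the project.

```python
import hashlib
import os


def tree_checksum(root):
    """Return a SHA-256 digest covering every file path and file body under root."""
    digest = hashlib.sha256()
    for directory, subdirs, files in os.walk(root):
        subdirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            path = os.path.join(directory, name)
            relative = os.path.relpath(path, root).replace(os.sep, '/')
            digest.update(relative.encode('utf-8') + b'\0')
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(65536), b''):
                    digest.update(chunk)
            digest.update(b'\0')
    return digest.hexdigest()


def has_changed(root, checksum_path):
    """Compare root's checksum with the stored one, then store the new value.

    checksum_path must live outside root, otherwise writing it would
    itself change the tree.
    """
    current = tree_checksum(root)
    previous = None
    if os.path.exists(checksum_path):
        with open(checksum_path) as f:
            previous = f.read().strip()
    if current == previous:
        return False
    with open(checksum_path, 'w') as f:
        f.write(current)
    return True
```

A weekly job could rebuild the docset, call has_changed on the Documents directory, and only repackage and update the feed when it returns True.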

Future Plans


Published on 18 March 2015.