jwarwick/PatentCrawler

Note: This is legacy code that almost certainly doesn't scrape the USPTO site correctly anymore. It is stored here only in case the basic code is useful to someone.

PatentCrawler v1.5.11 README

John Warwick, 16 June 2006

PatentCrawler is a tool designed to allow researchers to download patents from the USPTO website and export relevant fields.

IMPORTANT WARNING

The USPTO website is a public resource, and the USPTO actively bans IP blocks that place too much strain on its servers. To this end, PatentCrawler limits the number of patents per day it will attempt to download. From the USPTO website, I infer that downloading fewer than 1000 patents per day will keep you in the safe range. However, if other users at your site are accessing the website, their traffic may count towards your daily total. Use this program at your own risk; I do not know how to get you off the blacklist.

Configuration

The first time you run PatentCrawler you should set two variables, both available under the Configuration tab. First, create an empty folder and specify the path to this folder as your Cache Path (using the Set button). This folder will store the raw HTML of any patents you download. The cache is designed to hold files from multiple search sets, which significantly speeds up your searches if they contain overlapping patent numbers. The cache is also used to generate exported data, so be sure to place it in a location that will not be deleted. Next, set the number of patents per day that PatentCrawler will attempt to download. This setting only controls the rate at which patents are downloaded; it does not keep a running count across invocations of the application. The default is 500.
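As an illustration of the two settings, here is a minimal Python sketch. It is not PatentCrawler's own code. It assumes downloads are spaced evenly over 24 hours and that the cache holds one HTML file per patent number; the function names and the `<number>.html` file layout are hypothetical.

```python
from pathlib import Path

SECONDS_PER_DAY = 24 * 60 * 60


def seconds_between_downloads(patents_per_day: int) -> float:
    """Delay between downloads, assuming they are spread evenly over a day."""
    if patents_per_day <= 0:
        raise ValueError("patents_per_day must be positive")
    return SECONDS_PER_DAY / patents_per_day


def cache_file_for(cache_path: str, patent_number: str) -> Path:
    """Hypothetical cache layout: one raw HTML file per patent number."""
    return Path(cache_path) / f"{patent_number}.html"


def needs_download(cache_path: str, patent_number: str) -> bool:
    """A patent already in the cache does not need to be fetched again."""
    return not cache_file_for(cache_path, patent_number).exists()


if __name__ == "__main__":
    # With the default of 500 patents per day, one download every 172.8 s.
    print(seconds_between_downloads(500))
```

Because the setting is a rate and not a quota, restarting the application does not reset or carry over any daily count; only the spacing between downloads changes.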

Usage

To use PatentCrawler, enter a search string (in the same format as used on the USPTO Advanced Search site). Check the Add Referenced By Patents checkbox if you wish to include all US patents that reference a patent in the search results, and check the Add US References checkbox if you wish to include all US patents cited by a patent in the search results. Then click the Search button. After all of the search results are downloaded and parsed, the Search Set field of the application will update to show the number of patents returned.

Next, press the Start button. PatentCrawler will now begin downloading patents from the USPTO site and placing them in the cache. The time until the next download and the total estimated time are displayed, as well as any status and error messages that may be generated by the download. You may pause the downloading at any time by pressing the Stop button. To restart, press Start again. The result pages that were imported may be saved as a Search Set. Choose File->Save to specify a file to hold the list of imported patent numbers. You may re-open saved searches using the File->Open command.

If the Add Referenced By Patents checkbox is selected, PatentCrawler will store each patent number in the starting search set in a separate list. The size of this list is displayed in the Remaining references field. The referenced-by searches begin only after all the patents in the initial search set, and all of the citations in those patents (if the Add US References checkbox is selected), have been downloaded. As more patents are discovered, they are added to the patents remaining list and are processed before any further referenced-by searches are carried out.
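The ordering described above can be sketched as two queues, with the download queue always drained before the next referenced-by search runs. This Python sketch is one possible reading and not the application's actual code. The `cites` and `cited_by` dictionaries stand in for data that would come from the USPTO site, and it assumes citations are followed only for patents in the starting search set, since the depth is not specified here.

```python
from collections import deque


def crawl_order(start_set, cites, cited_by,
                add_us_references=True, add_referenced_by=True):
    """Return the sequence of actions a crawl would take (illustrative only)."""
    start = list(dict.fromkeys(start_set))  # drop duplicates, keep order
    remaining = deque(start)                # patents waiting to be downloaded
    remaining_refs = deque(start) if add_referenced_by else deque()
    in_start = set(start)
    seen = set(start)
    order = []

    def enqueue(numbers):
        # Newly discovered patents join the download queue once.
        for number in numbers:
            if number not in seen:
                seen.add(number)
                remaining.append(number)

    while remaining or remaining_refs:
        if remaining:
            patent = remaining.popleft()
            order.append(("download", patent))
            # Assumption: only starting-set patents have citations followed.
            if add_us_references and patent in in_start:
                enqueue(cites.get(patent, []))
        else:
            # Downloads are exhausted, so run one referenced-by search.
            patent = remaining_refs.popleft()
            order.append(("referenced-by search", patent))
            enqueue(cited_by.get(patent, []))
    return order


if __name__ == "__main__":
    steps = crawl_order(["A", "B"], {"A": ["C"]}, {"A": ["D"], "B": ["E"]})
    for step in steps:
        print(step)
```

In this example, patents A, B, and C are downloaded first; the referenced-by search for A then discovers D, which is downloaded before the referenced-by search for B starts.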

Export

When you have downloaded at least one patent into the cache, the Export button in the Export tab becomes active. Pressing this button traverses the list of downloaded patents in the current search set and opens those files from the cache. From these files, the fields specified in the Export tab are written to a tab-delimited file which the user selects from a file dialog.
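For readers who want to reproduce the export format, the following Python sketch writes selected fields to a tab-delimited file. It assumes the fields have already been parsed out of the cached HTML into dictionaries; the parsing step is omitted, and the field names and the `export_tab_delimited` function are hypothetical.

```python
import csv
import io


def export_tab_delimited(records, fields, out):
    """Write a header row, then one tab-separated row per patent record."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for record in records:
        # Fields missing from a record are written as empty cells.
        writer.writerow([record.get(field, "") for field in fields])


if __name__ == "__main__":
    # Placeholder records; real values would come from the cached patents.
    records = [
        {"Patent Number": "1234567", "Title": "Widget"},
        {"Patent Number": "7654321"},
    ]
    buffer = io.StringIO()
    export_tab_delimited(records, ["Patent Number", "Title"], buffer)
    print(buffer.getvalue(), end="")
```

Using the `csv` module instead of joining strings by hand means a field value that itself contains a tab or a newline is quoted and does not corrupt the row.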

In addition, there is an Export Special button which generates a report-style output file. This file is not configurable from the application.

Known Bugs

  • Patents whose HTTP requests generate errors are not stored in a separate list, so the downloader could loop endlessly retrying them
  • The export can't differentiate between Delaware and Germany (both are abbreviated DE)
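One possible way to address the first bug is to cap the number of attempts per patent and move persistent failures to their own list. The Python sketch below illustrates that idea only; it is not a fix that exists in PatentCrawler. The `fetch` callable, the `download_all` name, and the limit of three attempts are all assumptions.

```python
from collections import deque


def download_all(patent_numbers, fetch, max_attempts=3):
    """Try each patent up to max_attempts times, then record it as failed."""
    queue = deque((number, 1) for number in patent_numbers)
    downloaded, failed = [], []
    while queue:
        number, attempt = queue.popleft()
        try:
            fetch(number)
        except OSError:
            # Network errors such as urllib.error.URLError derive from OSError.
            if attempt < max_attempts:
                queue.append((number, attempt + 1))  # retry later
            else:
                failed.append(number)  # give up so the loop can finish
        else:
            downloaded.append(number)
    return downloaded, failed


if __name__ == "__main__":
    def fake_fetch(number):
        # Stand-in for a real HTTP request; "bad" always fails.
        if number == "bad":
            raise OSError("simulated request error")

    print(download_all(["1", "bad", "2"], fake_fetch))
```

Because every failure either consumes one of a fixed number of attempts or lands in `failed`, the loop is guaranteed to end even if a patent page can never be retrieved.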
