2018-08-24

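Looking forward to the new crawler in Burp 2.0 (Burp's new crawler)

Burp is about to get its first major version upgrade in ten years.

I rely on it all the time, so I'd welcome a bunch of new features.

This time I translated the article about the feature I'm most interested in: the crawler.

Here is the main text. Some parts are loosely translated, so please let me know if you spot any mistakes.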

Burp's current Spider tool has a number of significant limitations that prevent it from working effectively with modern web sites. Its core model is based on outdated assumptions about how web sites work, which don't apply today.

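The current Spider tool is modeled on how older web sites worked, and that model doesn't fit the sites out there today. It has serious limitations when it comes to crawling modern web sites effectively.

I used to think the Spider was only good for getting a rough sense of a site's size, but I'm hoping this upgrade turns it into a main workhorse!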

The Spider tool maintains a queue of pending requests. It works through the queue issuing each request, looks in the responses for new links and forms, and adds the relevant requests to the queue. This approach worked pretty well for sites that have unique and stable URLs for each function, use simple cookie-based session handling, return deterministic content in each response, and contain no server-side state. But most applications today aren't like that. So we're replacing the Spider tool with something better.

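The current Spider tool maintains a queue of pending requests. It works through the queue issuing each request, looks through each response for new links and forms, and adds the corresponding requests to the queue.

This approach worked well for the following kinds of web sites.

・Sites that have a unique and stable URL for each function

・Sites that use simple cookie-based session handling

・Sites that return deterministic content in each response

・Sites that contain no server-side state

But most web sites today aren't like that, so the Spider tool is being replaced with something better.

It's true that on recent sites, different functions often share the same URL, and the response content often changes a lot depending on the request.

I do wonder what session handling that isn't simple and cookie-based actually looks like.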
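
To check my own understanding of the queue-based model described above, here is a minimal Python sketch. It is not Burp's code: `SITE`, `extract_links`, and `queue_spider` are names I made up, and the "site" is just an in-memory dictionary standing in for real HTTP requests.

```python
from collections import deque
import re

# Hypothetical in-memory "site": URL -> HTML body.
# A real spider would issue HTTP requests instead.
SITE = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a>',
    "/b": '<a href="/">home</a> <a href="/c">C</a>',
    "/c": "",
}


def extract_links(html):
    """Return every href value found in the response body."""
    return re.findall(r'href="([^"]+)"', html)


def queue_spider(fetch, start):
    """Work through a queue of pending URLs, queueing newly discovered links."""
    pending = deque([start])
    seen = {start}
    order = []
    while pending:
        url = pending.popleft()
        order.append(url)
        for link in extract_links(fetch(url)):
            if link not in seen:  # the URL itself is the only identity check
                seen.add(link)
                pending.append(link)
    return order


visited = queue_spider(lambda url: SITE.get(url, ""), "/")
print(visited)
```

Note that the sketch identifies a page purely by its URL, which is exactly the assumption that breaks down on sites where URLs are not unique and stable.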

Burp's new crawler uses a radically different model. It navigates around a target application in the same way as a user with a browser, by clicking links and submitting input. It constructs a map of the application's content and functionality in the form of a directed graph, representing the different locations in the application and the links between those locations:

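The new crawler uses a fundamentally different model.

It moves around the target web site the same way a user with a browser would, by clicking links and submitting input.

It builds a map of the site's content and functionality as a directed graph, representing the locations in the site and the links between them.

So rather than directly sending the requests it has stored, it crawls in an end-to-end style by actually clicking links and so on.

And the headline feature, the map building, is what I'm really looking forward to!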

f:id:madamadaikeru222san:20180824160429p:plain
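
The directed-graph idea can be sketched in a few lines of Python. This is only my illustration of the data structure, assuming locations are plain labels and edges are labeled with the user action that triggers them. `SiteMap` and its methods are hypothetical names, not anything from Burp.

```python
from collections import defaultdict


class SiteMap:
    """Directed graph: location -> {action label: destination location}."""

    def __init__(self):
        self.edges = defaultdict(dict)

    def add_transition(self, src, action, dst):
        """Record that performing `action` at `src` leads to `dst`."""
        self.edges[src][action] = dst
        self.edges.setdefault(dst, {})  # make sure the destination is a node too

    def locations(self):
        """Return all known locations in sorted order."""
        return sorted(self.edges)


site_map = SiteMap()
site_map.add_transition("home", "click 'Login'", "login form")
site_map.add_transition("login form", "submit credentials", "dashboard")
site_map.add_transition("dashboard", "click 'Logout'", "home")

print(site_map.locations())
```

Keeping the action on the edge (a click or a form submission) rather than a raw URL is what makes the map describe navigation instead of a list of requests.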

The new crawler makes no assumptions about the URL structure used by the application. Locations are identified (and re-identified later) based on their contents, not the URL that was used to reach them. This enables the crawler to reliably handle modern applications that place ephemeral data such as CSRF tokens or cache-busters into URL paths. Even if the entire URL within each link changes on every occasion, the crawler still constructs an accurate map:

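The new crawler doesn't care about URL structure. A location is identified (and re-identified later) by its contents, not by the URL used to reach it. This lets the crawler reliably handle web sites that put ephemeral data such as CSRF tokens or cache-busters into URL paths. Even if the entire URL in each link changes every time, the crawler still builds an accurate map.

It's a bit confusing, but I take it to mean that the crawler can still get around properly even when the URLs it visits contain variable data.

Until now the Spider stored URLs with the variable data included, so it would fail when it requested a URL carrying expired data. If that goes away, I expect the crawler to miss less.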
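
The article doesn't say how locations are actually compared, so the following is purely my guess at the general idea. The sketch assumes that the page title, link texts, and form field names are stable while `href` values are not, and hashes only the stable parts. The `fingerprint` function and the sample pages are made up for illustration.

```python
import hashlib
import re


def fingerprint(html):
    """Hash the parts of a page assumed to be stable, ignoring href values."""
    title = re.findall(r"<title>(.*?)</title>", html, flags=re.S)
    link_texts = re.findall(r"<a\b[^>]*>(.*?)</a>", html, flags=re.S)
    field_names = re.findall(r'<input\b[^>]*\bname="([^"]+)"', html)
    features = "|".join(
        title + sorted(text.strip() for text in link_texts) + sorted(field_names)
    )
    return hashlib.sha256(features.encode("utf-8")).hexdigest()[:12]


# Same page reached twice; only the ephemeral URL parts differ.
page_first = '<title>Cart</title><a href="/x9f2/cart?cb=1111">View cart</a><input name="qty">'
page_again = '<title>Cart</title><a href="/k3p7/cart?cb=2222">View cart</a><input name="qty">'
# A genuinely different page.
page_other = '<title>Profile</title><a href="/x9f2/profile">Edit profile</a><input name="email">'

print(fingerprint(page_first) == fingerprint(page_again))
print(fingerprint(page_first) == fingerprint(page_other))
```

A real crawler would need something far more robust than regular expressions, but the point is that the identity of a location comes from what the page contains, not from the URL that led to it.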

The approach also allows the new crawler to handle applications that use the same URL to reach different locations, based on the state of the application or the user's interaction with it:

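The new crawler can also handle web sites that use the same URL to reach different locations, depending on the state of the application or the user's interaction with it.

Being able to use the crawler on single-page applications too is a big step forward.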

The old Spider tracked its remaining work using a queue of pending requests. The new crawler has to track its remaining work in a different way.

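The current Spider tool tracks its remaining work using a queue of pending requests. The new crawler has to track its remaining work in a different way.

Right, the current approach doesn't work well on recent web sites.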

As the crawler navigates around and builds up coverage of the target application, it tracks the edges in the graph that have not been completed. These represent the links (or other navigational transitions) that have been observed within the application but not yet visited. But the crawler never "jumps" to a pending link and visits it out of context. Instead, it either navigates via links from its current location, or reverts to the start location and navigates from there. This replicates as closely as possible the actions of a normal user with a browser:

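As the crawler navigates the target web site and builds up coverage, it tracks the edges in the graph that have not been completed. These represent links (or other navigational transitions) that have been observed in the site but not yet visited. However, the crawler never "jumps" directly to one of these links. Instead, it either navigates via links from its current location, or goes back to the start location and navigates from there. This replicates a normal user's browser actions as closely as possible.

By imitating human browser operation rather than mechanically sending requests, it seems to cover the points that were bottlenecks for the old Spider, such as expired sessions.

For example, even when the crawler (the robot icon) at the top right wants to move to the robot icon at the bottom right, it goes back to the start location and properly follows the transitions from there.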

f:id:madamadaikeru222san:20180824164955p:plain
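
The "never jump" rule can be sketched as a path search over the graph. This is my own illustration under simple assumptions: the graph is a dictionary of known links, and the route is found with a breadth-first search. `GRAPH`, `shortest_path`, and `route_to` are hypothetical names, and the article does not describe how Burp actually picks its route.

```python
from collections import deque

# Hypothetical map of known links: location -> locations reachable by one click.
GRAPH = {
    "start": ["A", "B"],
    "A": ["C"],
    "B": ["D"],
    "C": [],
    "D": [],
}


def shortest_path(graph, src, dst):
    """Breadth-first search; returns the list of locations from src to dst, or None."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def route_to(graph, current, target, start="start"):
    """Return (reverted_to_start, path). Never jumps straight to the target."""
    path = shortest_path(graph, current, target)
    if path is not None:
        return (False, path)  # reachable by following links from where we are
    path = shortest_path(graph, start, target)
    if path is None:
        return None  # target is unknown or unreachable
    return (True, path)  # go back to the start location and walk from there


print(route_to(GRAPH, "start", "C"))
print(route_to(GRAPH, "C", "D"))
```

The second call mirrors the robot example: from `C` there are no links leading to `D`, so the route goes back to the start location and follows `start`, `B`, `D` in order.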