How to build a Web-Crawler for OSINT

We’ve been asked a number of times how our website works.  Trying to explain it usually draws a blank face.  Instead, we’re going to do a post on how web crawlers can be used for open source research (OK, and how our website works).

Health Warning : Very technical.  Requires knowledge of PHP and Python.  If you don’t have it, keep reading and you may get an idea of how search engines work (or ours at least).  We strongly recommend you read our previous posts first (Does a website know you?, Can you find my hidden email address? and New Search engine on the blog).  They will give you a grounding in how the manual techniques work, so when you start reading our code – it will click!

What is a Web-Crawler?

First, what is a web crawler? (as per Wikipedia https://en.wikipedia.org/wiki/Web_crawler).

“A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant, an automatic indexer, or (in the FOAF software context) a Web scutter.”

Types of Crawlers?

This explanation is by no means official, but based on how our site works, we believe there are three distinct types of crawler bot.  First, there are crawlers which are extremely sophisticated and crawl, index and remember where they have been (Google, for example).  Then there are directional crawlers.  Directional crawlers are more specific and have a particular task that needs completing.  Still very sophisticated, but designed for a handful of specific, targeted results.  Usersearch.org has been built mostly on directional crawlers, and we’ve built them from the ground up, from a blank page.  Our web crawlers have been specifically designed to find usernames, pseudonyms, email addresses, phone numbers and website social stats across approximately 500 social networks and forums.  And finally, there are omni-directional crawlers, which are extremely specific with no wiggle room, a bit like accessing an API with very few moving parts.  We use a few of these too.

Techniques

If you’ve read our previous posts (Does a website know you?, Can you find my hidden email address? and New Search engine on the blog) you will see that the manual process of working out whether a website knows a particular email address is really not that difficult.  But if you start doing that process manually on 10, 15, 20 or 100+ websites, it gets boring, fast.  The solution, of course, is to build a directional web crawler (our own definition!).  Directional web crawlers are a bit like directional satellites, where a line of sight must exist for two satellites to communicate: they know exactly where to go, where to look, what to do and where to put the result.  Of course, we need to put some safety measures in place for various conditions, such as an unexpected change in the targeted web page or a page that is temporarily not responding.  But that’s just part of the fun.  We also need to let the crawler know what to do should some data get fed into the system that it may not have expected (such as a space between two words, Fred Hammer rather than Fred_Hammer).
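To give you a feel for those safety measures, here is a minimal sketch.  The example.com URL, the retry count and the underscore rule are our own illustrative assumptions, not code lifted from our live crawlers.

**************************************************************************************************************

import time
import mechanize

def normalise_username(raw_name):
    # Handle unexpected input such as "Fred Hammer" by converting it
    # to the form the target site expects ("Fred_Hammer").
    return raw_name.strip().replace(' ', '_')

def open_with_retries(url, attempts=3, wait=5):
    # If a page is temporarily not responding, try it a few times
    # before giving up, rather than crashing the whole crawl.
    br = mechanize.Browser()
    br.set_handle_robots(False)
    for attempt in range(attempts):
        try:
            return br.open(url)
        except Exception:
            time.sleep(wait)
    return None  # the caller decides what to do with a dead page

# e.g. open_with_retries('http://example.com/users/' + normalise_username('Fred Hammer'))

**************************************************************************************************************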

Basic Python Modules for web-crawling

So, we are not going to cover how to install Python or how to test the modules.  We’re hoping you already know this (if you don’t, we can do a future post if you ask us).  We’re jumping right in.

Good web-crawling technologies:

-Scrapy

-Selenium

-Mechanize

Scrapy we don’t like too much, as it tries to do everything for you.  We’d class this as a generic web crawler (not directional / omni-directional), so it’s not much use for us, but it’s good for mass web-crawling projects.  Selenium is great if you want a ‘point and click’ interface (omni-directional).  You can build a little program in a matter of minutes that will do simple actions like enter your credentials into a web-based email account, sign in and send an email repeatedly.  Pretty cool, as you actually see the actions taking place (mouse movement, Firefox opening, page loading, email being typed etc).  Good for presentations.  Mechanize (omni and directional) is a module that allows you to interact with websites in a similar way to Selenium, but in the background.  This means you can multi-process thousands of iterations at the same time, independently of each other (we use Mechanize).  You can then take the data captured by the crawler and use the power of Python to interact with it or store it in a database.  It’s probably the best free solution on the market, if you can use Python.
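To show what we mean by running iterations independently of each other, here is a minimal sketch using Python’s multiprocessing module.  The example URLs and the four-worker pool are just illustrative assumptions, not our production setup.

**************************************************************************************************************

import mechanize
from multiprocessing import Pool

def fetch_title(url):
    # Each worker builds its own Browser object, so every fetch
    # runs completely independently of the others.
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.addheaders = [('User-agent', 'Firefox')]
    try:
        br.open(url)
        return url, br.title()
    except Exception as err:
        return url, 'failed: %s' % err

if __name__ == '__main__':
    urls = ['http://example.com', 'http://example.org']  # illustrative targets
    pool = Pool(4)  # four crawls running in parallel
    for url, title in pool.map(fetch_title, urls):
        print url, '->', title

**************************************************************************************************************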

Python Mechanize Basics

Mechanize can browse to a web page, access a specific web form, enter details into that form and submit it.  It can then take the result of that form submission and do something else with it, whether that be storing the result in a database or filling in the next page of a form and continuing.

The Python code below simply starts Mechanize by creating a mechanize object (br = mechanize.Browser()) and then opens a website (response = br.open('some_site')).  The program then goes on to list all the links on that page (br.links()).

From here it starts a ‘for’ loop that just iterates through each link found on that page, opens that link and lists all the links of the opened page.  There you have it: in under 10 lines you have a web crawler that will open a webpage, crawl all the links on that page and then continue on to the next set of links.  It is basically walking through every page linked from a particular page.

 

**************************************************************************************************************

import mechanize

br = mechanize.Browser() # creates the browser object
response = br.open('some_site') # opens the site and puts the value in the 'response' variable

current_links = list(br.links()) # list the links on the page

for link in current_links:
    br.follow_link(link) # open the link
    sub_links = list(br.links()) # get the links from the opened page
    for sub_link in sub_links: # open the next lot of links
        br.follow_link(sub_link) # follow the next lot of links

**************************************************************************************************************

Mechanize Cheat Sheet

So the above may be a little confusing.  Below is a step-by-step guide to creating a crawler that will enter some details into a form and then submit it.  You can see from the below that you need some HTML knowledge to locate the form field names.

 

  • Create a browser object and give it some optional settings.
import mechanize
br = mechanize.Browser()
br.set_all_readonly(False)    # allow everything to be written to
br.set_handle_robots(False)   # ignore robots
br.set_handle_refresh(False)  # can sometimes hang without this
br.addheaders = [('User-agent', 'Firefox')]   # set a browser-like User-agent header


  • Open a webpage and inspect its contents
response = br.open(url)
print response.read()      # the text of the page
response1 = br.response()  # get the response again
print response1.read()     # can apply lxml.html.fromstring()

  • List the forms that are in the page
for form in br.forms():
    print "Form name:", form.name
    print form

  • To go on, the mechanize browser object must have a form selected
br.select_form("form1")         # works when form has a name
br.form = list(br.forms())[0]  # use when form is unnamed
  • Iterate through the controls in the form.
for control in br.form.controls:
    print control
    print "type=%s, name=%s value=%s" % (control.type, control.name, br[control.name])
  • Controls can be found by name
control = br.form.find_control("controlname")

Having a select control tells you what values can be selected:
if control.type == "select":  # means it's class ClientForm.SelectControl
    for item in control.items:
        print " name=%s values=%s" % (item.name, str([label.text for label in item.get_labels()]))

  • Because ‘Select’ type controls can have multiple selections, they must be set with a list, even if it is one element.
print control.value
print control  # selected value is starred
control.value = ["ItemName"]
print control
br[control.name] = ["ItemName"]  # equivalent and more normal
  • Controls can be set to readonly and disabled.
control.readonly = False
control.disabled = True
  • OR disable all of them like so
for control in br.form.controls:
   if control.type == "submit":
       control.disabled = True
  • When your form is complete you can submit
    response = br.submit()
    print response.read()
    br.back()   # go back

Our example:
So below is an example we’ve created.  It’s not part of our website, but it works just fine.
This crawler is designed to jump through several web forms, filling out data to cause a result at the end
(Hint: https://usersearch.org/blog/index.php/2015/09/28/does-a-dating-website-know-you/).  It’s not commented, I’m afraid, but it’s
self-explanatory if you’ve read the above cheat sheet.

 


import cookielib
import mechanize

def email_check_complex(email, location, site_name, form_selection, search_term, number_of_inputs, input_one, input_two, input_three, success_value, page_jump_through): # click-through site check
    # Browser
    site = mechanize.Browser(factory=mechanize.RobustFactory())
    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    site.set_cookiejar(cj)
    # Browser options
    site.set_handle_equiv(True)
    site.set_handle_gzip(False)
    site.set_handle_redirect(True)
    site.set_handle_referer(True)
    site.set_handle_robots(False)
    # Follows refresh 0 but not hangs on refresh > 0
    site.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=5)
    # User-Agent
    site.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]    
    
    site_opened = site.open(location) #Open site
    #site.select_form(nr=form_selection) #Select form number from value in-putted

    if page_jump_through == 1: # jump through 1 pages
        site.select_form(nr=form_selection)
        site.form[input_three] = email # enter the email address we are checking
        site.submit()
    elif page_jump_through == 2: # jump through 2 pages
        site.select_form(nr=form_selection)
        site.submit()
        site.select_form(nr=form_selection)
        site.form[input_three] = email
        site.submit()
    elif page_jump_through == 3: # jump through 3 pages
        site.select_form(nr=form_selection)
        site.submit()
        site.select_form(nr=form_selection)
        site.submit()
        site.select_form(nr=form_selection)
        site.form[input_three] = email
        site.submit()
    elif page_jump_through == 4: # jump through 4 pages
        site.select_form(nr=form_selection)
        site.submit()
        site.select_form(nr=form_selection)
        site.submit()
        site.select_form(nr=form_selection)
        site.submit()
        site.select_form(nr=form_selection)
        site.form[input_three] = email
        site.submit()
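To give an idea of how the function is called, here is a hypothetical invocation.  Every value below (the URL, form index, field name and marker text) is made up purely for illustration and would come from manually inspecting the target site first.

**************************************************************************************************************

email_check_complex(
    email='fred_hammer@example.com',        # the address we are checking
    location='https://example-dating-site.com/password-reset',
    site_name='ExampleDating',
    form_selection=0,                       # first form on each page
    search_term='',                         # unused in this snippet
    number_of_inputs=1,
    input_one='', input_two='',
    input_three='email',                    # name of the email field on the final form
    success_value='Email already registered',
    page_jump_through=2)                    # two pages to click through

**************************************************************************************************************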

 

So the above script has sent some commands to a website form and the response will be the result of that submission (in our case a result saying ‘Email already registered’).  Now, you need to build something to retrieve this response and do something clever with it (this is where your OSINT skills come in handy).
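As a very rough sketch of that retrieval step, assuming you capture the final site.submit() call in a variable and that ‘Email already registered’ is the marker phrase the site returns (both assumptions are ours, for illustration only):

**************************************************************************************************************

def response_contains(response, marker):
    # Read the HTML returned by the form submission and check it
    # for a tell-tale phrase such as 'Email already registered'.
    html = response.read().lower()
    return marker.lower() in html

# Hypothetical usage, with the last submission captured inside the function:
# result = site.submit()
# if response_contains(result, 'Email already registered'):
#     print 'The email address appears to be known to this site'

**************************************************************************************************************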

Now, if you’ve read this far and even 5-10% has sunk in, then well done, we’re happy.  If you’ve reached this far and, like us, web crawling makes you want to get up in the morning and code…you may want to read our next post.  We’ll build on the examples we’ve shown you and put some code together on how you can retrieve the final response, do some snazzy stuff with it and check it for particular keywords that will determine whether your expected email exists at a given location or not.  THEN you can continue and even automate what you would do next (you’re making an auto open-source searcher, well done you!)  Who needs a team when you can code!

Give it a go yourself and compare your code with ours next week!

And that’s all we have time for I’m afraid. Any questions, just post / email and we’ll try and answer.

