How to build a Web-Crawler for OSINT

We’ve been asked a number of times how our website works.  Trying to explain it usually draws a blank face.  Instead, we’re going to do a post on how web crawlers can be used for open source research (OK, and how our website works).

Health Warning : Very technical.  Requires knowledge of PHP and Python.  If you don’t have it, keep reading and you may get an idea of how search engines work (or ours at least).  We strongly recommend you read our previous posts first (Does a website know you?, Can you find my hidden email address? and New Search engine on the blog).  They will give you a grounding in how the manual techniques work, so when you start reading our code – it will click!

What is a Web-Crawler?

First, what is a web crawler? (as per Wikipedia https://en.wikipedia.org/wiki/Web_crawler).

“A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant, an automatic indexer, or (in the FOAF software context) a Web scutter.”

Types of Crawlers?

This explanation is by no means official, but based on how our site works, we believe there are three distinct types of crawler bot.  First, there are crawlers which are extremely sophisticated and crawl, index and remember where they have been (Google, for example).  Then there are directional crawlers.  Directional crawlers are more specific and have a particular task that needs completing.  Still very sophisticated, but designed for a handful of specific, targeted results.  Usersearch.org has been built mostly on directional crawlers, and we’ve built them from the ground up, from a blank page.  Our web crawlers have been specifically designed to find usernames, pseudonyms, email addresses, phone numbers and website social stats across approximately 500 social networks and forums.  And finally, there are omni-directional crawlers, which are extremely specific with no wiggle room, a bit like accessing an API with very few moving parts.  We use a few of these too.

Techniques

If you’ve read our previous posts (Does a website know you?, Can you find my hidden email address? and New Search engine on the blog) you will see that the manual process of working out whether a website knows a particular email address is really not that difficult.  But if you start doing that process manually on 10, 15, 20 or 100+ websites, it gets boring, fast.  The solution, of course, is to build a directional web crawler (our own definition!).  Directional web crawlers are a bit like directional satellites, where a line of sight must exist for two satellites to communicate: they know exactly where to go, where to look, what to do and where to put the result.  Of course, we need to put some safety measures in place for various conditions, such as an unexpected change in the targeted web page or a page that is temporarily not responding.  But that’s just part of the fun.  We also need to let the crawler know what to do should some data get fed into the system that it may not have expected (such as a space between two words, Fred Hammer rather than Fred_Hammer).
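To give you a feel for those safety measures, here is a minimal sketch.  The example.com URL, the retry count and the underscore rule are our own illustrative assumptions, not code lifted from our live crawlers.

**************************************************************************************************************

import time
import mechanize

def normalise_username(raw_name):
    # Handle unexpected input such as "Fred Hammer" by converting it
    # to the form the target site expects ("Fred_Hammer").
    return raw_name.strip().replace(' ', '_')

def open_with_retries(url, attempts=3, wait=5):
    # If a page is temporarily not responding, try it a few times
    # before giving up, rather than crashing the whole crawl.
    br = mechanize.Browser()
    br.set_handle_robots(False)
    for attempt in range(attempts):
        try:
            return br.open(url)
        except Exception:
            time.sleep(wait)
    return None  # the caller decides what to do with a dead page

# e.g. open_with_retries('http://example.com/users/' + normalise_username('Fred Hammer'))

**************************************************************************************************************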

Basic Python Modules for web-crawling

So, we are not going to cover how to install Python or how to test the modules.  We’re hoping you already know this (if you don’t, we can do a future post if you ask us).  We’re jumping right in.

Good web-crawling technologies:

-Scrapy

-Selenium

-Mechanize

Scrapy we don’t like too much, as it tries to do everything for you.  We’d class this as a generic web crawler (not directional / omni-directional), so it’s not much use for us, but it’s good for mass web-crawling projects.  Selenium is great if you want a ‘point and click’ interface (omni-directional).  You can build a little program in a matter of minutes that will do simple actions like enter your credentials into a web-based email account, sign in and send an email repeatedly.  Pretty cool, as you actually see the actions taking place (mouse movement, Firefox opening, page loading, email being typed etc).  Good for presentations.  Mechanize (omni and directional) is a module that allows you to interact with websites in a similar way to Selenium, but in the background.  This means you can multi-process thousands of iterations at the same time, independently of each other (we use Mechanize).  You can then take the data captured by the crawler and use the power of Python to interact with it or store it in a database.  It’s probably the best free solution on the market, if you can use Python.
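To show what we mean by running iterations independently of each other, here is a minimal sketch using Python’s multiprocessing module.  The example URLs and the four-worker pool are just illustrative assumptions, not our production setup.

**************************************************************************************************************

import mechanize
from multiprocessing import Pool

def fetch_title(url):
    # Each worker builds its own Browser object, so every fetch
    # runs completely independently of the others.
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.addheaders = [('User-agent', 'Firefox')]
    try:
        br.open(url)
        return url, br.title()
    except Exception as err:
        return url, 'failed: %s' % err

if __name__ == '__main__':
    urls = ['http://example.com', 'http://example.org']  # illustrative targets
    pool = Pool(4)  # four crawls running in parallel
    for url, title in pool.map(fetch_title, urls):
        print url, '->', title

**************************************************************************************************************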

Python Mechanize Basics

Mechanize can browse to a web page, access a specific web form, enter details into that form and submit it.  It can then take the result of that form submission and do something else with it, whether that be storing the result in a database or filling in the next page of a form and continuing.

The Python code below simply starts Mechanize by creating a mechanize object (br = mechanize.Browser()) and then opens a website (response = br.open('some_site')).  The program then goes on to list all the links on that page (br.links()).

From here it starts a ‘for’ loop that just iterates through each link found on that page, opens that link and lists all the links of the opened page.  There you have it: in under 10 lines you have a web crawler that will open a webpage, crawl all the links on that page and then continue on to the next set of links.  It is basically walking through every page linked from a particular page.

 

**************************************************************************************************************

import mechanize

br = mechanize.Browser() # creates the browser object
response = br.open('some_site') # opens the site and puts the value in the 'response' variable

current_links = list(br.links()) # list the links on the page

for link in current_links:
    br.follow_link(link) # open the link
    sub_links = list(br.links()) # get the links from the opened page
    for sub_link in sub_links: # open the next lot of links
        br.follow_link(sub_link) # follow the next lot of links

**************************************************************************************************************

Mechanize Cheat Sheet

So the above may be a little confusing.  Below is a step-by-step guide to creating a crawler that will enter some details into a form and then submit it.  You can see from the below that you need some HTML knowledge to locate the form field names.

 

  • Create a browser object and give it some optional settings.
import mechanize
br = mechanize.Browser()
br.set_all_readonly(False)    # allow everything to be written to
br.set_handle_robots(False)   # ignore robots
br.set_handle_refresh(False)  # can sometimes hang without this
br.addheaders = [('User-agent', 'Firefox')]   # set a browser-like User-agent header


  • Open a webpage and inspect its contents
response = br.open(url)
print response.read()      # the text of the page
response1 = br.response()  # get the response again
print response1.read()     # can apply lxml.html.fromstring()

  • List the forms that are in the page
for form in br.forms():
    print "Form name:", form.name
    print form

  • To go on, the mechanize browser object must have a form selected
br.select_form("form1")         # works when form has a name
br.form = list(br.forms())[0]  # use when form is unnamed
  • Iterate through the controls in the form.
for control in br.form.controls:
    print control
    print "type=%s, name=%s value=%s" % (control.type, control.name, br[control.name])
  • Controls can be found by name
control = br.form.find_control("controlname")

Having a select control tells you what values can be selected:
if control.type == "select":  # means it's class ClientForm.SelectControl
    for item in control.items:
        print " name=%s values=%s" % (item.name, str([label.text for label in item.get_labels()]))

  • Because ‘Select’ type controls can have multiple selections, they must be set with a list, even if it is one element.
print control.value
print control  # selected value is starred
control.value = ["ItemName"]
print control
br[control.name] = ["ItemName"]  # equivalent and more normal
  • Controls can be set to readonly and disabled.
control.readonly = False
control.disabled = True
  • OR disable all of them like so
for control in br.form.controls:
   if control.type == "submit":
       control.disabled = True
  • When your form is complete you can submit
    response = br.submit()
    print response.read()
    br.back()   # go back

Our example:
So below is an example we’ve created.  It’s not part of our website, but it works just fine.
This crawler is designed to jump through several web forms, filling out data to cause a result at the end
(Hint: https://usersearch.org/blog/index.php/2015/09/28/does-a-dating-website-know-you/).  It’s not commented, I’m afraid, but it’s
self-explanatory if you’ve read the above cheat sheet.

 


import cookielib
import mechanize

def email_check_complex(email, location, site_name, form_selection, search_term, number_of_inputs, input_one, input_two, input_three, success_value, page_jump_through): # click-through site check
    # Browser
    site = mechanize.Browser(factory=mechanize.RobustFactory())
    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    site.set_cookiejar(cj)
    # Browser options
    site.set_handle_equiv(True)
    site.set_handle_gzip(False)
    site.set_handle_redirect(True)
    site.set_handle_referer(True)
    site.set_handle_robots(False)
    # Follows refresh 0 but not hangs on refresh > 0
    site.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=5)
    # User-Agent
    site.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]    
    
    site_opened = site.open(location) #Open site
    #site.select_form(nr=form_selection) #Select form number from value in-putted

    if page_jump_through == 1: # jump through 1 pages
        site.select_form(nr=form_selection)
        site.form[input_three] = email # enter the email address we are checking
        site.submit()
    elif page_jump_through == 2: # jump through 2 pages
        site.select_form(nr=form_selection)
        site.submit()
        site.select_form(nr=form_selection)
        site.form[input_three] = email
        site.submit()
    elif page_jump_through == 3: # jump through 3 pages
        site.select_form(nr=form_selection)
        site.submit()
        site.select_form(nr=form_selection)
        site.submit()
        site.select_form(nr=form_selection)
        site.form[input_three] = email
        site.submit()
    elif page_jump_through == 4: # jump through 4 pages
        site.select_form(nr=form_selection)
        site.submit()
        site.select_form(nr=form_selection)
        site.submit()
        site.select_form(nr=form_selection)
        site.submit()
        site.select_form(nr=form_selection)
        site.form[input_three] = email
        site.submit()
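To give an idea of how the function is called, here is a hypothetical invocation.  Every value below (the URL, form index, field name and marker text) is made up purely for illustration and would come from manually inspecting the target site first.

**************************************************************************************************************

email_check_complex(
    email='fred_hammer@example.com',        # the address we are checking
    location='https://example-dating-site.com/password-reset',
    site_name='ExampleDating',
    form_selection=0,                       # first form on each page
    search_term='',                         # unused in this snippet
    number_of_inputs=1,
    input_one='', input_two='',
    input_three='email',                    # name of the email field on the final form
    success_value='Email already registered',
    page_jump_through=2)                    # two pages to click through

**************************************************************************************************************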

 

So the above script has sent some commands to a website form and the response will be the result of that submission (in our case a result saying ‘Email already registered’).  Now, you need to build something to retrieve this response and do something clever with it (this is where your OSINT skills come in handy).
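As a very rough sketch of that retrieval step, assuming you capture the final site.submit() call in a variable and that ‘Email already registered’ is the marker phrase the site returns (both assumptions are ours, for illustration only):

**************************************************************************************************************

def response_contains(response, marker):
    # Read the HTML returned by the form submission and check it
    # for a tell-tale phrase such as 'Email already registered'.
    html = response.read().lower()
    return marker.lower() in html

# Hypothetical usage, with the last submission captured inside the function:
# result = site.submit()
# if response_contains(result, 'Email already registered'):
#     print 'The email address appears to be known to this site'

**************************************************************************************************************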

Now, if you’ve read this far and even 5-10% has sunk in, then well done, we’re happy.  If you’ve reached this far and, like us, web crawling makes you want to get up in the morning and code…you may want to read our next post.  We’ll build on the examples we’ve shown you and put some code together on how you can retrieve the final response, do some snazzy stuff with it and check it for particular keywords that will determine whether your expected email exists at a given location or not.  THEN you can continue and even automate what you would do next (you’re making an auto open-source searcher, well done you!)  Who needs a team when you can code!

Give it a go yourself and compare your code with ours next week!

And that’s all we have time for I’m afraid. Any questions, just post / email and we’ll try and answer.

