Build an E-mail + Phone Number Scraper in 7 Minutes with Google Sheets, Standard Library and Node.js

Keith Horwood
9 min readJul 10, 2019

Web crawling! Since the early days of the world wide web, people have been crawling webpages to categorize the information contained within. This is the main function of search engines like Google and Bing. But sometimes we want to automate our own crawlers — getting information like e-mails and phone numbers can be helpful for finding contact or sales information for customers.

Setting up Build on Standard Library is Easy!

For those who aren’t familiar, Build on Standard Library is a product that allows you to automatically generate Node.js code-based workflows in response to events, as well as link API accounts together in minutes. We’ll be using it to query Google Sheets like a database using the googlesheets.query API, then scrape E-mails and Phone Numbers with our crawler API from any domain on a once per minute schedule.

What You’ll Need

Minute 1: Set up your Google Spreadsheet

To get started, we’ll first set up the Google Spreadsheet we want to automatically enrich with E-mail and Phone Number information. Simply visit https://sheets.google.com/ and Create a New Blank Spreadsheet. We recommend filling it with the following information to begin with.

NOTE: You must name your columns as described here for this to work!

Set up a new Google Sheet

Please make sure the first row (index 1) is filled with the headers: Domain, E-mails, Phone Numbers, and Complete. This is important, because we’ll be using these headers with the Google Sheets query API on Standard Library to treat our Google Sheet like a database. You can fill in the Domains with the following test values for now:

twilio.com
salesforce.com
stripe.com
slack.com
newrelic.com
airtable.com
facebook.com
twitter.com
producthunt.com
stdlib.com

We recommend naming your new Spreadsheet something like Scrape E-mails and Phone Numbers to begin with. That’s it! That’s all you need on the Google Sheets side of things.

Minute 2: Set up your Crawling / Scraping Workflow

Next, we’ll head over to https://build.stdlib.com (Build on Standard Library) to create our crawling workflow. You’ll want to pick the following options:

Event:
Scheduler → Once a minute

Actions:
Google Sheets → Select Rows from a Spreadsheet by querying it like a DB
Crawler → Scrape a domain for email addresses
Crawler → Scrape a domain for phone numbers
Google Sheets → Replace Rows in a Spreadsheet by querying it like a DB

Click “Create Workflow” to continue once set up

Once these options have been selected, you can click Create Workflow. If you have not registered for Standard Library yet, you will be prompted to sign up (or log in) at this stage.

Minute 3: Link your Google Spreadsheet

The next step is to Link your Google Spreadsheet. Standard Library, on top of easily generating Node.js code-based workflows for you, also provides something called Identity Tokens that securely store and manage credentials to third party APIs for you without you needing to read hours of documentation. After clicking Create Workflow on the last step, we should see a screen like this:

You can link your Google Spreadsheet now

From here, we’ll want to click the Link Resource button to the right of Google Sheets…

Click “Link Resource”

And we’ll be prompted with this screen:

Click “Link New Resource”

Click Link New Resource to proceed, which will start the Google OAuth flow. Make sure you choose the account you created the spreadsheet with!

Choose your G-mail Account

You should now see a progress bar indicating account link progress — hold tight, we’re almost done!

Linking data from Google Sheets

Aha! When this step completes, a list of your available spreadsheets will pop up. Choose Scrape E-mails and Phone Numbers (or whatever you named your sheet) to proceed.

Click “Finish”

You can now click Finish to finish linking your spreadsheet. You’ll see another progress indicator…

Linking…

And be brought back to the Identity Management screen. It should look something like this, with Identity Generated on the bottom right of the Link Accounts prompt.

Hit Next to Continue

We can now click Next to continue, and start prototyping our workflow!

Minute 4: Prototype your Workflow

You should now be brought to a screen that looks like this — it’s a lot, because we have a big workflow!

Manage your Workflow

To make sure this is all working correctly, hit the green Pause buttons next to the bottom three actions so your workflow looks something like this:

Pause the last three workflow steps

Now scroll down on this workflow editor, and make sure you’re editing the googlesheets.query step (it should not be paused). In the range field, enter A1:Z1000, this is the A1 notation of the total size of the spreadsheet we’ll turn into a database. The FIRST_EMPTY_ROW selection means our data will be truncated (table limited) at the first empty row in the spreadsheet — we won’t be dealing with hundreds of empty rows.

Now, keep scrolling down to the very bottom until you see this helpful button at the bottom of the workflow editor:

Click Run with Test Event

To make sure we’re talking to the Google Sheets API properly, click Run with Test Event. If everything works, you should see a result like the following:

Great! It worked.

Each Google Sheet row has now been turned into an object with the column names as the field names or “keys” of the object. This allows us to manipulate spreadsheets a lot more easily.

Minute 5: Customize our Workflow Code

We can play around more with the workflow editor if we want, but let’s just get to the solution! Behind the scenes, this workflow editor has been generating code for us. If you scroll back up to the workflow editor, you might notice a Developer Mode toggle switch on the bottom right:

See the Developer Mode switch?

Click it to toggle it to ON. Your workflow editor should now be locked — however, you’ll see there’s a whole bunch of Node.js code we’ve generated in the background! You can edit this all you want before shipping.

We now have access to the code we’ve been generating

Now, we’ll copy and paste the following code. This code performs the following actions:

  • Retrieves the first two rows where Complete is not TRUE
  • Finds all e-mails associated with each Domain
  • Finds all phone numbers associated with each Domain
  • Sets Complete to TRUE
  • Replaces all Google Sheets rows with the newly crawled data

Be sure to copy and paste this code exactly as-is into the code editor provided:

Once copied, scroll to the bottom of the editor and hit Run with Test Event again.

Click Run with Test Event once more

You should see a result like this:

It worked!

And that’s it! If we go back to our Google Sheet, we should see the first two rows filled out:

Nice! We got some data.

We see here that Salesforce.com didn’t return any E-mails — it means we couldn’t find any. Unfortunately, you won’t match data for every single website: but we can get a lot. That’s it! We can now ship this workflow.

Minute 6: Ship our Workflow!

After running this workflow successfully in the last step, the blue Next button on the bottom right should be enabled, click it to proceed.

Note: If the button is disabled, make sure you Run with Test Event and don’t modify any code before proceeding!

Click the blue Next button

After clicking Next, you’ll be brought to this screen:

Name your Project and Workflow Filename

The name of your Project should be automatically generated, along with the filename in which we’ll store our code-based workflow. Simply click Alright, Ship it! to proceed. You’ll see the following prompt, which should take about 10s to complete, give or take.

Click View Project to proceed

That’s it! Our automatic crawler and scraper has been shipped. But you might want to make modifications or changes, so…

Minute 7: Managing our Workflow Project

Once you click View Project you’ll be brought to a project management screen. From here you can do all sorts of cool things — you can learn more about modifying workflows in our documentation. But we’ll start you with the basics. You should be looking at this screen:

Click dev (click to manage)

Simply click dev (click to manage) to see your workflow development environment. You’ll see a summary of your workflow project and the API actions it’s taking. You’ll notice this is running on a schedule — it’ll automatically run once per minute.

You can click Edit Code (bottom right) to modify workflow code

From here, we can edit this workflow code easily by clicking Edit Code to open up Code on Standard Library:

Edit your Code, click Up to re-deploy!

This is our in-browser editing environment. To modify the workflow code, simply click Up (the blue button in the bottom right). For example, you might want to change the limit: {'count': 2, 'offset': 0} on lines 22 to 25 to enrich more e-mails / phone numbers at a time.

Minute 7.5: Changing Your Google Sheet

You may decide you want to change which Google Sheet is linked to your project. At any time, navigate to the Authentication tab for your project environment:

Click the Edit button under Actions

Click the Edit button under Actions to bring up the option to unassign your Google Sheet.

Click Unassign to remove this Sheet.

Click Unassign and the spreadsheet will be unassigned. The workflow will be effectively defunct while not linked to a Google Sheet, but you can re-link the same sheet (or add any additional ones) by simply clicking Link Resource once again:

Click Link Resource to assign a new Google Sheets

That’s it!

You now have full control over automating Google Spreadsheets like a Database. You may notice that your spreadsheet automatically fills itself out in the background:

Wow! Automatic.

Note that not all domains will return results. That’s okay! If you feel like digging deeper and writing your own crawler code instead of using our crawler APIs — you can customize your Node.js code ’til you’re happy!

Thanks for reading, and happy building! You can follow us for more updates on Twitter at @StdLibHQ or check out or blog, https://stdlib.com/blog for frequent updates.

Keith Horwood is the founder and CEO of Standard Library. When he’s not busy writing code, he’s waiting for Halo: Infinite to drop or playing too much Smash Bros. You can follow him on Twitter at @keithwhor.

--

--