In this post I’m continuing with the implementation of the Get-WebForBrokenLinks function.

Get-WebForBrokenLinks -Web $subweb

Before we can have a look at finding all broken links within a site, we will need to identify where the broken links may be stored. A quick look at SharePoint gives me the following locations:

  • List items
  • Pages
  • Documents in libraries
  • Web Parts

For now I’m going to look at the easiest option: list items.

I’m going to start with the function, making the lists available using Load and ExecuteQuery:

Function Get-WebForBrokenLinks {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory=$True, ValueFromPipeline=$True,
                   ValueFromPipelineByPropertyName=$True,
                   HelpMessage='Web to be scanned for broken links')]
        [Microsoft.SharePoint.Client.Web]
        $Web
    )
    begin {
        Write-Host "Scanning: " $Web.Url
    }
    process {
        $Web.Context.Load($Web.Lists)
        $Web.Context.ExecuteQuery()
        ... # This is where the rest of the code needs to appear
    }
    end {
        Write-Host "Completed scanning: " $Web.Url
    }
}
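As a quick usage sketch, the function could be called from a CSOM session like this. Note that the site URL, the DLL paths and the credential setup below are my own assumptions for illustration, not part of the original script:

```powershell
# Hypothetical setup - adjust the paths and site URL to your environment
Add-Type -Path "C:\Program Files\Common Files\microsoft shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.dll"
Add-Type -Path "C:\Program Files\Common Files\microsoft shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"

$ctx = New-Object Microsoft.SharePoint.Client.ClientContext("https://tenant.sharepoint.com/sites/demo")
$ctx.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($user, $password)

$ctx.Load($ctx.Web)
$ctx.ExecuteQuery()

# Scan the root web; subwebs can be passed in the same way
Get-WebForBrokenLinks -Web $ctx.Web
```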

Now I need to go through the lists and the list items:

ForEach ($list in $web.Lists) {
    $items = Get-PnPListItem -List $list
    foreach ($item in $items) {
        ....
    }
}
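Note that Get-PnPListItem comes from the PnP PowerShell module. If you prefer to stay within plain CSOM, a sketch of the equivalent (assuming the same $web context as above) would be:

```powershell
ForEach ($list in $web.Lists) {
    # CreateAllItemsQuery returns every item; very large lists may need paging
    $query = [Microsoft.SharePoint.Client.CamlQuery]::CreateAllItemsQuery()
    $items = $list.GetItems($query)
    $web.Context.Load($items)
    $web.Context.ExecuteQuery()

    foreach ($item in $items) {
        # ... inspect $item.FieldValues here
    }
}
```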

So now I’m getting the items for all of my lists. Now it becomes important to understand what types of fields SharePoint has, as we step through all the fields in all the items of all the lists.

foreach ($fieldValue in $item.FieldValues) {
    foreach ($value in $fieldValue.Values) {
        if ($value -ne $null) {
            switch ($value.GetType().Name) {
                ....
            }
        }
    }
}

Now all we need to do is handle all the data types that may contain URLs. So what are the data types, and which ones could possibly contain a URL?

To find this out I added a default option to my switch:

default {
    $type = $value.GetType()
    Write-Error "Not supported type: $type"
}

Then I kept rerunning my script until I had collected all the data types. I found the following data types in my lists:

  • Guid
  • Int32
  • ContentTypeId
  • DateTime
  • FieldUserValue
  • FieldLookupValue
  • Boolean
  • Double
  • String[]
  • FieldUrlValue
  • String

Most of these couldn’t possibly contain a URL (e.g. Guid). So, building up my switch, I get the following script:

switch ($value.GetType().Name) {
    "Guid"             { } # Ignore
    "Int32"            { } # Ignore
    "ContentTypeId"    { } # Ignore
    "DateTime"         { } # Ignore
    "FieldUserValue"   { } # Ignore
    "FieldLookupValue" { } # Ignore
    "Boolean"          { } # Ignore
    "Double"           { } # Ignore
    "String[]" {
        ...
    }
    "FieldUrlValue" {
        ...
    }
    "String" {
        ...
    }
    default {
        $type = $value.GetType()
        Write-Error "Not supported type: $type"
    }
}

Ok, so far I only need to write some code for three field types. I’m going to start with FieldUrlValue. This type is easier than String because a String field may contain other text as well:

if ($value.Url.Contains("https://") -or $value.Url.Contains("http://")) {
    try {
        if ((Invoke-WebRequest $value.Url -DisableKeepAlive -UseBasicParsing -Method Head).StatusCode -ne 200) {
            Write-Host "Broken link:" $value.Url
        }
    }
    catch {
        Write-Host "Broken link:" $value.Url
    }
}
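Since the same check will be repeated for the other field types, it may be worth wrapping it in a small helper function. This is only a sketch; the name Test-Link is my own, not part of the original script:

```powershell
# Hypothetical helper - returns $true when the URL responds with HTTP 200
Function Test-Link {
    param([string]$Url)
    try {
        $response = Invoke-WebRequest $Url -DisableKeepAlive -UseBasicParsing -Method Head
        return ($response.StatusCode -eq 200)
    }
    catch {
        # DNS failures, 404s, timeouts etc. all end up here
        return $false
    }
}

if (-not (Test-Link $value.Url)) {
    Write-Host "Broken link:" $value.Url
}
```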

So we are now ready to answer the next critical question: how do I recognize a URL in text? I’ve seen solutions using regular expressions, and although that might be a good way (if you can get it to work!), I’m hoping that I have found an easier way.
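For comparison, a regular-expression approach (just a sketch, not the method used in this post) could look like this:

```powershell
$text = 'text <a href="https://veenstra.me.uk/anypage.html">Link</a> more text'

# https?:// followed by a run of characters that cannot appear around a URL
$found = [regex]::Matches($text, 'https?://[^\s"''<>]+')
foreach ($match in $found) {
    Write-Host "Found URL:" $match.Value
}
```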

It all started by assuming that a URL doesn’t contain a space. So if I have a text with a URL in it, then a split by space gives me an array:

$string = "text https://veenstra.me.uk/anylocation/anypage.html some more text"
$string.split(" ")
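Picking the URL out of the resulting array is then just a matter of filtering. The StartsWith filter below is my own illustration of that step:

```powershell
$string = "text https://veenstra.me.uk/anylocation/anypage.html some more text"

# Keep only the words that look like a URL
$string.Split(" ") | Where-Object { $_.StartsWith("http://") -or $_.StartsWith("https://") }
# -> https://veenstra.me.uk/anylocation/anypage.html
```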

Ok, this will almost work, but not if there isn’t a space before or after the URL. So other than spaces, what else could be separating URLs from text?
I’m first having a look at the HTML:

<a href="http://testurl">Link</a>

As all I’m interested in is getting a variable with a clean URL in it, I could just split by " as well.
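Applied to the anchor tag above, the double-quote split isolates the URL into its own array element:

```powershell
$html = '<a href="http://testurl">Link</a>'

# Splitting on " separates the href value from the surrounding markup
$html.Split('"')
# -> <a href=
# -> http://testurl
# -> >Link</a>
```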

For String fields this results in the following piece of code:

if ($value.Contains("https://") -or $value.Contains("http://") -or $value.Contains("https&#58;//") -or $value.Contains("http&#58;//")) {
    try {
        $words = $value.split(" ")
        foreach ($word in $words) {
            $quotesplitwords = $word.split("`"")
            foreach ($quotesplitword in $quotesplitwords) {
                if ($quotesplitword.Contains("https://") -or $quotesplitword.Contains("http://") -or $quotesplitword.Contains("https&#58;//") -or $quotesplitword.Contains("http&#58;//")) {
                    if ((Invoke-WebRequest $quotesplitword.Replace("&#58;", ":") -DisableKeepAlive -UseBasicParsing -Method Head).StatusCode -ne 200) {
                        Write-Host "Broken link:" $quotesplitword
                    }
                }
            }
        }
    }
    catch {
        Write-Host "Broken link:" $quotesplitword
    }
}

This code now gives only one kind of false positive:

If URLs appear in text without being actual clickable hyperlinks, the script will still flag them up. In fact, any text that contains http will be flagged as a broken link. For now I’m going to live with that, though I’m not sure this will be ok for the remaining locations that may contain broken links.

So this now covers finding broken URLs within list items. There is still quite a bit of work to do:

  • Pages
  • Documents in libraries
  • Web Parts

But these elements will be handled in the next part of this series. Now that we have code that finds URLs within text, we are halfway there.

 

 
