C#: Parsing HTML with HtmlAgilityPack

Posted by geethalakshmi on Wed, 09 Mar 2022 15:57:57 +0100

HtmlAgilityPack is an open-source C# class library for parsing HTML quickly. Put simply, it turns HTML into a tree of nodes that can be queried with XPath, much like parsing XML, and it supports modifying nodes and their attributes.

Links: Official website | GitHub source code

Loading HTML in multiple ways
There are three main ways to load a document: from a URL, from a string, and from a file.

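using System.Text;
using HtmlAgilityPack;

var doc = new HtmlDocument();
// Load directly from a URL (HtmlWeb.Load returns a new HtmlDocument)
doc = new HtmlWeb().Load("https://www.baidu.com/");
// Load from a string; html is a placeholder for markup you already have in memory
string html = "<html><body><p>Hello</p></body></html>";
doc.LoadHtml(html);
// Load from a file; the encoding of the HTML file can be specified
doc.Load(@"C:\index.html", Encoding.UTF8);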
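
To make the string-loading path concrete, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced and that the project allows C# top-level statements; the markup is a made-up sample, not taken from a real page.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical markup standing in for a downloaded page or a file's contents.
string html = "<html><head><title>Demo</title></head><body><div>Hello</div></body></html>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// DocumentNode is the root of the parsed tree; queries start from it.
HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
Console.WriteLine(title.InnerText);   // prints "Demo"
```

Whichever loading method is used, the result is the same kind of HtmlDocument, so the rest of the code does not need to care where the markup came from.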

Common HtmlNode methods

SelectNodes() returns an HtmlNodeCollection and SelectSingleNode() returns an HtmlNode. Both work much like their XmlDocument counterparts for XML data.

A leading "/" starts the search from the root node. Two slashes "//" match descendants at any depth; a single slash "/" matches only direct children (that is, not grandchildren); a leading dot, as in "./" or ".//", starts the search from the current node rather than the root (it only appears at the beginning of an XPath expression).

Note:

Matching on id and class attribute values is case-sensitive
XPath indexes start at 1
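
The path rules and the two notes above can be checked with a small sketch. It assumes the HtmlAgilityPack NuGet package and C# top-level statements; the markup and ids are invented for illustration.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(
    "<div id='outer'>" +
      "<span>first</span>" +
      "<div id='inner'><span>nested</span></div>" +
      "<span>last</span>" +
    "</div>");

HtmlNode outer = doc.DocumentNode.SelectSingleNode("//div[@id='outer']");

// "./span" matches only direct children of the current node.
Console.WriteLine(outer.SelectNodes("./span").Count);                  // 2
// ".//span" matches descendants of the current node at any depth.
Console.WriteLine(outer.SelectNodes(".//span").Count);                 // 3
// Indexes are 1-based: span[1] is the first span child.
Console.WriteLine(outer.SelectSingleNode("./span[1]").InnerText);      // first
// Attribute values are compared case-sensitively, so 'Outer' matches nothing.
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//div[@id='Outer']") == null);  // True
```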

1. Select the corresponding node through attribute and path matching
var node = doc.DocumentNode;

// Select div nodes that have no class attribute
var divsWithoutClass = node.SelectNodes(".//div[not(@class)]");

// Select div nodes that have neither a class nor an id attribute
var divsWithoutClassOrId = node.SelectNodes(".//div[not(@class) and not(@id)]");

// Select span nodes whose class contains "expire"
var spansContaining = node.SelectNodes(".//span[contains(@class,'expire')]");

// Select span nodes whose class does not contain "expire"
var spansNotContaining = node.SelectNodes(".//span[not(contains(@class,'expire'))]");

// Select span nodes with class="expire"
var spansExact = node.SelectNodes(".//span[@class='expire']");

// Select the first child div of the div with id="expire"
var firstChildDiv = node.SelectSingleNode(".//div[@id='expire']/div[1]");
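
Two details of these queries are easy to miss: contains() is a substring test, and SelectNodes returns null rather than an empty collection when nothing matches (with the library's default options). A minimal sketch, assuming the HtmlAgilityPack NuGet package and made-up markup:

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(
    "<div>" +
      "<span class='expire'>a</span>" +
      "<span class='expired soon'>b</span>" +
      "<span class='active'>c</span>" +
    "</div>");
HtmlNode root = doc.DocumentNode;

// contains() is a substring test, so it also matches class='expired soon'.
HtmlNodeCollection loose = root.SelectNodes(".//span[contains(@class,'expire')]");
Console.WriteLine(loose.Count);    // 2

// An exact comparison matches only class='expire'.
HtmlNodeCollection exact = root.SelectNodes(".//span[@class='expire']");
Console.WriteLine(exact.Count);    // 1

// No match: the result is null, so check before reading Count or looping.
HtmlNodeCollection none = root.SelectNodes(".//span[@class='missing']");
Console.WriteLine(none == null);   // True
```

A null check before iterating the result avoids a NullReferenceException on pages where the expected nodes are absent.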

2. Get node text content

Depending on what you need, the content of a node can be read in different ways.
OuterHtml: returns the HTML of the current node, including the node itself
InnerHtml: returns the HTML of all child nodes of the current node
InnerText: returns the text content of the current node with all HTML tags removed

<div id="title">
  <p>
   <a class="MainTitle" href="https://www.cnblogs.com/cplemom/">Fu Xiaohui</a>
  </p>
</div>

Take the HTML above as an example:

var node = doc.DocumentNode.SelectSingleNode("//div[@id='title']/p");

var outer = node.OuterHtml; // Result (whitespace omitted): <p><a class="MainTitle" href="https://www.cnblogs.com/cplemom/">Fu Xiaohui</a></p>

var inner = node.InnerHtml; // Result (whitespace omitted): <a class="MainTitle" href="https://www.cnblogs.com/cplemom/">Fu Xiaohui</a>

var text = node.InnerText; // Result (whitespace omitted): Fu Xiaohui
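
The three properties can be compared side by side with a self-contained sketch. It assumes the HtmlAgilityPack NuGet package; the markup is the article's sample written on one line so that no extra whitespace appears in the output. The cleanup step at the end is an addition of this sketch, not part of the original walkthrough.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(
    "<div id=\"title\"><p>" +
    "<a class=\"MainTitle\" href=\"https://www.cnblogs.com/cplemom/\">Fu Xiaohui</a>" +
    "</p></div>");

HtmlNode p = doc.DocumentNode.SelectSingleNode("//div[@id='title']/p");

Console.WriteLine(p.OuterHtml);   // the <p> element itself plus everything inside it
Console.WriteLine(p.InnerHtml);   // only the <a> element inside the <p>
Console.WriteLine(p.InnerText);   // prints "Fu Xiaohui"

// InnerText keeps surrounding whitespace and HTML entities as written,
// so real pages usually need a cleanup step like this one.
string clean = HtmlEntity.DeEntitize(p.InnerText).Trim();
Console.WriteLine(clean);
```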

3. Get / modify node attribute values

Taking the HTML above as an example, suppose we have selected the a tag node. We want to read the link address the a tag points to and change it to an address of our own. The href attribute is used here, but the same calls work for attributes such as class, src, and id.

var node = doc.DocumentNode.SelectSingleNode("//div[@id='title']/p/a");

// The second parameter is the default value returned when the attribute cannot be found
var url = node.GetAttributeValue("href", ""); // Result: https://www.cnblogs.com/cplemom/

// Set the attribute value
node.SetAttributeValue("href", "http://www.cplemom.com/");

// Get all attributes (ToList requires using System.Linq)
var list = node.Attributes.ToList();
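
As a runnable illustration of the default value and of a round trip through SetAttributeValue, here is a short sketch. It assumes the HtmlAgilityPack NuGet package; the title attribute is simply an attribute that is known to be absent from the sample markup.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<p><a class=\"MainTitle\" href=\"https://www.cnblogs.com/cplemom/\">Fu Xiaohui</a></p>");
HtmlNode a = doc.DocumentNode.SelectSingleNode("//a");

// A missing attribute yields the supplied default instead of throwing.
Console.WriteLine(a.GetAttributeValue("title", "(none)"));   // (none)

// After setting the value, reading it back returns the new address.
a.SetAttributeValue("href", "http://www.cplemom.com/");
Console.WriteLine(a.GetAttributeValue("href", ""));          // http://www.cplemom.com/

// Enumerate every attribute as a name/value pair.
foreach (HtmlAttribute attr in a.Attributes)
{
    Console.WriteLine($"{attr.Name} = {attr.Value}");
}
```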

4. Delete / replace nodes

Continuing with the HTML above, suppose we have again selected the a tag node.
To drop content we don't need, we just call the node's Remove method.

var node = doc.DocumentNode.SelectSingleNode("//div[@id='title']/p/a");

node.Remove(); // Delete the node

A common scenario is that we need to remove the a tag but keep its text in the surrounding HTML.
PS: the text inside the a tag is itself a node of type text in the HtmlDocument. So we can reach our goal by deleting the a tag while keeping its text node.

node.ParentNode.RemoveChild(node,true);
Passing true keeps the child nodes of the a tag and removes only the a tag itself; here that means the "Fu Xiaohui" text node is retained. Passing false deletes the node together with all of its child nodes.
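
A self-contained sketch of this call, assuming the HtmlAgilityPack NuGet package and a one-line version of the sample markup:

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<p><a href=\"https://www.cnblogs.com/cplemom/\">Fu Xiaohui</a></p>");

HtmlNode p = doc.DocumentNode.SelectSingleNode("//p");
HtmlNode a = p.SelectSingleNode("./a");

// keepGrandChildren = true: drop the <a> element but re-attach its children to <p>.
p.RemoveChild(a, true);

Console.WriteLine(p.SelectSingleNode("./a") == null);   // True: the link element is gone
Console.WriteLine(p.InnerText);                         // the link text is still under <p>
```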

Looking at it another way, the current node is a single a tag. What if there are several a tags under the p tag, or the node is the p tag itself? We could select all the a tags and process them in a loop, but is there a better way?

One idea is to read the node's full text content, create a new text node from it, and replace the current node with that.

node.ParentNode.ReplaceChild(HtmlNode.CreateNode(node.InnerText), node);
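
The following sketch applies that idea to a p tag holding two links. It assumes the HtmlAgilityPack NuGet package; the markup and link targets are invented for illustration. Note that HtmlNode.CreateNode parses its argument as HTML, so this is only a reasonable shortcut when the text is non-empty plain text.

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<div id=\"title\"><p><a href=\"/one\">One</a> and <a href=\"/two\">Two</a></p></div>");

HtmlNode p = doc.DocumentNode.SelectSingleNode("//div[@id='title']/p");
HtmlNode div = p.ParentNode;

// Build a text node from the combined text of <p> and swap it in for <p>,
// which removes the <p> and both <a> elements in one step.
HtmlNode textNode = HtmlNode.CreateNode(p.InnerText);
div.ReplaceChild(textNode, p);

Console.WriteLine(div.SelectNodes(".//a") == null);   // True: no links remain
Console.WriteLine(div.InnerHtml);                     // only the text is left inside the div
```

The trade-off is that the replaced node disappears too, so this suits cases where only the text matters; to keep the p tag and strip just the links, the loop with RemoveChild(node, true) is the closer fit.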

Common usage scenarios and solutions

1. Get all img tags
// Get the img tags among all child and descendant nodes through Descendants
var imgsByName = node.Descendants("img");

// Get the same img tags through XPath (".//img" searches below node; "//img" would search the whole document)
var imgsByXPath = node.SelectNodes(".//img");
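
A typical follow-up is to collect the image addresses. The sketch below assumes the HtmlAgilityPack NuGet package and uses invented markup, including one img without a src to show the default value at work.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(
    "<div>" +
      "<img src='a.png'>" +
      "<span><img src='b.png' alt='b'></span>" +
      "<img alt='no source'>" +
    "</div>");

// Descendants returns a (possibly empty) sequence, so it chains with LINQ directly.
var sources = doc.DocumentNode
    .Descendants("img")
    .Select(img => img.GetAttributeValue("src", ""))
    .Where(src => src.Length > 0)
    .ToList();

Console.WriteLine(string.Join(", ", sources));   // a.png, b.png
```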

2. Sending authentication information such as cookies when loading from a URL

Some pages, such as a user center or an order list, can only be accessed with authentication information, so a request made directly through the HtmlWeb class will be rejected. A simple approach is to request the HTML with HttpClient and then load it into an HtmlDocument. Alternatively, because HtmlWeb wraps HttpWebRequest for its network requests, it exposes a delegate that lets you modify the request before it is sent.

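var web = new HtmlWeb();
web.PreRequest = new HtmlWeb.PreRequestHandler(GetRequest);
var doc = web.Load("https://www.cplemom.com/");

// Called before the request is sent; returning true lets the request proceed
public static bool GetRequest(HttpWebRequest req)
{
    // Host is a restricted header on .NET Framework, so set it through the property instead of Headers.Add
    req.Host = "www.cplemom.com";
    req.Headers.Add("Cookie", "xxxxxxxxxxxxx");
    return true;
}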
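
The HttpClient route mentioned above could look roughly like the sketch below. The helper names BuildRequest and LoadWithCookieAsync are hypothetical, the URL and cookie value are the article's placeholders, and the HtmlAgilityPack NuGet package is assumed. Setting UseCookies to false is a design choice of this sketch so that the hand-written Cookie header is sent as is.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Hypothetical helper: build a GET request that carries the cookie explicitly.
static HttpRequestMessage BuildRequest(string url, string cookie)
{
    var request = new HttpRequestMessage(HttpMethod.Get, url);
    request.Headers.Add("Cookie", cookie);
    return request;
}

// Hypothetical helper: fetch the page with the cookie and parse the response body.
static async Task<HtmlDocument> LoadWithCookieAsync(string url, string cookie)
{
    // UseCookies = false keeps the handler's cookie container out of the way
    // of the Cookie header set by hand.
    using var handler = new HttpClientHandler { UseCookies = false };
    using var client = new HttpClient(handler);
    using HttpResponseMessage response = await client.SendAsync(BuildRequest(url, cookie));
    response.EnsureSuccessStatusCode();

    var doc = new HtmlDocument();
    doc.LoadHtml(await response.Content.ReadAsStringAsync());
    return doc;
}

// Building the request needs no network access, so it can be inspected directly.
HttpRequestMessage sample = BuildRequest("https://www.cplemom.com/", "xxxxxxxxxxxxx");
Console.WriteLine(sample.Headers.GetValues("Cookie").First());

// Usage (performs a real request):
// HtmlDocument page = await LoadWithCookieAsync("https://www.cplemom.com/", "xxxxxxxxxxxxx");
```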

Summary
In my experience, the methods above cover more than 90% of HTML parsing needs. For more convenient and faster approaches, see the API documentation on the official website.


Topics: C#