I am reading multiple html files into a dataframe in Spark, and I am converting elements of the html into columns of the dataframe with a custom udf:

val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("biz_name", parseDocValue(".biz-page-title")('filecontent))
  .withColumn("biz_website", parseDocValue(".biz-website a")('filecontent))
  ...

def parseDocValue(cssSelectorQuery: String) =
  udf((html: String) => Jsoup.parse(html).select(cssSelectorQuery).text())

This works perfectly, but every withColumn call parses the html string again, which is redundant.

Is there a way (without using lookup tables or similar) to create one parsed document (Jsoup.parse(html)) per row from "filecontent" and make it available to all the withColumn calls on the dataframe?

Or shouldn't I bother trying to use DataFrames and just use RDDs instead?
Answer 0 (score: 0)
I would probably rewrite it as follows, doing the parsing and all the selects in one pass and putting the results in a temporary column:
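A minimal sketch of that approach, assuming a parseDocValues variant that takes all the selector queries at once and returns their results as an array (the names and the temp column are illustrative, and spark.implicits._ is assumed to be in scope as in the question):

import org.apache.spark.sql.functions.udf
import org.jsoup.Jsoup

// Parse the html once per row and run every selector against the same document.
def parseDocValues(cssSelectorQueries: Seq[String]) =
  udf((html: String) => {
    val doc = Jsoup.parse(html) // single parse per row
    cssSelectorQueries.map(q => doc.select(q).text())
  })

val selectors = Seq(".biz-page-title", ".biz-website a")

val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath)
  .toDF("filepath", "filecontent")
  .withColumn("temp", parseDocValues(selectors)('filecontent))
  .withColumn("biz_name", $"temp".getItem(0))
  .withColumn("biz_website", $"temp".getItem(1))
  .drop("temp")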
Answer 1 (score: 0)
So the final answer was actually quite simple: just map over the rows and create the objects there.
// Pick a value out of the already-parsed document: an attribute if one is
// requested, otherwise the element text. Empty results become None.
def docValue(cssSelectorQuery: String, attr: Option[String] = None)(implicit document: Document): Option[String] = {
  val domObject = document.select(cssSelectorQuery)
  val domValue = attr match {
    case Some(a) => domObject.attr(a)
    case None    => domObject.text()
  }
  domValue match {
    case x if x == null || x.isEmpty => None
    case y                           => Some(y)
  }
}
val dataset = spark
  .sparkContext
  .wholeTextFiles(inputPath, minPartitions = 265)
  .map {
    case (filepath, filecontent) => {
      // Parse each file once; every docValue call below reuses the implicit document.
      implicit val document = Jsoup.parse(filecontent)
      val customDataJson = docJson(filecontent, customJsonRegex)

      DataEntry(
        biz_name = docValue(".biz-page-title"),
        biz_website = docValue(".biz-website a"),
        url = docValue("meta[property=og:url]", attr = Some("content")),
        ...
        filename = Some(fileName(filepath)),
        fileTimestamp = Some(fileTimestamp(filepath))
      )
    }
  }
  .toDS()
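DataEntry, docJson, fileName and fileTimestamp are not defined in the snippet; a rough sketch of what the case class and the two path helpers could look like (field types and bodies here are assumptions) is:

// Hypothetical shapes for the helpers used in the snippet above.
case class DataEntry(
  biz_name: Option[String],
  biz_website: Option[String],
  url: Option[String],
  // ... the remaining extracted columns ...
  filename: Option[String],
  fileTimestamp: Option[String]
)

// e.g. last path segment as the file name
def fileName(filepath: String): String =
  filepath.split('/').last

// e.g. a placeholder timestamp recorded when the file is processed
def fileTimestamp(filepath: String): String =
  java.time.Instant.now().toString

For .toDS() to work, the case class has to be defined outside the method that builds the dataset and spark.implicits._ has to be imported.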